
BridgeVoC: Neural Vocoder with Schrodinger Bridge
Diffusion-based neural vocoders neglect the linear degradation of the mel-spectrogram.
BridgeVoC
  Connects a time-frequency domain-based neural vocoder with the Schrodinger Bridge.
  Projects the mel-spectrogram onto the target linear-scale domain and treats it as a degraded spectral representation.
Paper (IJCAI 2025): Paper Link
1. Introduction
A neural vocoder generates a high-quality waveform from acoustic features..

PALLE: Pseudo-Autoregressive Neural Codec Language Models for Efficient Zero-Shot Text-to-Speech Synthesis
In zero-shot text-to-speech, autoregressive models are limited by generation speed, while non-autoregressive models are limited in temporal modeling.
PALLE
  Introduces a pseudo-autoregressive approach that combines the explicit temporal modeling of autoregressive models with the parallel generation of non-autoregressive models.
  Based on a two-stage framework, in the first stage ..

RNDVoC: Learning Neural Vocoder from Range-Null Space Decomposition
Neural vocoders face a trade-off between parameter count and performance.
RNDVoC
  Bridges Range-Null Decomposition and the vocoder task, decomposing target spectrogram reconstruction into a superimposition of a range-space component and a null-space component.
  Additionally builds a dual-path framework that uses cross-/narrow-band modules for sub-band and sequential modeling.
Paper (IJCAI 2025): Paper Link
1. Introduct..
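
The range-null decomposition named in the RNDVoC summary can be illustrated with a small numerical sketch. This is not the paper's implementation: it assumes the degradation operator A is a mel filterbank (replaced here by a random matrix), and the function name range_null_split and all dimensions are hypothetical. It only shows the generic linear-algebra identity X = A⁺Y + (I − A⁺A)X, where Y = AX is the observed mel-spectrogram.

```python
import numpy as np


def range_null_split(A: np.ndarray, X: np.ndarray):
    """Split X into a range-space part and a null-space part with respect to A.

    A: (n_mel, n_freq) degradation matrix (assumed to be a mel filterbank).
    X: (n_freq, n_frames) linear-scale spectrogram.
    """
    A_pinv = np.linalg.pinv(A)      # Moore-Penrose pseudo-inverse of A
    Y = A @ X                       # observed (degraded) mel-spectrogram
    range_part = A_pinv @ Y         # recoverable directly from Y
    null_part = X - range_part      # equals (I - A^+ A) X, invisible to A
    return Y, range_part, null_part


# Toy dimensions and random data; purely illustrative assumptions.
rng = np.random.default_rng(0)
n_freq, n_mel, n_frames = 16, 6, 4
A = rng.random((n_mel, n_freq))     # stand-in for a real mel filterbank
X = rng.random((n_freq, n_frames))  # stand-in for a target spectrogram

Y, range_part, null_part = range_null_split(A, X)

# The two parts superimpose back to the target.
reconstruction_ok = np.allclose(range_part + null_part, X)
# The null-space part contributes nothing to the mel observation.
null_is_invisible = np.allclose(A @ null_part, 0.0)
# The range-space part alone already explains the observation.
range_is_consistent = np.allclose(A @ range_part, Y)
```

In this view, the range-space part is fixed by the mel-spectrogram, so a learned model would only need to supply the null-space detail that the mel projection discards.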

FELLE: Autoregressive Speech Synthesis with Token-wise Coarse-to-Fine Flow Matching
Language modeling and flow matching can be integrated.
FELLE
  Predicts continuous-valued tokens by building on the autoregressive nature of language models and the generative efficacy of flow matching.
  Additionally improves speech quality through a coarse-to-fine flow matching mechanism.
Paper (MM 2025): Paper Link
1. Introduction
Large Language Models such as VALL-E and VALL-E2 ..

PAST: Phonetic-Acoustic Speech Tokenizer
Signal reconstruction and phonetic information can be modeled jointly.
PAST
  Integrates domain knowledge into the tokenization process through auxiliary tasks, using supervised phonetic data instead of a pre-trained self-supervised model.
  Additionally builds a streamable architecture for real-time applications.
Paper (INTERSPEECH 2025): Paper Link
1. Introduction
Speech language models, in general, acoustic toke..

REWIND: Speech Time Reversal for Enhancing Speaker Representations in Diffusion-based Voice Conversion
Time-reversed speech retains tonal patterns that are useful for speaker identification.
REWIND
  Introduces an augmentation strategy that uses speaker representations learned from time-reversed speech.
  Applies it to diffusion-based voice conversion models to preserve the speaker's unique vocal traits while minimizing interference from linguistic content.
Paper (INTERSP..