
Factorized-VITS: Decoding Prosody and Text in End-to-End Speech Synthesis without External or Secondary Aligner
Incorporating explicit text-side prosody modeling can improve end-to-end text-to-speech performance.
Factorized-VITS
- Cleanly factorizes the audio prior hidden space into text and prosody subspaces
- Performs on-the-fly alignment in the factorized text subspace without extra parameters
Paper (ICASSP 2025): Paper Link
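A minimal sketch of the two ideas above, assuming the factorization is a simple channel split of the prior hidden space and the parameter-free alignment is a Viterbi-style monotonic search on the text subspace; all names and dimensions are illustrative, not the paper's implementation.

```python
import torch

def factorize_hidden(h, text_dim):
    """Split the prior hidden space into a text subspace and a prosody
    subspace by channel slicing (no extra parameters)."""
    return h[..., :text_dim], h[..., text_dim:]

def monotonic_align(h_text, text_emb):
    """Parameter-free monotonic alignment on the text subspace:
    score = -||frame feature - phoneme embedding||, then Viterbi-style DP."""
    log_p = -torch.cdist(h_text, text_emb)              # [T_frames, N_tokens]
    T, N = log_p.shape
    dp = torch.full((T, N), float("-inf"))
    dp[0, 0] = log_p[0, 0]
    for t in range(1, T):
        prev = dp[t - 1]
        shifted = torch.cat([prev.new_full((1,), float("-inf")), prev[:-1]])
        dp[t] = torch.maximum(prev, shifted) + log_p[t]  # stay or advance one token
    # backtrack the hard alignment path (one phoneme index per frame)
    path = torch.zeros(T, dtype=torch.long)
    path[-1] = N - 1
    for t in range(T - 2, -1, -1):
        j = path[t + 1]
        path[t] = j if j == 0 or dp[t, j] >= dp[t, j - 1] else j - 1
    return path

h = torch.randn(1, 80, 192)                              # [batch, frames, hidden]
h_text, h_prosody = factorize_hidden(h, text_dim=96)
phonemes = torch.randn(12, 96)                           # [tokens, text_dim]
print(monotonic_align(h_text[0], phonemes).shape)        # torch.Size([80])
```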

VALL-T: Decoder-Only Generative Transducer for Robust and Decoding-Controllable Text-to-Speech
Decoder-only text-to-speech models lack a monotonic alignment constraint, which leads to mispronunciation, word skipping, and repetition.
VALL-T
- Keeps the decoder-only Transformer while introducing a relative position embedding over the input phoneme sequence
- Explicitly indicates the monotonic generation process, improving robustness for zero-shot text-to-speech
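A rough sketch of the relative-position idea described above, assuming phoneme positions are measured from a current alignment pointer that advances monotonically during generation (transducer-style); the class name, sizes, and pointer-update rule are assumptions for illustration.

```python
import torch
import torch.nn as nn

class ShiftingRelativePosition(nn.Module):
    """Relative position embedding over the phoneme sequence: positions are
    measured from the current alignment pointer, so advancing the pointer
    explicitly encodes monotonic progress through the text."""
    def __init__(self, dim, max_rel=64):
        super().__init__()
        self.max_rel = max_rel
        self.emb = nn.Embedding(2 * max_rel + 1, dim)

    def forward(self, phoneme_emb, pointer):
        # phoneme_emb: [N_phonemes, dim]; pointer: index of the currently aligned phoneme
        n = phoneme_emb.size(0)
        rel = torch.arange(n) - pointer                   # negative = already consumed
        rel = rel.clamp(-self.max_rel, self.max_rel) + self.max_rel
        return phoneme_emb + self.emb(rel)

phonemes = torch.randn(20, 256)
rel_pos = ShiftingRelativePosition(dim=256)
# during decoding the pointer would advance monotonically (e.g., on a special token)
conditioned = rel_pos(phonemes, pointer=5)
print(conditioned.shape)                                  # torch.Size([20, 256])
```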

WavTokenizer: An Efficient Acoustic Discrete Codec Tokenizer for Audio Language Modeling
Language models rely on tokenizers that compress high-dimensional natural signals into lower-dimensional discrete tokens.
WavTokenizer
- Compresses the quantizer layers and the temporal dimension of the discrete codec
- Achieves better reconstruction quality and richer semantic information through a broader VQ space, an extended contextual window, and an inverse Fourier transform structure
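A minimal sketch of the two components named above, assuming the quantizer compression amounts to a single VQ layer with a broad codebook and the decoder ends in an inverse-Fourier-transform head that predicts magnitude and phase; module names, codebook size, and FFT settings are illustrative.

```python
import torch
import torch.nn as nn

class SingleVQ(nn.Module):
    """One quantizer with a broad codebook instead of a deep RVQ stack."""
    def __init__(self, dim=512, codebook_size=4096):
        super().__init__()
        self.codebook = nn.Embedding(codebook_size, dim)

    def forward(self, z):                          # z: [batch, frames, dim]
        flat = z.reshape(-1, z.size(-1))
        idx = torch.cdist(flat, self.codebook.weight).argmin(-1)
        idx = idx.view(z.shape[:-1])               # one discrete token per frame
        return self.codebook(idx), idx

class ISTFTHead(nn.Module):
    """Inverse-Fourier-transform decoder head: predict magnitude and phase
    per frame, then reconstruct the waveform with torch.istft."""
    def __init__(self, dim=512, n_fft=1024, hop=256):
        super().__init__()
        self.n_fft, self.hop = n_fft, hop
        self.proj = nn.Linear(dim, n_fft + 2)      # magnitude + phase bins

    def forward(self, x):                          # x: [batch, frames, dim]
        mag, phase = self.proj(x).chunk(2, dim=-1)
        spec = torch.exp(mag) * torch.exp(1j * phase)   # complex spectrum
        spec = spec.transpose(1, 2)                # [batch, freq, frames]
        return torch.istft(spec, self.n_fft, hop_length=self.hop,
                           window=torch.hann_window(self.n_fft))

z = torch.randn(1, 75, 512)                        # ~1 s at 75 frames (illustrative)
quantized, tokens = SingleVQ()(z)
wave = ISTFTHead()(quantized)
print(tokens.shape, wave.shape)
```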

SpeechTokenizer: Unified Speech Tokenizer for Speech Language Models
Speech language models are built on discrete speech representations such as semantic and acoustic tokens.
SpeechTokenizer
- Introduces SLMTokBench to evaluate whether speech tokens are suitable for speech language models
- Builds a unified speech tokenizer by adopting an encoder-decoder architecture based on Residual Vector Quantization
Paper (ICLR 2024): Paper Link
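For reference, a compact sketch of Residual Vector Quantization, the quantizer family the entry above names: each layer quantizes the residual left by the previous layers; the layer count, dimensions, and codebook size are illustrative, not SpeechTokenizer's configuration.

```python
import torch
import torch.nn as nn

class ResidualVQ(nn.Module):
    """Residual Vector Quantization: layer k quantizes the residual left by
    layers 1..k-1, so early layers carry the coarse part of the signal and
    later layers refine the details."""
    def __init__(self, n_layers=8, dim=256, codebook_size=1024):
        super().__init__()
        self.books = nn.ModuleList(nn.Embedding(codebook_size, dim)
                                   for _ in range(n_layers))

    def forward(self, z):                           # z: [frames, dim]
        residual, quantized, codes = z, torch.zeros_like(z), []
        for book in self.books:
            idx = torch.cdist(residual, book.weight).argmin(-1)
            q = book(idx)
            quantized = quantized + q               # running reconstruction
            residual = residual - q                 # pass the leftover to the next layer
            codes.append(idx)
        return quantized, torch.stack(codes)        # [frames, dim], [n_layers, frames]

z = torch.randn(100, 256)                           # encoder output for one utterance
recon, codes = ResidualVQ()(z)
print(recon.shape, codes.shape)                     # [100, 256], [8, 100]
```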

AdaptVC: High Quality Voice Conversion with Adaptive Learning
Voice conversion requires extracting disentangled linguistic content from the source and voice style from the reference.
AdaptVC
- Uses adapters to tune self-supervised speech features, effectively disentangling content and speaker
- Improves synthesis quality with cross-attention speaker conditioning and conditional flow matching
Paper (ICASSP 2025): Paper Link
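A minimal sketch of the adapter and cross-attention conditioning mentioned above, assuming a standard bottleneck adapter on frozen SSL features and content frames attending over reference-speaker frames; module names and sizes are illustrative, and the flow-matching decoder is omitted.

```python
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Small adapter on top of frozen SSL features: down-project, nonlinearity,
    up-project with a residual, used to specialize features for content or speaker."""
    def __init__(self, dim=768, bottleneck=64):
        super().__init__()
        self.down, self.up = nn.Linear(dim, bottleneck), nn.Linear(bottleneck, dim)

    def forward(self, x):
        return x + self.up(torch.relu(self.down(x)))

class CrossAttentionConditioning(nn.Module):
    """Content frames (queries) attend over reference-speaker frames
    (keys/values) to inject voice style frame by frame."""
    def __init__(self, dim=768, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, content, speaker):
        styled, _ = self.attn(query=content, key=speaker, value=speaker)
        return content + styled

src_ssl = torch.randn(1, 120, 768)                  # source utterance SSL features
ref_ssl = torch.randn(1, 200, 768)                  # reference utterance SSL features
content = BottleneckAdapter()(src_ssl)
speaker = BottleneckAdapter()(ref_ssl)
decoder_in = CrossAttentionConditioning()(content, speaker)
print(decoder_in.shape)                             # torch.Size([1, 120, 768])
```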

FitHuBERT: Going Thinner and Deeper for Knowledge Distillation of Speech Self-Supervised LearningSelf-supervised learning is limited by its computational cost.
FitHuBERT
- Uses a Time-Reduction layer to improve inference time
- Prevents performance degradation through Hint-based Distillation
Paper (INTERSPEECH 2022): Paper Link
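A small sketch of the two mechanisms listed above, assuming the Time-Reduction layer concatenates adjacent frames before a projection and the hint loss is an L2 match between a projected student layer and a teacher layer; the reduction factor, widths, and rate-matching pooling are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TimeReduction(nn.Module):
    """Time-reduction layer: halve the frame rate by concatenating every two
    adjacent frames and projecting back, cutting downstream compute."""
    def __init__(self, dim, factor=2):
        super().__init__()
        self.factor = factor
        self.proj = nn.Linear(dim * factor, dim)

    def forward(self, x):                            # x: [batch, frames, dim]
        b, t, d = x.shape
        t = t - t % self.factor                      # drop the ragged tail
        x = x[:, :t].reshape(b, t // self.factor, d * self.factor)
        return self.proj(x)

def hint_loss(student_hidden, teacher_hidden, proj):
    """Hint-based distillation: project the thin student layer up to the
    teacher width and match the teacher layer with an L2 loss."""
    return F.mse_loss(proj(student_hidden), teacher_hidden)

student = torch.randn(2, 100, 384)                   # thin-and-deep student features
teacher = torch.randn(2, 100, 768)                   # teacher (e.g., HuBERT) features
student_red = TimeReduction(dim=384)(student)        # [2, 50, 384]
teacher_red = F.avg_pool1d(teacher.transpose(1, 2), 2).transpose(1, 2)  # match frame rate
loss = hint_loss(student_red, teacher_red, nn.Linear(384, 768))
print(student_red.shape, loss.item())
```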