Evidential-TTS: High Fidelity Zero-Shot Text-to-Speech Using Evidential Deep Learning
Evidential Deep Learning can be leveraged for zero-shot text-to-speech.
Evidential-TTS
- Uses Iterative Parallel Decoding to convert the aligned phoneme sequence into acoustic tokens
- Introduces model uncertainty based on Evidential Deep Learning optimization, providing a reliable sampling path for high-quality speech generation
Paper (ICASSP 2025): Pape..
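As a rough illustration of the uncertainty signal such a model could exploit, below is a minimal NumPy sketch of Dirichlet-based evidential uncertainty for a single K-way prediction. The function name and the ReLU-evidence choice are illustrative assumptions, not Evidential-TTS's actual implementation:

```python
import numpy as np

def evidential_uncertainty(logits: np.ndarray) -> float:
    """Dirichlet-based uncertainty for one K-way prediction.

    evidence = ReLU(logits); alpha = evidence + 1.
    Total uncertainty u = K / sum(alpha): u -> 1 when there is no
    evidence at all, u -> 0 as evidence accumulates.
    """
    evidence = np.maximum(logits, 0.0)   # non-negative evidence
    alpha = evidence + 1.0               # Dirichlet concentration parameters
    K = logits.shape[-1]
    return float(K / alpha.sum())

# No positive evidence -> maximal uncertainty
print(evidential_uncertainty(np.array([-1.0, -2.0, -0.5])))  # → 1.0

# Strong evidence for one class -> low uncertainty
print(evidential_uncertainty(np.array([9.0, 0.0, 0.0])))     # → 0.25 (3 / 12)
```

A sampler could then prefer decoding positions with low `u` first, which is the intuition behind using uncertainty to pick a reliable sampling path.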
LEF-TTS: Lightweight and Efficient End-to-End Text-to-Speech Synthesis with Multi-Stream Generator
Demand for lightweight, efficient text-to-speech models has recently been growing.
LEF-TTS
- Builds on EfficientTTS2 and applies Single Head Fast Linear Attention
- Introduces ConvWaveNet and a multi-stream iSTFT generator to improve inference speed
Paper (ICASSP 2025): Paper Link
1. Introduction
Compared with two-stage TTS models such as FastSpeech and FastSpeech2, end-to-end models such as VITS..
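To give a flavour of how a multi-stream iSTFT generator assembles a waveform, here is a minimal sketch that inverts several small complex spectrograms and sums them. A real model would predict per-stream magnitude and phase with a network and combine the sub-band streams through a learned synthesis filter; `multistream_istft` and all sizes below are illustrative assumptions:

```python
import numpy as np
from scipy.signal import istft

def multistream_istft(mags, phases, n_fft=16, hop=8):
    """Invert each stream's (magnitude, phase) spectrogram with the
    iSTFT and sum the resulting waveforms (a stand-in for the learned
    synthesis filter that would combine sub-band streams)."""
    out = None
    for mag, phase in zip(mags, phases):
        spec = mag * np.exp(1j * phase)  # complex spectrogram (freq, time)
        _, x = istft(spec, nperseg=n_fft, noverlap=n_fft - hop)
        out = x if out is None else out + x
    return out

# Two toy streams, 9 one-sided frequency bins (n_fft // 2 + 1), 20 frames
mags = [np.full((9, 20), 0.1) for _ in range(2)]
phases = [np.zeros((9, 20)) for _ in range(2)]
wave = multistream_istft(mags, phases)
```

The speed argument is that each stream is inverted at a coarse resolution, so the expensive upsampling stages of a conventional neural vocoder are replaced by cheap iSTFT calls.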
Stable-TTS: Stable Speaker-Adaptive Text-to-Speech Synthesis via Prosody Prompting
Speaker-adaptive text-to-speech models are sensitive to the target speech samples.
Stable-TTS
- Leverages the prosody of prior samples, a subset of the high-quality pre-training dataset, to effectively capture the target speaker's timbre
- Applies a prior-preservation loss during fine-tuning to prevent overfitting to the target samples
Paper (ICASSP 2025): Paper Link
1. Introduction
YourTTS, VA..
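The prior-preservation idea reduces to a two-term fine-tuning objective: fit the target speaker while penalising drift from the pre-trained behaviour on prior samples. A toy version with MSE losses follows; the exact losses and weighting in Stable-TTS may differ:

```python
import numpy as np

def mse(pred: np.ndarray, ref: np.ndarray) -> float:
    return float(np.mean((pred - ref) ** 2))

def prior_preservation_loss(pred_t, ref_t, pred_p, ref_p, lam=1.0):
    """L = L_target + lam * L_prior.

    The first term adapts the model to the target speaker's samples;
    the second keeps its outputs on prior samples close to what the
    pre-trained model produced, discouraging overfitting."""
    return mse(pred_t, ref_t) + lam * mse(pred_p, ref_p)

# Target term contributes 1.0; prior term is 0 (no drift yet)
loss = prior_preservation_loss(np.array([1.0, 1.0]), np.zeros(2),
                               np.ones(2), np.ones(2), lam=0.5)
print(loss)  # → 1.0
```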
DetailTTS: Learning Residual Detail Information for Zero-Shot Text-to-Speech
Existing text-to-speech systems often omit linguistic and acoustic detail.
DetailTTS
- A zero-shot text-to-speech model based on a Conditional Variational AutoEncoder
- Introduces a Prior Detail module and a Duration Detail module that capture residual detail information missed during alignment
Paper (ICASSP 2025): Paper Link
1. Introduction
Zero-shot Te..
UMETTS: A Unified Framework for Emotional Text-to-Speech Synthesis with Multimodal Prompts
Emotional Text-to-Speech (TTS) relies on oversimplified emotion labels or single-modality inputs, so it fails to capture human emotion effectively.
UMETTS
- Uses an Emotion Prompt Alignment module and an Emotion Embedding-Induced TTS module to incorporate emotional cues from multiple modalities
- The Emotion Prompt Alignment module applies contrastive learning to align text, audi..
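Contrastive alignment of paired embeddings across modalities is typically an InfoNCE-style objective: matched (text, audio) pairs are pulled together, mismatched pairs pushed apart. A minimal symmetric-InfoNCE NumPy sketch, illustrative only and not UMETTS's exact formulation:

```python
import numpy as np

def info_nce(text_emb: np.ndarray, audio_emb: np.ndarray, tau: float = 0.1):
    """Symmetric InfoNCE over a batch of paired emotion embeddings.

    Row i of each matrix is one sample; (text_emb[i], audio_emb[i])
    is the positive pair, every other row is a negative."""
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    logits = t @ a.T / tau                 # temperature-scaled cosine sims
    labels = np.arange(len(t))

    def xent(l):
        # cross-entropy with the diagonal as the correct class
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    # average the text->audio and audio->text directions
    return (xent(logits) + xent(logits.T)) / 2

# Perfectly aligned pairs score lower than shuffled ones
t = np.eye(3)
print(info_nce(t, t) < info_nce(t, np.roll(t, 1, axis=0)))  # → True
```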
VoiceGuider: Enhancing Out-of-Domain Performance in Parameter-Efficient Speaker-Adaptive Text-to-Speech via Autoguidance
When parameter-efficient fine-tuning is applied to a speaker-adaptive text-to-speech model, adaptation performance on out-of-domain speakers is limited.
VoiceGuider
- A speaker-adaptive text-to-speech model reinforced with Autoguidance
- Employs an Autoguidance strengthening strategy for robustness on out-of-domain data..
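The core autoguidance update can be written in one line: extrapolate the main model's prediction away from that of a deliberately degraded ("bad") version of itself. The function below is a schematic with `w` as the guidance weight, not VoiceGuider's full strengthening strategy:

```python
import numpy as np

def autoguided(eps_good: np.ndarray, eps_bad: np.ndarray, w: float = 2.0):
    """Autoguidance extrapolation: bad + w * (good - bad).

    w = 1 recovers the main model's prediction unchanged; w > 1 pushes
    the output further away from the degraded model's errors, which is
    the mechanism used to sharpen generation quality."""
    return eps_bad + w * (eps_good - eps_bad)

print(autoguided(np.array([1.0]), np.array([0.0]), w=1.0))  # → [1.]
print(autoguided(np.array([1.0]), np.array([0.0]), w=2.0))  # → [2.]
```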
