
DetailTTS: Learning Residual Detail Information for Zero-Shot Text-to-Speech
Existing text-to-speech systems often omit linguistic and acoustic details.
DetailTTS
A zero-shot text-to-speech model based on a Conditional Variational AutoEncoder.
Introduces a Prior Detail module and a Duration Detail module to capture the residual detail information missed during alignment.
Paper (ICASSP 2025): Paper Link
1. Introduction
Zero-shot Te..

UMETTS: A Unified Framework for Emotional Text-to-Speech Synthesis with Multimodal Prompts
Emotional Text-to-Speech (TTS) relies on oversimplified emotional labels or single-modality inputs, so it fails to reflect human emotion effectively.
UMETTS
Uses an Emotion Prompt Alignment module and an Emotion Embedding-Induced TTS module to reflect emotional cues from multiple modalities.
The Emotion Prompt Alignment module applies contrastive learning to text, audi..

VoiceGuider: Enhancing Out-of-Domain Performance in Parameter-Efficient Speaker-Adaptive Text-to-Speech via Autoguidance
When parameter-efficient fine-tuning is applied to a speaker-adaptive text-to-speech model, adaptation performance on out-of-domain speakers is limited.
VoiceGuider
A speaker-adaptive text-to-speech model reinforced with autoguidance.
Through an autoguidance strengthening strategy, out-of-domain data robus..

NanoVoice: Efficient Speaker-Adaptive Text-to-Speech for Multiple Speakers
Adapters for multiple speakers can be used to build a personalized text-to-speech model.
NanoVoice
Uses batch-wise speaker adaptation, which fine-tunes multiple references in parallel.
To further reduce the number of speaker adaptation parameters, introduces parameter sharing and incorporates a trainable scale matrix.
Paper (ICASSP 2025): Paper Link
1. Introduction
VALL-E, V..

SlimSpeech: Lightweight and Efficient Text-to-Speech with Slim Rectified Flow
Flow matching-based speech synthesis models can improve speech quality while reducing the number of inference steps.
SlimSpeech
Reduces the number of parameters of a rectified flow-based model and uses it as the teacher model.
Refines the reflow operation to directly derive a smaller model with a straight sampling trajectory, and improves performance through a distillation method.
Paper (ICASSP 2025): Paper Link
1. Int..

Emo-DPO: Controllable Emotional Speech Synthesis through Direct Preference Optimization
Emotional Text-to-Speech mainly uses supervised training to convert text and a desired emotion into emotional speech.
- BUT, because it learns only the correct emotional output, it fails to capture the nuances between emotions.
Emo-DPO
Uses Direct Preference Optimization, which differentiates emotional nuances by optimizing toward the preferred emotion.
Emotion-aware Large Languag..