
RapFlow-TTS: Rapid and High-Fidelity Text-to-Speech with Improved Consistency Flow Matching
- Ordinary Differential Equation-based Text-to-Speech has a trade-off between quality and inference speed
- RapFlow-TTS enforces consistency of the velocity field along the Flow Matching-straightened Ordinary Differential Equation trajectory for consistent quality
- To improve few-step synthesis quality, it applies time interval scheduling, adversa…
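
A minimal sketch of the velocity-consistency idea this summary refers to, assuming a PyTorch velocity network `v_theta` with an EMA teacher `v_ema` and a simple two-term loss; the actual RapFlow-TTS objective (and its time interval scheduling) may differ:

```python
import torch
import torch.nn.functional as F

def consistency_fm_loss(v_theta, v_ema, x0, x1, delta=0.05):
    """Flow matching on the straight path x_t = (1 - t) x0 + t x1,
    plus a term that keeps velocities at nearby times consistent.
    Assumes x0, x1 are (B, D) tensors and v(x, t) returns a tensor like x."""
    b = x0.shape[0]
    t = torch.rand(b, 1) * (1.0 - delta)              # t in [0, 1 - delta)
    xt = (1.0 - t) * x0 + t * x1                      # point on the trajectory
    xs = (1.0 - t - delta) * x0 + (t + delta) * x1    # slightly later point

    fm = F.mse_loss(v_theta(xt, t), x1 - x0)          # straight-line velocity target
    with torch.no_grad():                             # EMA teacher as target
        target = v_ema(xs, t + delta)
    consistency = F.mse_loss(v_theta(xt, t), target)  # nearby velocities agree
    return fm + consistency
```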

MPE-TTS: Customized Emotion Zero-Shot Text-to-Speech Using Multi-Modal Prompt
- Multi-modal prompts can be leveraged for zero-shot Text-to-Speech
- MPE-TTS introduces a Multi-Modal Prompt Emotion Encoder to extract emotion information from diverse prompts
- It additionally applies a prosody predictor and an emotion consistency loss
- Paper (INTERSPEECH 2025): Paper Link

1. Introduction — Zero-Shot Text-to-Speech (ZS-TTS) aims to generate speech in unseen styles. Speech-b…
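
As a rough illustration of the shared-space idea behind a multi-modal prompt emotion encoder, here is a toy sketch with hypothetical per-modality projections and a cosine emotion consistency loss; MPE-TTS's actual encoder and loss are not reproduced here:

```python
import torch.nn as nn
import torch.nn.functional as F

class MultiModalPromptEmotionEncoder(nn.Module):
    """Toy illustration: per-modality encoders projecting text, audio, or
    face prompt features into one shared emotion-embedding space."""
    def __init__(self, dims, emo_dim=256):
        super().__init__()
        # `dims` maps modality name -> raw feature size, e.g. {"text": 768}
        self.proj = nn.ModuleDict({m: nn.Linear(d, emo_dim) for m, d in dims.items()})

    def forward(self, modality, features):
        return F.normalize(self.proj[modality](features), dim=-1)

def emotion_consistency_loss(pred_emb, ref_emb):
    # Pull the emotion embedding of synthesized speech toward the prompt's.
    return 1.0 - F.cosine_similarity(pred_emb, ref_emb, dim=-1).mean()
```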

Spotlight-TTS: Spotlighting the Style via Voiced-Aware Style Extraction and Style Direction Adjustment for Expressive Text-to-Speech
- Expressive Text-to-Speech still has limitations
- Spotlight-TTS introduces Voiced-Aware Style Extraction, which maintains continuity across different speech regions
- It additionally adjusts the direction of the extracted style to improve speech quality
- Paper (INTERSPEECH 2025): Paper Link

1. Introduction — Text-to-Speech (TTS) generates speech from the input text…
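
One plausible reading of voiced-aware style extraction is to pool the style vector mainly from voiced frames (e.g., frames with nonzero F0); a minimal sketch with a hypothetical `style_encoder`, not the paper's exact mechanism:

```python
def voiced_aware_style(mel, voiced_mask, style_encoder):
    """Pool a style vector from voiced frames only.

    mel:         (B, T, n_mels) mel-spectrogram
    voiced_mask: (B, T) bool, True where the frame is voiced (e.g. from F0)
    """
    h = style_encoder(mel)                       # (B, T, D) frame-level features
    w = voiced_mask.float().unsqueeze(-1)        # zero out unvoiced frames
    return (h * w).sum(dim=1) / w.sum(dim=1).clamp(min=1.0)  # (B, D) style vector
```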

DiEmo-TTS: Disentangled Emotion Representations via Self-Supervised Distillation for Cross-Speaker Emotion Transfer in Text-to-Speech
- Existing emotional Text-to-Speech models cannot fully separate speaker and emotion characteristics
- DiEmo-TTS introduces emotion clustering using emotional attribute prediction and speaker embeddings
- It employs a dual conditioning Transformer that integrates style features
- Paper (INTERSPEECH 2025): Paper Link
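
A hedged sketch of the emotion-clustering step as summarized above: cluster predicted emotional attributes into pseudo emotion classes per speaker, so clusters are less tied to speaker identity (hypothetical shapes; how DiEmo-TTS actually combines attributes with speaker embeddings is not shown):

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_emotions(attributes, speaker_ids, n_emotions=5):
    """Cluster utterance-level emotional attributes (e.g. predicted
    arousal/valence) into pseudo emotion classes, separately per speaker.

    attributes:  (N, A) array from an emotional-attribute predictor
    speaker_ids: (N,) speaker labels (e.g. derived from speaker embeddings)
    """
    labels = np.zeros(len(attributes), dtype=int)
    for spk in np.unique(speaker_ids):
        idx = np.where(speaker_ids == spk)[0]
        km = KMeans(n_clusters=n_emotions, n_init=10, random_state=0)
        labels[idx] = km.fit_predict(attributes[idx])
    return labels  # (N,) pseudo emotion labels
```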

EmoMix: Emotion Mixing via Diffusion Models for Emotional Speech Synthesis
- Emotional Text-to-Speech still has limitations in intensity control
- EmoMix leverages a pre-trained Speech Emotion Recognition model to extract emotion embeddings
- It performs mixed emotion synthesis at run-time based on a diffusion model
- Paper (INTERSPEECH 2023): Paper Link

1. Introduction — Emotional Text-to-Speech (TTS) models such as GenerSpeech use reference-based style…
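
One simple way to realize run-time mixing, assuming interpolation of embeddings from a pre-trained SER model as the diffusion condition (the paper's exact mixing mechanism inside the reverse process may differ):

```python
import torch

def mixed_emotion_condition(ser_model, ref_a, ref_b, alpha=0.5):
    """Build a mixed emotion condition from two reference utterances by
    interpolating embeddings from a pre-trained SER model."""
    with torch.no_grad():
        e_a = ser_model(ref_a)   # (1, D) emotion embedding of reference A
        e_b = ser_model(ref_b)   # (1, D) emotion embedding of reference B
    return alpha * e_a + (1.0 - alpha) * e_b  # condition for the diffusion decoder
```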

OZSpeech: One-Step Zero-Shot Speech Synthesis with Learned-Prior-Conditioned Flow Matching
- Conventional speech representations such as waveforms and spectrograms overlook speech attributes and incur high computational cost
- OZSpeech reduces the number of sampling steps via one-step sampling with a learned prior as the condition
- It models speech attributes with disentangled, factorized components in token format
- Paper (ACL 2025): Paper Link
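
A minimal sketch of one-step sampling from a learned, condition-dependent prior, using hypothetical `prior_net` and `velocity_net` modules; OZSpeech's token-level factorization is not modeled here:

```python
import torch

def one_step_sample(prior_net, velocity_net, text_cond):
    """Single-NFE sampling: draw x0 from a learned prior instead of pure
    noise, then take one Euler step along the predicted velocity."""
    x0 = prior_net(text_cond)                     # learned prior sample, (B, D)
    t0 = torch.zeros(x0.shape[0], 1)              # start of the trajectory, t = 0
    return x0 + velocity_net(x0, t0, text_cond)   # x1 = x0 + 1 * v(x0, 0)
```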