ParaStyleTTS: Toward Efficient and Robust Paralinguistic Style Control for Expressive Text-to-Speech Generation
Speaking style control in text-to-speech systems remains limited.
ParaStyleTTS
Introduces a two-level style adaptation architecture that separates prosodic and paralinguistic speech style modeling.
Additionally maintains consistent style under low-resource deployment and across diverse prompt formulations.
Paper (CIKM 2025): Paper Link
1. Introduction …
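As a rough illustration of the two-level idea above, the sketch below applies an utterance-level paralinguistic style vector and a frame-level prosodic style vector at separate stages via FiLM-style modulation; the module, shapes, and the FiLM mechanism are assumptions for illustration, not ParaStyleTTS's actual architecture.

```python
# A minimal sketch of two-level style conditioning, assuming (not from the paper)
# that paralinguistic style modulates features per utterance and prosodic style
# modulates them per frame. All names and dimensions are hypothetical.
import torch
import torch.nn as nn

class TwoLevelStyleAdaptor(nn.Module):
    def __init__(self, hidden_dim: int, style_dim: int):
        super().__init__()
        # Utterance-level (paralinguistic) scale/shift predictor.
        self.para_film = nn.Linear(style_dim, 2 * hidden_dim)
        # Frame-level (prosodic) scale/shift predictor.
        self.pros_film = nn.Linear(style_dim, 2 * hidden_dim)
        self.norm = nn.LayerNorm(hidden_dim)

    def forward(self, h, para_style, pros_style):
        # h:          (batch, frames, hidden_dim) hidden states to be styled
        # para_style: (batch, style_dim)          one vector per utterance
        # pros_style: (batch, frames, style_dim)  one vector per frame
        h = self.norm(h)
        g_p, b_p = self.para_film(para_style).unsqueeze(1).chunk(2, dim=-1)
        g_r, b_r = self.pros_film(pros_style).chunk(2, dim=-1)
        h = (1 + g_p) * h + b_p      # coarse, utterance-wide paralinguistic shift
        h = (1 + g_r) * h + b_r      # fine, frame-wise prosodic shift
        return h

adaptor = TwoLevelStyleAdaptor(hidden_dim=256, style_dim=64)
out = adaptor(torch.randn(2, 100, 256), torch.randn(2, 64), torch.randn(2, 100, 64))
print(out.shape)  # torch.Size([2, 100, 256])
```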
DMOSpeech: Direct Metric Optimization via Distilled Diffusion Model in Zero-Shot Speech Synthesis
Diffusion models are computationally intensive due to their iterative denoising process.
DMOSpeech
Uses a distilled diffusion-based model to achieve faster inference than its teacher.
Supports end-to-end optimization of Connectionist Temporal Classification and Speaker Verification losses.
Paper (ICML 2025): Paper Link
1. Introduction
SpeechX, MaskGC…
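The sketch below illustrates, in generic PyTorch, what optimizing generated speech against a CTC (intelligibility) metric and a speaker-verification-style (similarity) metric end-to-end can look like; `asr_head` and `speaker_encoder` are hypothetical stand-ins and the loss weighting is omitted, so this is not DMOSpeech's actual objective.

```python
# A minimal sketch combining a CTC loss with a cosine speaker-similarity loss on
# generated speech. The asr_head and speaker_encoder modules are assumptions.
import torch
import torch.nn.functional as F

def metric_losses(gen_frames, ref_frames, text_tokens, text_lens, frame_lens,
                  asr_head, speaker_encoder, blank_id=0):
    # gen_frames: (batch, frames, dim) features of the generated speech
    # ref_frames: (batch, frames, dim) features of the reference (prompt) speech
    # text_tokens / text_lens / frame_lens: padded targets and length tensors
    # 1) CTC loss: an ASR head predicts token log-probs from the generated speech.
    log_probs = F.log_softmax(asr_head(gen_frames), dim=-1)        # (B, T, vocab)
    ctc = F.ctc_loss(log_probs.transpose(0, 1), text_tokens,
                     frame_lens, text_lens, blank=blank_id, zero_infinity=True)
    # 2) Speaker loss: pull generated and reference speaker embeddings together.
    emb_gen = F.normalize(speaker_encoder(gen_frames), dim=-1)     # (B, emb)
    emb_ref = F.normalize(speaker_encoder(ref_frames), dim=-1)
    sv = (1.0 - (emb_gen * emb_ref).sum(dim=-1)).mean()
    return ctc + sv
```

Because the distilled student produces speech in a small, fixed number of steps, gradients from such metric losses can flow back through the whole generation path, which is what makes this kind of end-to-end optimization tractable.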
FillerSpeech: Towards Human-Like Text-to-Speech Synthesis with Filler Insertion and Filler Style Control
Human-like conversational speech synthesis requires natural filler insertion.
FillerSpeech
Tokenizes filler style and applies cross-attention over the input text.
Additionally introduces Large Language Model-based filler prediction for natural filler insertion.
Paper (EMNLP 2025): Paper Link
1. Introduction
HierSpeech, Vo…
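A minimal sketch of fusing tokenized filler-style embeddings into text hidden states with cross-attention is shown below; which side provides queries versus keys/values, and the residual fusion, are assumptions rather than details from the paper.

```python
# A minimal sketch: text hidden states attend to a learned codebook of
# filler-style tokens. Module names, sizes, and the fusion scheme are hypothetical.
import torch
import torch.nn as nn

class FillerStyleCrossAttention(nn.Module):
    def __init__(self, hidden_dim=256, num_style_tokens=32, num_heads=4):
        super().__init__()
        # Learned codebook of filler-style tokens.
        self.style_tokens = nn.Embedding(num_style_tokens, hidden_dim)
        self.attn = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(hidden_dim)

    def forward(self, text_hidden, style_ids):
        # text_hidden: (batch, text_len, hidden_dim)
        # style_ids:   (batch, n_styles) indices of the filler-style tokens to use
        style = self.style_tokens(style_ids)                   # (B, n_styles, H)
        ctx, _ = self.attn(query=text_hidden, key=style, value=style)
        return self.norm(text_hidden + ctx)                    # residual fusion

layer = FillerStyleCrossAttention()
out = layer(torch.randn(2, 50, 256), torch.randint(0, 32, (2, 4)))
print(out.shape)  # torch.Size([2, 50, 256])
```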
ZipVoice: Fast and High-Quality Zero-Shot Text-to-Speech with Flow Matching
Existing large-scale text-to-speech models suffer from slow inference due to their massive parameter counts.
ZipVoice
Introduces a Zipformer-based vector field estimator and text encoder, and uses an average upsampling-based initial speech-text alignment.
Additionally adopts a flow distillation method to reduce the number of sampling steps.
Paper (ASRU 2025): Paper Link
1. Introduction
VALL-E, VoiceBox, MaskGCT, and similar z…
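The sketch below shows a generic conditional flow-matching training step with the usual linear interpolation path, plus a crude average-upsampling helper that gives every text token an equal share of frames; the estimator is a placeholder for the Zipformer-based vector field estimator, and the exact path and alignment details are assumptions.

```python
# A minimal conditional flow-matching sketch, assuming the common path
# x_t = (1 - t) * noise + t * speech with constant target velocity (speech - noise).
import torch
import torch.nn.functional as F

def flow_matching_loss(estimator, speech, text_cond):
    # speech:    (batch, frames, mel_dim) target mel features
    # text_cond: (batch, frames, dim) text condition already upsampled to frames
    noise = torch.randn_like(speech)
    t = torch.rand(speech.size(0), 1, 1, device=speech.device)   # per-sample time
    x_t = (1.0 - t) * noise + t * speech                         # interpolation path
    target_v = speech - noise                                    # constant velocity
    pred_v = estimator(x_t, t.squeeze(-1).squeeze(-1), text_cond)
    return F.mse_loss(pred_v, target_v)

def average_upsample(token_emb, num_frames):
    # token_emb: (batch, tokens, dim); stretch tokens so each covers an (almost)
    # equal share of frames -- a rough initial speech-text alignment only.
    return F.interpolate(token_emb.transpose(1, 2), size=num_frames,
                         mode="nearest").transpose(1, 2)
```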
DiffStyleTTS: Diffusion-based Hierarchical Prosody Modeling for Text-to-Speech with Diverse and Controllable Styles
Rich, flexible prosodic variation requires solving the one-to-many mapping problem from text to prosody.
DiffStyleTTS
Leverages a conditional diffusion module with classifier-free guidance.
Hierarchically models speech prosodic features and controls diverse prosodic styles.
Paper (Coling 2025): Paper Link
1. Introduction
Tex…
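Classifier-free guidance is typically applied at sampling time as below: the denoiser is evaluated with and without the style condition and the two predictions are mixed with a guidance scale; the module names, null condition, and scale value here are illustrative, not DiffStyleTTS's settings.

```python
# A minimal classifier-free guidance sketch; denoiser and null_cond are
# hypothetical stand-ins for the conditional diffusion module.
import torch

@torch.no_grad()
def guided_prediction(denoiser, x_t, t, style_cond, null_cond, guidance_scale=2.0):
    # x_t: noisy prosody features at diffusion step t
    eps_cond = denoiser(x_t, t, style_cond)     # conditioned on the prosodic style
    eps_uncond = denoiser(x_t, t, null_cond)    # "unconditional" (dropped) branch
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```

During training, the condition is usually replaced with the null condition at random (commonly around 10% of steps) so that a single network learns both the conditional and unconditional branches, and the guidance scale then trades diversity against adherence to the target style.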
SimpleSpeech2: Towards Simple and Efficient Text-to-Speech with Flow-based Scalar Latent Transformer Diffusion Models
Non-autoregressive text-to-speech models carry extra complexity from duration alignment.
SimpleSpeech2
Combines autoregressive and non-autoregressive approaches into a straightforward model.
Supports simplified data preparation, fast inference, and stable generation.
Paper (TASLP 2025): Paper Link
1. Introduction …
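As a loose illustration of a scalar latent, the sketch below rounds each latent dimension to a fixed grid with a straight-through estimator; whether SimpleSpeech2 defines its scalar latent exactly this way, and the chosen range and number of levels, are assumptions.

```python
# A minimal scalar-quantization sketch with a straight-through estimator;
# the value range and number of levels are made up for illustration.
import torch

def scalar_quantize(z, num_levels=256):
    # z: (batch, frames, dim) latent, squashed to (-1, 1) and rounded to a grid;
    # the straight-through trick keeps gradients flowing to z during training.
    z = torch.tanh(z)
    scale = (num_levels - 1) / 2.0
    z_q = torch.round(z * scale) / scale
    return z + (z_q - z).detach()
```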
