MambaVoiceCloning: Efficient and Expressive Text-to-Speech via State-Space Modeling and Diffusion Control
State-space models can be introduced into diffusion-based Text-to-Speech.
MambaVoiceCloning
Combines a gated bidirectional Mamba text encoder, a temporal Bi-Mamba, and an expressive Mamba to provide linear-time $\mathcal{O}(T)$ conditioning.
At inference, under a fixed mel-diffusion-vocoder backbone, attention-based duration, style modu..

FlexiVoice: Enabling Flexible Style Control in Zero-Shot TTS with Natural Language Instructions
Zero-shot Text-to-Speech should support flexible style control.
FlexiVoice
Supports accurate, flexible style control through Progressive Post-Training.
In particular, applies Direct Preference Optimization and multi-objective Group Relative Policy Optimization.
Paper (ICLR 2026): Paper Link
1. Introduction
For zero-shot Text-to-Speech (TTS), Cos..

DMOSpeech2: Reinforcement Learning for Duration Prediction in Metric-Optimized Speech Synthesis
Optimizing the components of diffusion-based Text-to-Speech for perceptual metrics is difficult.
DMOSpeech2
Applies Group Relative Preference Optimization with speaker similarity and Word Error Rate as rewards.
Additionally improves output diversity through teacher-guided sampling.
Paper (AAAI 2026): Paper Link
1. Introduction
NaturalSpeech, StyleTTS2 and..

IndexTTS2: A Breakthrough in Emotionally Expressive and Duration-Controlled Auto-Regressive Zero-Shot Text-to-Speech
Large-scale autoregressive Text-to-Speech models struggle to control the duration of synthesized speech because they generate token by token.
IndexTTS2
Controls duration either by explicitly specifying the number of tokens or by generating freely in an autoregressive manner.
Disentanglement between emotional expression and speaker identity..

MELA-TTS: Joint Transformer-Diffusion Model with Representation Alignment for Speech Synthesis
A joint Transformer-Diffusion framework can be used for end-to-end Text-to-Speech.
MELA-TTS
Autoregressively generates continuous mel-spectrograms from linguistic and speaker conditions.
A representation alignment module that aligns the Transformer decoder's output representation with the semantic embedding of a pre-trained ASR encoder..

Cross-Lingual F5-TTS: Towards Language-Agnostic Voice Cloning and Speech Synthesis
Flow-matching-based Text-to-Speech models are difficult to apply to cross-lingual tasks.
Cross-Lingual F5-TTS
Pre-processes the audio prompt with forced alignment to obtain word boundaries, then performs direct synthesis from the audio prompt.
Introduces speaking rate predictors at different linguistic granularities for duration modeling.
Paper (ICASSP 2026): Paper Link
1. Introduc..
