SimpleSpeech2: Towards Simple and Efficient Text-to-Speech with Flow-based Scalar Latent Transformer Diffusion Models
- Non-autoregressive text-to-speech models suffer from the complexity introduced by duration alignment.
- SimpleSpeech2 combines autoregressive and non-autoregressive approaches into a straightforward model.
- Supports simplified data preparation, fast inference, and stable generation.
- Paper (TASLP 2025): Paper Link
Word-Level Emotional Expression Control in Zero-Shot Text-to-Speech Synthesis
- Most emotional text-to-speech systems struggle with word-level control.
- WeSCon is a self-training framework that controls emotion and speaking rate from a pre-trained zero-shot text-to-speech model.
- Introduces a transition-smoothing strategy and a dynamic speed control mechanism to guide word-level expressive synthesis.
- At inference: a dynamic emotional attention bias mechanism …
Shallow Flow Matching for Coarse-to-Fine Text-to-Speech Synthesis
- Flow Matching-based text-to-speech models leave room for improvement.
- Shallow Flow Matching (SFM) constructs an intermediate state along the Flow Matching path from a coarse representation.
- Introduces an orthogonal projection to adaptively determine the temporal position of that state.
- Paper (NeurIPS 2025): Paper Link
- 1. Introduction: Flow Matching models such as VoiceBox, ReFlow-TTS, and VoiceFlow …
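The SFM idea above — starting the flow from an intermediate state on the Flow Matching path instead of pure noise, then integrating only the remaining "shallow" segment — can be sketched as follows. This is a minimal illustration assuming the common linear (rectified-flow) path $x_t = (1-t)x_0 + t x_1$; all function names are illustrative and not from the paper.

```python
import numpy as np

def intermediate_state(x1, t, rng=None):
    """Construct a state x_t at time t on a linear flow-matching path.

    x_t = (1 - t) * x0 + t * x1, where x0 is Gaussian noise and x1 the
    target sample (standing in for a coarse acoustic representation).
    Illustrative sketch, not the SFM paper's exact construction.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    x0 = rng.standard_normal(x1.shape)   # noise endpoint of the path
    return (1.0 - t) * x0 + t * x1

def euler_integrate(x_t, t_start, velocity_fn, steps=10):
    """Integrate dx/dt = v(x, t) from t_start to 1 with Euler steps,
    i.e. run only the shallow remainder of the flow."""
    dt = (1.0 - t_start) / steps
    t = t_start
    for _ in range(steps):
        x_t = x_t + dt * velocity_fn(x_t, t)
        t += dt
    return x_t

# Toy check: with the oracle velocity v = x1 - x0 the linear path is
# exact, so starting at t = 0.5 and integrating to t = 1 recovers x1.
rng = np.random.default_rng(0)
x1 = np.ones((4, 8))
x0 = rng.standard_normal(x1.shape)
x_half = 0.5 * x0 + 0.5 * x1
out = euler_integrate(x_half, 0.5, lambda x, t: x1 - x0, steps=10)
print(np.allclose(out, x1))  # True
```

Starting from `t > 0` halves (or better) the number of solver steps spent on the flow, which is the efficiency argument behind coarse-to-fine generation.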
HierSpeech++: Bridging the Gap Between Semantic and Acoustic Representation of Speech by Hierarchical Variational Inference for Zero-Shot Speech Synthesis
- Zero-shot speech synthesis is limited by inference speed and robustness.
- HierSpeech++ improves naturalness through a hierarchical synthesis framework.
- Introduces a Text-to-Vec framework that generates self-supervised and $F0$ representations from the text representation and a prosody prompt, and 16k…
ControlSpeech: Towards Simultaneous and Independent Zero-Shot Speaker Cloning and Zero-Shot Language Style Control
- Text-to-speech models that support speaking-style control and adjustment are needed.
- ControlSpeech takes a speech prompt, content prompt, and style prompt as input and captures codec representations through bidirectional attention and mask-based parallel decoding.
- Introduces a Style Mixture Semantic Density module for textual style control …
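Mask-based parallel decoding, mentioned above, is the MaskGIT-style scheme of predicting all masked token positions at once, keeping only the most confident predictions, and re-masking the rest over a few rounds. Below is a generic sketch of that loop, not ControlSpeech's exact decoder; `logits_fn` stands in for the codec-token transformer and all names are illustrative.

```python
import numpy as np

MASK = -1  # sentinel for an undecided token position

def mask_parallel_decode(logits_fn, seq_len, steps=4):
    """MaskGIT-style iterative parallel decoding (generic sketch).

    Each round: predict every masked position in parallel, commit the
    most confident predictions, and re-mask the rest following a cosine
    schedule so that all positions are decoded by the final step.
    """
    tokens = np.full(seq_len, MASK)
    for step in range(steps):
        logits = logits_fn(tokens)                       # (seq_len, vocab)
        probs = np.exp(logits - logits.max(-1, keepdims=True))
        probs /= probs.sum(-1, keepdims=True)
        pred = probs.argmax(-1)
        # Already-committed tokens stay fixed and are never re-masked.
        new_tokens = np.where(tokens == MASK, pred, tokens)
        conf = np.where(tokens == MASK, probs.max(-1), np.inf)
        # Cosine schedule: how many positions remain masked after this step.
        keep_masked = int(seq_len * np.cos(np.pi / 2 * (step + 1) / steps))
        order = np.argsort(conf)                         # least confident first
        new_tokens[order[:keep_masked]] = MASK
        tokens = new_tokens
    return tokens

# Toy check with a dummy model that always prefers token id 3.
def dummy_logits(tokens):
    logits = np.zeros((len(tokens), 5))
    logits[:, 3] = 5.0
    return logits

out = mask_parallel_decode(dummy_logits, seq_len=8, steps=4)
print((out == 3).all())  # True
```

Decoding all positions in parallel over a constant number of rounds is what makes this non-autoregressive scheme fast relative to token-by-token generation.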
PEFT-TTS: Parameter-Efficient Fine-Tuning for Low-Resource Text-to-Speech via Cross-Lingual Continual Learning
- A model for low-resource text-to-speech is needed.
- PEFT-TTS introduces three adapters for parameter-efficient fine-tuning.
- Uses a Condition Adapter to improve text embeddings, a Prompt Adapter to refine input representations, and a DiT LoRA Adapter to improve generation efficiency.
- Paper (INTERSPEECH 2025): Paper Link
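The LoRA mechanism behind an adapter like the DiT LoRA Adapter above can be sketched in a few lines: a frozen weight $W$ is augmented with a trainable low-rank update $\frac{\alpha}{r} B A$, so only $r(d_{in}+d_{out})$ parameters are tuned. This is a generic LoRA sketch under standard conventions, not the paper's specific adapter; the class name and defaults are illustrative.

```python
import numpy as np

class LoRALinear:
    """Minimal LoRA adapter wrapped around a frozen linear layer.

    Forward pass: y = x @ W.T + (alpha / r) * (x @ A.T) @ B.T
    Only A and B would be trained; W stays frozen. Generic sketch,
    not the DiT LoRA Adapter from the paper.
    """
    def __init__(self, weight, r=4, alpha=8, rng=None):
        rng = np.random.default_rng(0) if rng is None else rng
        out_dim, in_dim = weight.shape
        self.W = weight                                    # frozen pre-trained weight
        self.A = rng.standard_normal((r, in_dim)) * 0.01   # down-projection (random init)
        self.B = np.zeros((out_dim, r))                    # up-projection (zero init)
        self.scale = alpha / r

    def __call__(self, x):
        return x @ self.W.T + self.scale * (x @ self.A.T) @ self.B.T

W = np.random.default_rng(1).standard_normal((16, 32))
layer = LoRALinear(W)
x = np.ones((2, 32))
# With B zero-initialized, the adapted layer starts as an exact no-op,
# so fine-tuning begins from the pre-trained model's behavior.
print(np.allclose(layer(x), x @ W.T))  # True
```

The zero-initialized up-projection is the standard LoRA trick that makes continual fine-tuning safe: the adapted model is identical to the base model at step zero, and the update grows only as `A`/`B` are trained.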
