PFluxTTS: Hybrid Flow-Matching TTS with Robust Cross-Lingual Voice Cloning and Inference-Time Model Fusion
Flow-matching text-to-speech models face a stability-naturalness trade-off, difficulty with cross-lingual voice cloning, and limited synthesis quality from low-rate mel features. PFluxTTS: introduces a dual-decoder design that combines a duration-guided model and an alignment-free model through inference-time vector-field fusion. The FLUX-based decoder's speech pro..
ARCHI-TTS: A Flow-Matching-Based Text-to-Speech Model with Self-Supervised Semantic Aligner and Accelerated Inference
Diffusion-based non-autoregressive text-to-speech models suffer from text-speech alignment problems and high computational overhead. ARCHI-TTS: introduces a dedicated semantic aligner that ensures robust temporal and semantic consistency between text and audio, and reuses encoder features across denoising steps to speed up inference. Paper (ICASSP 202..
ParaStyleTTS: Toward Efficient and Robust Paralinguistic Style Control for Expressive Text-to-Speech Generation
Speaking-style control in text-to-speech systems remains limited. ParaStyleTTS: introduces a two-level style adaptation architecture that separates prosodic and paralinguistic speech style modeling. It also maintains consistent style under low-resource deployment and across diverse prompt formulations. Paper (CIKM 2025): Paper Link. 1. Introduc..
DMOSpeech: Direct Metric Optimization via Distilled Diffusion Model in Zero-Shot Speech Synthesis
Diffusion models are computationally intensive because of their iterative denoising process. DMOSpeech: uses a distilled diffusion-based model to achieve faster inference than its teacher, and supports end-to-end optimization of Connectionist Temporal Classification and Speaker Verification losses. Paper (ICML 2025): Paper Link. 1. Introduction: SpeechX, MaskGC..
FillerSpeech: Towards Human-Like Text-to-Speech Synthesis with Filler Insertion and Filler Style Control
Human-like conversational speech synthesis requires natural filler insertion. FillerSpeech: tokenizes filler style and applies cross-attention with the input text. It also introduces Large Language Model-based filler prediction that enables natural filler insertion. Paper (EMNLP 2025): Paper Link. 1. Introduction: HierSpeech, Vo..
ZipVoice: Fast and High-Quality Zero-Shot Text-to-Speech with Flow Matching
Existing large-scale text-to-speech models are slow at inference because of their massive parameter counts. ZipVoice: introduces a Zipformer-based vector field estimator and text encoder, and uses average upsampling for the initial speech-text alignment. It also introduces a flow distillation method to reduce the number of sampling steps. Paper (ASRU 2025): Paper Link. 1. Introduction: VALL-E, VoiceBox, MaskGCT, and similar z..
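
The ZipVoice teaser above mentions average upsampling as the initial speech-text alignment. As a rough illustration only, the sketch below assumes this means giving every text token an approximately equal share of the speech frames. The function name `average_upsample` and the list-based representation are hypothetical and not taken from the paper or its code.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def average_upsample(token_features: Sequence[T], num_frames: int) -> List[T]:
    """Stretch per-token features to `num_frames` frames.

    Assumption (not from the source): every token is given roughly the same
    number of frames, so frame t is mapped to token floor(t * N / num_frames).
    """
    num_tokens = len(token_features)
    if num_tokens == 0:
        raise ValueError("token_features must not be empty")
    if num_frames < 0:
        raise ValueError("num_frames must be non-negative")
    # Integer arithmetic avoids floating-point rounding at token boundaries.
    return [token_features[(t * num_tokens) // num_frames] for t in range(num_frames)]


if __name__ == "__main__":
    # Three tokens spread over seven frames.
    print(average_upsample(["a", "b", "c"], 7))
```

When `num_frames` is smaller than the number of tokens, this mapping skips some tokens. A real system would need to handle that case explicitly.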
