
NaturalSpeech3: Zero-Shot Speech Synthesis with Factorized Codec and Diffusion Models
- Large-scale text-to-speech systems still fall short in terms of prosody and similarity
- NaturalSpeech3
  - Leverages a neural codec based on Factorized Vector Quantization that disentangles the speech waveform into content, prosody, timbre, and acoustic-detail subspaces
  - Introduces a factorized diffusion model that generates the attributes of each subspace according to the prompt
- Paper (ICML 2024) : Paper..
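The factorized quantization idea can be sketched minimally: a separate codebook per disentangled subspace, with each factor's latent snapped to its nearest code. The factor names, dimensions, and codebook sizes below are illustrative assumptions, not the paper's actual configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: one small codebook per factorized subspace.
# Sizes and factor names are illustrative, not NaturalSpeech3's config.
FACTORS = ["content", "prosody", "timbre", "detail"]
DIM, CODES = 8, 16
codebooks = {f: rng.normal(size=(CODES, DIM)) for f in FACTORS}

def fvq_encode(latents):
    """Quantize each factor's latent to its nearest codebook entry."""
    tokens = {}
    for f in FACTORS:
        z = latents[f]                                # latent for this subspace
        d = np.linalg.norm(codebooks[f] - z, axis=1)  # distance to every code
        tokens[f] = int(np.argmin(d))                 # one discrete token per factor
    return tokens

latents = {f: rng.normal(size=DIM) for f in FACTORS}
tokens = fvq_encode(latents)
print(tokens)  # one discrete index per disentangled subspace
```

Because each subspace gets its own codebook, an attribute such as prosody can be swapped or regenerated without touching the tokens of the other factors.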

NaturalSpeech2: Latent Diffusion Models are Natural and Zero-Shot Speech and Singing Synthesizers
- Existing large-scale text-to-speech systems quantize speech into discrete tokens and process those tokens with a language model
  - As a result, problems such as unstable prosody and word skipping/repeating arise
- NaturalSpeech2
  - Leverages a neural audio codec based on a residual vector quantizer to obtain quantized latent vectors
  - Then leverages a diffusion model using the text input..
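A residual vector quantizer refines a latent over several stages: each stage quantizes the residual the previous stage left behind, so the reconstruction is the sum of the selected codes. A minimal sketch with illustrative sizes (not the codec's real configuration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Minimal residual vector quantization sketch; all sizes are illustrative.
N_STAGES, CODES, DIM = 4, 32, 8
codebooks = rng.normal(size=(N_STAGES, CODES, DIM))

def rvq_encode(z):
    """Each stage quantizes the residual left by the previous stage."""
    residual, indices = z.copy(), []
    for cb in codebooks:
        d = np.linalg.norm(cb - residual, axis=1)  # distance to each code
        i = int(np.argmin(d))
        indices.append(i)
        residual -= cb[i]           # later stages refine what earlier ones missed
    return indices, z - residual    # token ids and the reconstructed latent

z = rng.normal(size=DIM)
ids, z_hat = rvq_encode(z)          # z_hat is the sum of the selected codes
```

The stacked indices are the "discrete token" view that language-model-based systems operate on; NaturalSpeech2 instead keeps the continuous quantized latent and models it with diffusion.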

ATP-TTS: Adaptive Thresholding Pseudo-Labeling for Low-Resource Multi-Speaker Text-to-Speech
- Text-to-Speech is hard to apply in low-resource scenarios
- ATP-TTS
  - Selects suitable pseudo-labels through Adaptive Thresholding
  - Then predicts latent representations using an Automatic Speech Recognition model enhanced with contrastive learning perturbation
- Paper (ICASSP 2025) : Paper Link

1. Introduction
Su.. such as Glow-TTS, VITS, and NaturalSpeech
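The selection step can be illustrated with a toy adaptive-thresholding rule. The specific update below (a running-mean threshold, so the bar tracks how confident recent predictions have been) is an assumption for illustration, not ATP-TTS's exact formulation:

```python
# Toy adaptive-thresholding sketch; the update rule is an assumption,
# not ATP-TTS's actual criterion.
def select_pseudo_labels(confidences, base=0.9, momentum=0.9):
    """Keep pseudo-labels whose confidence clears an adapting threshold."""
    threshold, selected = base, []
    for i, c in enumerate(confidences):
        if c >= threshold:
            selected.append(i)
        # move the threshold toward recently observed confidence levels
        threshold = momentum * threshold + (1 - momentum) * c
    return selected

confs = [0.95, 0.40, 0.85, 0.92, 0.60, 0.97]
print(select_pseudo_labels(confs))  # → [0, 3, 5]
```

A fixed threshold would either starve training of pseudo-labels early on or admit noisy ones later; adapting it trades the two off automatically.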

SSR-Speech: Towards Stable, Safe and Robust Zero-Shot Text-based Speech Editing and Synthesis
- A stable, safe, and robust zero-shot text-to-speech model is needed
- SSR-Speech
  - Incorporates classifier-free guidance on top of a Transformer decoder
  - Embeds a frame-level watermark for edited regions via Watermark EnCodec
- Paper (ICASSP 2025) : Paper Link

1. Introduction
Zero-shot text-based speech generation models such as YourTTS, Speech..
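Classifier-free guidance combines a conditional and an unconditional prediction at inference time, extrapolating past the conditional one to strengthen adherence to the prompt. The toy arrays below are stand-ins, not SSR-Speech's model outputs:

```python
import numpy as np

def cfg_combine(cond_logits, uncond_logits, scale=2.0):
    """Classifier-free guidance: extrapolate from the unconditional
    prediction through the conditional one by a guidance scale."""
    return uncond_logits + scale * (cond_logits - uncond_logits)

# Illustrative stand-in predictions (not real model outputs).
cond = np.array([2.0, 0.5, -1.0])
uncond = np.array([1.0, 1.0, 1.0])
print(cfg_combine(cond, uncond))  # → [ 3.  0. -3.]
```

With `scale=1.0` this reduces to the plain conditional prediction; larger scales push the output further toward what the condition implies.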

Evidential-TTS: High Fidelity Zero-Shot Text-to-Speech Using Evidential Deep Learning
- Evidential Deep Learning can be leveraged for zero-shot text-to-speech
- Evidential-TTS
  - Uses Iterative Parallel Decoding to convert aligned phoneme sequences into acoustic tokens
  - Introduces model uncertainty based on Evidential Deep Learning optimization, providing a reliable sampling path for high-quality speech generation
- Paper (ICASSP 2025) : Pape..
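In the common Dirichlet formulation of evidential deep learning, the network outputs non-negative "evidence" per class, and uncertainty falls as total evidence grows. A minimal sketch under that assumption (the exact objective Evidential-TTS optimizes is not shown here):

```python
import numpy as np

def dirichlet_uncertainty(evidence):
    """Treat evidence as Dirichlet parameters (alpha = evidence + 1);
    vacuity-style uncertainty is K / sum(alpha)."""
    alpha = np.asarray(evidence) + 1.0   # Dirichlet concentration parameters
    K = alpha.size                       # number of classes
    probs = alpha / alpha.sum()          # expected class probabilities
    uncertainty = K / alpha.sum()        # high when total evidence is low
    return probs, uncertainty

# Plenty of evidence vs. almost none: uncertainty separates the two cases.
_, u_low = dirichlet_uncertainty([20.0, 1.0, 1.0])
_, u_high = dirichlet_uncertainty([0.2, 0.1, 0.1])
print(u_low, u_high)
```

A decoder can use such a score to keep confidently predicted tokens and resample uncertain ones, which is the kind of reliable sampling path the summary describes.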

LEF-TTS: Lightweight and Efficient End-to-End Text-to-Speech Synthesis with Multi-Stream Generator
- Demand for lightweight and efficient Text-to-Speech models has been growing
- LEF-TTS
  - Applies Single Head Fast Linear Attention on top of EfficientTTS2
  - Introduces ConvWaveNet and a multi-stream iSTFT generator to improve inference speed
- Paper (ICASSP 2025) : Paper Link

1. Introduction
Compared to two-stage TTS models such as FastSpeech and FastSpeech2, models such as VITS are end-..
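The multi-stream trick is that each stream models a downsampled signal, so every network layer runs at a fraction of the audio sample rate; the full-rate waveform is recovered by combining the streams. In a real multi-stream iSTFT generator each stream is produced by an inverse STFT; the sketch below assumes the streams are given directly and shows only the sample-wise interleave:

```python
import numpy as np

def combine_streams(streams):
    """Interleave S low-rate streams of length T into one waveform
    of length S*T (the upsampling step of a multi-stream generator)."""
    streams = np.stack(streams)      # shape (S, T)
    S, T = streams.shape
    return streams.T.reshape(S * T)  # sample-wise interleave across streams

# Two illustrative 2-sample streams -> one 4-sample waveform.
s0 = np.array([1.0, 3.0])
s1 = np.array([2.0, 4.0])
print(combine_streams([s0, s1]))  # → [1. 2. 3. 4.]
```

Running the generator at 1/S of the output rate is where the inference-speed gain comes from.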