
NaturalSpeech3: Zero-Shot Speech Synthesis with Factorized Codec and Diffusion Models
Large-scale text-to-speech systems still fall short in prosody and similarity.
NaturalSpeech3
- Employs a neural codec based on Factorized Vector Quantization that disentangles the speech waveform into content, prosody, timbre, and acoustic-detail subspaces
- Introduces a factorized diffusion model that generates the attributes of each subspace according to the given prompt
Paper (ICML 2024): Paper Link
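A minimal sketch of the factorized vector quantization idea behind NaturalSpeech3's codec: each encoder frame is projected into separate attribute subspaces (content, prosody, timbre, acoustic detail), and every subspace is quantized with its own codebook. The dimensions, codebook sizes, and the straight-through trick below are illustrative assumptions in PyTorch, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class SubspaceVQ(nn.Module):
    """Nearest-neighbour vector quantizer for one attribute subspace."""
    def __init__(self, dim: int, codebook_size: int):
        super().__init__()
        self.codebook = nn.Embedding(codebook_size, dim)

    def forward(self, x: torch.Tensor):
        # x: (batch, frames, dim) -> index of the nearest codebook entry per frame
        flat = x.reshape(-1, x.size(-1))
        idx = torch.cdist(flat, self.codebook.weight).argmin(dim=-1)
        idx = idx.reshape(x.shape[:-1])                   # (B, T)
        quantized = self.codebook(idx)                    # (B, T, dim)
        # straight-through estimator so gradients reach the encoder
        quantized = x + (quantized - x).detach()
        return quantized, idx

class FactorizedCodec(nn.Module):
    def __init__(self, frame_dim=256, sub_dim=64,
                 subspaces=("content", "prosody", "timbre", "detail")):
        super().__init__()
        # one projection + one codebook per disentangled attribute subspace
        self.project = nn.ModuleDict(
            {name: nn.Linear(frame_dim, sub_dim) for name in subspaces})
        self.quantize = nn.ModuleDict(
            {name: SubspaceVQ(sub_dim, codebook_size=1024) for name in subspaces})
        self.merge = nn.Linear(sub_dim * len(subspaces), frame_dim)

    def forward(self, frames: torch.Tensor):
        # frames: (B, T, frame_dim) encoder output for a speech utterance
        parts, codes = [], {}
        for name in self.project:
            z = self.project[name](frames)
            q, idx = self.quantize[name](z)
            parts.append(q)
            codes[name] = idx
        recon = self.merge(torch.cat(parts, dim=-1))      # recombine subspaces
        return recon, codes

codec = FactorizedCodec()
recon, codes = codec(torch.randn(2, 100, 256))
print(recon.shape, {k: v.shape for k, v in codes.items()})
```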

NaturalSpeech2: Latent Diffusion Models are Natural and Zero-Shot Speech and Singing Synthesizers
Existing large-scale text-to-speech systems quantize speech into discrete tokens and process those tokens with a language model
- As a result, they suffer from problems such as unstable prosody and word skipping/repeating
NaturalSpeech2
- Uses a neural audio codec based on a residual vector quantizer to obtain quantized latent vectors
- Then applies a diffusion model to generate these latent vectors conditioned on the text input
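A minimal sketch of residual vector quantization, the codec component NaturalSpeech2 builds its latents on: each stage quantizes the residual left by the previous stage, so the sum of the selected codebook entries approximates the continuous latent. Codebook sizes and the number of stages are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ResidualVQ(nn.Module):
    def __init__(self, dim=128, codebook_size=1024, num_quantizers=8):
        super().__init__()
        self.codebooks = nn.ModuleList(
            nn.Embedding(codebook_size, dim) for _ in range(num_quantizers))

    def forward(self, z: torch.Tensor):
        # z: (B, T, dim) continuous latent from the codec encoder
        residual = z
        quantized = torch.zeros_like(z)
        codes = []
        for codebook in self.codebooks:
            flat = residual.reshape(-1, residual.size(-1))
            idx = torch.cdist(flat, codebook.weight).argmin(dim=-1)
            idx = idx.reshape(z.shape[:-1])               # (B, T)
            q = codebook(idx)
            quantized = quantized + q                     # running reconstruction
            residual = residual - q                       # pass the remainder on
            codes.append(idx)
        # straight-through estimator for encoder gradients
        quantized = z + (quantized - z).detach()
        return quantized, torch.stack(codes, dim=-1)      # (B, T, num_quantizers)

rvq = ResidualVQ()
quantized, codes = rvq(torch.randn(2, 100, 128))
print(quantized.shape, codes.shape)
```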

ConSinger: Efficient High-Fidelity Singing Voice Generation with Minimal Steps
Singing voice synthesis with diffusion models can produce high-quality samples but is limited by slow inference
ConSinger
- Adopts a Consistency Model to perform singing voice synthesis with only a minimal number of steps
- In particular, applies a consistency constraint during training
Paper (ICASSP 2025): Paper Link
1. Introduction
Singing Voice Synthesis (SVS) aims to generate emotionally realistic human audio..
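A minimal sketch of the consistency constraint used by consistency models: the network should map two neighbouring points on the same noising trajectory to the same output, which is what enables few-step sampling. The toy mel-spectrogram network, noise schedule, and plain MSE distance are assumptions for illustration, not ConSinger's exact recipe.

```python
import torch
import torch.nn as nn

class Denoiser(nn.Module):
    """Toy stand-in for a mel-spectrogram consistency network f(x_t, t)."""
    def __init__(self, mel_bins=80):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(mel_bins + 1, 256), nn.SiLU(),
                                 nn.Linear(256, mel_bins))

    def forward(self, x_t, t):
        # x_t: (B, T, mel_bins); the noise level t is appended as an extra feature
        t_feat = t.view(-1, 1, 1).expand(-1, x_t.size(1), 1)
        return self.net(torch.cat([x_t, t_feat], dim=-1))

def consistency_loss(model, ema_model, mel, sigmas):
    """One training step of the consistency constraint."""
    B = mel.size(0)
    n = torch.randint(0, len(sigmas) - 1, (B,))
    s_next, s_cur = sigmas[n + 1], sigmas[n]              # adjacent noise levels
    noise = torch.randn_like(mel)
    x_next = mel + s_next.view(-1, 1, 1) * noise          # more-noised point
    x_cur = mel + s_cur.view(-1, 1, 1) * noise            # less-noised point, same trajectory
    pred = model(x_next, s_next)
    with torch.no_grad():
        target = ema_model(x_cur, s_cur)                  # EMA teacher output
    return torch.mean((pred - target) ** 2)

model, ema_model = Denoiser(), Denoiser()
ema_model.load_state_dict(model.state_dict())
sigmas = torch.linspace(0.002, 80.0, steps=18)            # discretized noise schedule
loss = consistency_loss(model, ema_model, torch.randn(4, 120, 80), sigmas)
print(loss.item())
```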

Make-A-Voice: Revisiting Voice Large Language Models as Scalable Multilingual and Multitask Learners
Voice Large Language Models are mostly restricted to a single task and a single language
Make-A-Voice
- Builds a scalable learner using an end-to-end local/global multiscale transformer
- Improves in-context learning by sharing common knowledge and generalizing to unseen tasks
- Supports a multilingual learner that addresses the data-scarcity problem of low-resource languages
Paper..
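A minimal sketch of a local/global multiscale transformer of the kind Make-A-Voice builds on: a global transformer models the sequence of frames (each frame being a stack of codec tokens), while a small local transformer models the tokens inside each frame conditioned on the global frame state. Vocabulary size, depths, and dimensions are assumptions, and causal masking / autoregressive training are omitted for brevity.

```python
import torch
import torch.nn as nn

class MultiscaleTransformer(nn.Module):
    def __init__(self, vocab=1024, num_codebooks=8, dim=256):
        super().__init__()
        self.token_emb = nn.Embedding(vocab, dim)
        # global model over frames, local model over tokens within a frame
        self.global_tf = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True), num_layers=4)
        self.local_tf = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True), num_layers=2)
        self.head = nn.Linear(dim, vocab)
        self.num_codebooks = num_codebooks

    def forward(self, codes):
        # codes: (B, frames, num_codebooks) acoustic codec tokens
        B, T, Q = codes.shape
        emb = self.token_emb(codes)                        # (B, T, Q, dim)
        frame_repr = emb.sum(dim=2)                        # pool tokens into one frame vector
        global_out = self.global_tf(frame_repr)            # (B, T, dim) coarse context
        # the local transformer refines each frame's tokens, conditioned on the
        # global state prepended as an extra slot
        local_in = torch.cat([global_out.unsqueeze(2), emb], dim=2)   # (B, T, Q+1, dim)
        local_out = self.local_tf(local_in.view(B * T, Q + 1, -1))[:, 1:]
        return self.head(local_out).reshape(B, T, Q, -1)   # per-token logits

model = MultiscaleTransformer()
logits = model(torch.randint(0, 1024, (2, 50, 8)))
print(logits.shape)   # (2, 50, 8, 1024)
```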

ATP-TTS: Adaptive Thresholding Pseudo-Labeling for Low-Resource Multi-Speaker Text-to-Speech
Text-to-Speech is hard to apply in low-resource scenarios
ATP-TTS
- Selects suitable pseudo-labels through Adaptive Thresholding
- Then predicts latent representations with an Automatic Speech Recognition model enhanced by contrastive-learning perturbation
Paper (ICASSP 2025): Paper Link
1. Introduction
Supervised models such as Glow-TTS, VITS, and NaturalSpeech..
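A minimal sketch of adaptive-threshold pseudo-label selection: unlabeled predictions are kept only when their confidence exceeds a threshold that adapts with the model's running confidence instead of being fixed in advance. The EMA update rule and base threshold here are illustrative assumptions, not ATP-TTS's exact formulation.

```python
import torch

class AdaptiveThreshold:
    def __init__(self, base=0.9, momentum=0.99):
        self.base = base
        self.momentum = momentum
        self.running_conf = 0.5          # running mean of model confidence

    def update(self, confidences: torch.Tensor) -> float:
        # track how confident the model currently is on unlabeled data
        self.running_conf = (self.momentum * self.running_conf
                             + (1 - self.momentum) * confidences.mean().item())
        # loose threshold early in training, tighter as confidence grows
        return self.base * self.running_conf

def select_pseudo_labels(logits: torch.Tensor, thresholder: AdaptiveThreshold):
    """Return (labels, mask) keeping only sufficiently confident predictions."""
    probs = logits.softmax(dim=-1)
    conf, labels = probs.max(dim=-1)
    threshold = thresholder.update(conf)
    mask = conf >= threshold
    return labels, mask

thresholder = AdaptiveThreshold()
labels, mask = select_pseudo_labels(torch.randn(16, 40), thresholder)
print(labels[mask])                      # pseudo-labels that pass the threshold
```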

SSR-Speech: Towards Stable, Safe and Robust Zero-Shot Text-based Speech Editing and Synthesis
A stable, safe, and robust zero-shot text-to-speech model is needed
SSR-Speech
- Incorporates classifier-free guidance on top of a Transformer decoder
- Embeds a frame-level watermark into the edited region through Watermark EnCodec
Paper (ICASSP 2025): Paper Link
1. Introduction
Zero-shot text-based speech generation models such as YourTTS..
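A minimal sketch of classifier-free guidance on an autoregressive token decoder, the mechanism SSR-Speech incorporates: at inference the model is run with and without the conditioning prompt, and the two logit streams are mixed with a guidance weight. The tiny GRU decoder below stands in for the Transformer decoder, and the null-condition convention and guidance scale are assumptions.

```python
import torch
import torch.nn as nn

class TinyDecoder(nn.Module):
    def __init__(self, vocab=1024, dim=256):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim)
        self.cond_proj = nn.Linear(dim, dim)
        self.rnn = nn.GRU(dim, dim, batch_first=True)    # stand-in for a Transformer decoder
        self.head = nn.Linear(dim, vocab)

    def forward(self, tokens, cond):
        # tokens: (B, T) codec tokens generated so far, cond: (B, dim) prompt embedding
        x = self.emb(tokens) + self.cond_proj(cond).unsqueeze(1)
        out, _ = self.rnn(x)
        return self.head(out[:, -1])                     # next-token logits

@torch.no_grad()
def guided_next_token(model, tokens, cond, guidance_scale=2.0):
    null_cond = torch.zeros_like(cond)                   # "condition dropped" case
    logits_cond = model(tokens, cond)
    logits_uncond = model(tokens, null_cond)
    # classifier-free guidance: push logits toward the conditional prediction
    logits = logits_uncond + guidance_scale * (logits_cond - logits_uncond)
    return logits.softmax(-1).multinomial(1)

model = TinyDecoder()
tokens = torch.randint(0, 1024, (2, 10))
cond = torch.randn(2, 256)
print(guided_next_token(model, tokens, cond).shape)      # (2, 1) sampled next tokens
```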