
DecoupledSynth: Enhancing Zero-Shot Text-to-Speech via Factors Decoupling
- Existing zero-shot text-to-speech models struggle to balance the linguistic, para-linguistic, and non-linguistic information carried in their intermediate representations
- DecoupledSynth
  - Combines multiple self-supervised models to extract a comprehensive, decoupled representation
  - Uses decoupled processing stages to support nuanced synthesis
- Paper (ICASSP 2025): Paper Link

1. I..

ProsodyFM: Unsupervised Phrasing and Intonation Control for Intelligible Speech Synthesis
- Existing text-to-speech models fall short in phrasing and intonation
- ProsodyFM
  - Adopts a Flow Matching backbone and introduces a phrase break encoder, duration predictor, and terminal intonation encoder to improve phrasing and intonation on the prosody side
  - Trained without explicit prosodic labels, so it can uncover a broad spectrum of break durations and intonation patterns..
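The Flow Matching backbone mentioned for ProsodyFM can be illustrated with the standard conditional flow matching training objective: interpolate linearly between a noise sample and a data sample, and regress a velocity network onto the constant path velocity. The sketch below is a minimal toy version, assuming a linear probability path and using a stand-in zero-velocity "network"; the real backbone, conditioning, and feature space are not modeled here.

```python
import numpy as np

rng = np.random.default_rng(0)

def flow_matching_loss(x0, x1, t, velocity_fn):
    """Conditional flow matching loss on the linear path from noise x0 to data x1.

    Along x_t = (1 - t) * x0 + t * x1 the target velocity is the constant
    x1 - x0; the network is regressed onto it with an L2 loss.
    """
    x_t = (1.0 - t)[:, None] * x0 + t[:, None] * x1
    target = x1 - x0
    pred = velocity_fn(x_t, t)
    return float(np.mean((pred - target) ** 2))

# Stand-in "network" (assumption): always predicts zero velocity.
zero_velocity = lambda x_t, t: np.zeros_like(x_t)

x1 = rng.normal(size=(4, 8))   # batch of "data" frames (e.g. acoustic features)
x0 = rng.normal(size=(4, 8))   # matched noise samples
t = rng.uniform(size=4)        # one random time per example
loss = flow_matching_loss(x0, x1, t, zero_velocity)
print(loss > 0.0)              # untrained predictor incurs a nonzero loss
```

At inference, one would integrate the learned velocity field from t=0 to t=1 with an ODE solver to turn noise into features.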

FACTSpeech: Speaking a Foreign Language Pronunciation Using Only Your Native Characters
- Most text-to-speech models do not account for transliterated text
- FACTSpeech
  - Introduces a language shift embedding that switches the pronunciation of the input text between its native and literal readings
  - Applies conditional instance normalization to improve pronunciation while preserving speaker identity
- Paper (INTERSPEECH 2023): Paper Link

1. Introduction: Text-to-Speec..
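Conditional instance normalization, as used by FACTSpeech to preserve speaker identity, normalizes each channel over time and then rescales it with condition-dependent gain and bias. Below is a minimal sketch; the gamma/beta values are assumed to come from a speaker- or language-conditioned projection, which is not shown.

```python
import numpy as np

def conditional_instance_norm(x, gamma, beta, eps=1e-5):
    """Instance-normalize x over the time axis, then apply condition-dependent
    scale (gamma) and shift (beta).

    x: (batch, channels, time); gamma, beta: (batch, channels)
    """
    mean = x.mean(axis=2, keepdims=True)
    std = np.sqrt(x.var(axis=2, keepdims=True) + eps)
    return gamma[:, :, None] * (x - mean) / std + beta[:, :, None]

rng = np.random.default_rng(1)
x = rng.normal(size=(2, 3, 50))       # hypothetical hidden features
gamma = np.full((2, 3), 2.0)          # condition-predicted scale (assumption)
beta = np.full((2, 3), 0.5)           # condition-predicted shift (assumption)
y = conditional_instance_norm(x, gamma, beta)
# After normalization, the per-channel statistics follow the condition:
print(np.allclose(y.mean(axis=2), 0.5, atol=1e-6))
print(np.allclose(y.std(axis=2), 2.0, atol=1e-2))
```

Because the content features are whitened before the condition is injected, the identity-related statistics come entirely from the conditioning path.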

E1-TTS: Simple and Fast Non-Autoregressive TTS
- An efficient non-autoregressive zero-shot text-to-speech model is needed
- E1-TTS
  - Employs denoising diffusion pre-training and distribution matching distillation
  - Removes the explicit monotonic alignment between text and audio pairs
- Paper (ICASSP 2025): Paper Link

1. Introduction: Non-Autoregressive (NAR) Text-to-Speech (TTS) models generate speech from text in parallel, so compared to models that synthesize one unit at a time, Autoregres..

PITS: Variational Pitch Inference without Fundamental Frequency for End-to-End Pitch-Controllable TTS
- Pitch-controllable text-to-speech has relied on directly modeling the fundamental frequency
- PITS
  - An end-to-end model that models pitch through variational inference
  - Builds on VITS, incorporating a Yingram encoder, a Yingram decoder, and adversarial training
- Paper (ICML 2023): Paper Link

1. Introduction: Text-to-Speech (TTS), given a ..
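The Yingram representation that PITS models in place of an explicit F0 track is built on YIN's cumulative mean normalized difference function, whose dips mark candidate pitch periods. The sketch below computes that core function on one analysis frame; the midi-scale lag sampling that turns it into a Yingram, and the exact framing parameters, are omitted here and would follow the paper.

```python
import numpy as np

def cmnd(frame, max_lag):
    """YIN's cumulative mean normalized difference d'(tau).

    d'(tau) dips toward 0 at lags matching the pitch period; a Yingram is
    obtained by sampling this function on a midi-like lag scale (that
    sampling step is not shown).
    """
    n = len(frame)
    d = np.array([np.sum((frame[: n - tau] - frame[tau:]) ** 2)
                  for tau in range(max_lag)])
    dp = np.ones(max_lag)                      # d'(0) := 1 by convention
    running_mean = np.cumsum(d[1:]) / np.arange(1, max_lag)
    dp[1:] = d[1:] / np.maximum(running_mean, 1e-12)
    return dp

# A pure tone with a 100-sample period: d' should dip near lag 100.
t = np.arange(800)
dp = cmnd(np.sin(2 * np.pi * t / 100.0), max_lag=200)
print(dp[100] < 0.01, dp[100] < dp[50])
```

Because the dip location (not a single scalar F0) is what encodes pitch, this representation stays well-defined even in regions where F0 extraction is unreliable, which is what lets PITS treat pitch variationally.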

LiveSpeech: Low-Latency Zero-Shot Text-to-Speech via Autoregressive Modeling of Audio Discrete Codes
- Neural audio codecs enable zero-shot text-to-speech, but such models are hard to use in low-latency scenarios
- LiveSpeech
  - Introduces an adaptive codebook loss that accounts for each frame's codebook contribution
  - Groups the codebooks and processes each group in parallel
- Paper (INTERSPEECH 2024): Paper Link

1. Introduction: Zero-shot text-to-speech models such as NaturalSpeech2..