DiffStyleTTS: Diffusion-based Hierarchical Prosody Modeling for Text-to-Speech with Diverse and Controllable Styles
- Rich, flexible prosodic variation requires solving the one-to-many mapping problem from text to prosody.
- DiffStyleTTS leverages a conditional diffusion module together with classifier-free guidance.
- It hierarchically models speech prosodic features and controls diverse prosodic styles.
- Paper (Coling 2025): Paper Link

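As a reminder of how classifier-free guidance steers a conditional diffusion model, here is a minimal numerical sketch (the function name and toy vectors are illustrative, not taken from the DiffStyleTTS paper): the guided prediction extrapolates from the unconditional output toward the conditional one by a guidance scale `w`.

```python
import numpy as np

def cfg_combine(eps_uncond, eps_cond, w):
    """Classifier-free guidance: extrapolate from the unconditional
    prediction toward the conditional one by guidance scale w."""
    return eps_uncond + w * (eps_cond - eps_uncond)

# w = 1 recovers the purely conditional prediction; w > 1 amplifies
# the conditioning signal (here, toy 2-D "noise predictions").
e_u = np.array([0.0, 1.0])
e_c = np.array([1.0, 3.0])
print(cfg_combine(e_u, e_c, 1.0))  # -> [1. 3.]
print(cfg_combine(e_u, e_c, 2.0))  # -> [2. 5.]
```

Training randomly drops the condition so a single network provides both predictions at inference.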
Language-Codec: Bridging Discrete Codec Representations and Speech Language Models
- Discrete acoustic codecs serve as intermediate representations in speech language models.
- Language-Codec introduces Masked Channel Residual Vector Quantization to resolve the excessive-information problem of the initial codebook.
- It additionally applies a Fourier transform structure, attention blocks, and a refined discriminator.
- Paper (ACL 2025): Paper Link

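Masked Channel RVQ builds on plain residual vector quantization, where each stage quantizes the residual left over by the previous stage. A minimal sketch of that base mechanism (function name and toy codebooks are hypothetical; the masked-channel modification itself is not shown):

```python
import numpy as np

def rvq_encode(x, codebooks):
    """Residual vector quantization: each stage picks the codeword
    nearest to the current residual, then subtracts it."""
    residual = x.astype(float).copy()
    indices, quantized = [], np.zeros_like(residual)
    for cb in codebooks:  # cb shape: (K, D)
        k = int(np.argmin(np.linalg.norm(cb - residual, axis=1)))
        indices.append(k)
        quantized += cb[k]
        residual -= cb[k]
    return indices, quantized

# Two tiny 2-entry codebooks reconstruct x = [1, 1] in two stages.
cbs = [np.array([[0., 0.], [1., 0.]]),
       np.array([[0., 0.], [0., 1.]])]
idx, q = rvq_encode(np.array([1., 1.]), cbs)
print(idx, q)  # -> [1, 1] [1. 1.]
```

Because early stages absorb the coarsest information, the first codebook tends to carry a disproportionate load, which is the imbalance the masked-channel variant targets.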
SimpleSpeech2: Towards Simple and Efficient Text-to-Speech with Flow-based Scalar Latent Transformer Diffusion Models
- Non-autoregressive text-to-speech models incur complexity from duration alignment.
- SimpleSpeech2 combines autoregressive and non-autoregressive approaches into a straightforward model.
- It supports simplified data preparation, fast inference, and stable generation.
- Paper (TASLP 2025): Paper Link

Efficient Speech Language Modeling via Energy Distance in Continuous Latent Space
- Speech language models are limited by discretization.
- SLED encodes the speech waveform into a sequence of continuous latent representations.
- It performs autoregressive modeling with an energy distance objective.
- Paper (NeurIPS 2025): Paper Link

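The energy distance underlying SLED's objective can be estimated directly from samples. A minimal sketch of the statistic itself (names and toy data are illustrative; the paper's actual objective applies this idea to model samples versus data in latent space):

```python
import numpy as np

def energy_distance(x, y):
    """Sample-based energy distance between two sets of vectors:
    2*E||X-Y|| - E||X-X'|| - E||Y-Y'||.
    Zero (in the population limit) iff the distributions match."""
    def mean_pdist(a, b):
        return np.mean(np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1))
    return 2 * mean_pdist(x, y) - mean_pdist(x, x) - mean_pdist(y, y)

rng = np.random.default_rng(0)
a = rng.normal(size=(256, 2))
b = rng.normal(size=(256, 2))           # same distribution -> small value
c = rng.normal(loc=3.0, size=(256, 2))  # shifted distribution -> large value
print(energy_distance(a, b) < energy_distance(a, c))  # -> True
```

Unlike a likelihood, this objective needs only samples, which is what makes it attractive for continuous-latent autoregressive modeling.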
Koel-TTS: Enhancing LLM based Speech Generation with Preference Alignment and Classifier Free Guidance
- Autoregressive speech token generation models suffer from hallucinations and undesired vocalizations.
- Koel-TTS improves the language model's contextual adherence through preference alignment and classifier-free guidance.
- In particular, it ranks model outputs using automatic metrics derived from a speech recognition model.

SemantiCodec: An Ultra Low Bitrate Semantic Audio Codec for General Sound
- Most neural codecs operate at high bitrates and target narrow domains.
- SemantiCodec compresses diverse domains (speech, general sound, music) to fewer than 100 tokens per second.
- It uses a dual-encoder architecture: a self-supervised pre-trained Audio Masked AutoEncoder discretized via $k$-means clustering, plus an acoustic encoder.
- Paper (JSTSP 2024): Paper Link

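The semantic branch's discretization is ordinary nearest-centroid assignment over pre-trained features. A minimal sketch assuming the centroids come from an offline k-means run (all names and toy values are hypothetical):

```python
import numpy as np

def kmeans_tokenize(features, centroids):
    """Discretize continuous (e.g. AudioMAE) frame features into semantic
    tokens by assigning each frame to its nearest k-means centroid."""
    d = np.linalg.norm(features[:, None, :] - centroids[None, :, :], axis=-1)
    return np.argmin(d, axis=1)  # one token id per frame

# Toy example: 2 centroids, 3 feature frames.
centroids = np.array([[0., 0.], [10., 10.]])
frames = np.array([[0.5, -0.2], [9.0, 11.0], [0.1, 0.3]])
print(kmeans_tokenize(frames, centroids))  # -> [0 1 0]
```

The resulting token ids capture coarse semantic content; the acoustic encoder then supplies the detail the centroids discard.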
Metis: A Foundation Speech Generation Model with Masked Generative Pre-training
- Masked generative modeling can yield a speech foundation model that is fine-tuned for diverse speech generation tasks.
- Metis uses two discrete speech representations: self-supervised learning (SSL) tokens and acoustic tokens.
- It performs masked generative pre-training on 300K hours of speech data without additional conditions.
- Paper (NeurIPS 2025): Paper Link

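Masked generative pre-training corrupts a token sequence by masking random positions and trains the model to recover them. A minimal sketch of just the masking step (function name, mask ratio, and mask id are illustrative):

```python
import numpy as np

def random_mask(tokens, mask_ratio, mask_id, rng):
    """Replace a random subset of token positions with a MASK id;
    the model is trained to predict the tokens at those positions."""
    tokens = np.asarray(tokens)
    n_mask = int(round(mask_ratio * tokens.size))
    pos = rng.choice(tokens.size, size=n_mask, replace=False)
    masked = tokens.copy()
    masked[pos] = mask_id
    return masked, pos

rng = np.random.default_rng(0)
masked, pos = random_mask(np.arange(10), mask_ratio=0.5, mask_id=-1, rng=rng)
print((masked == -1).sum())  # -> 5
```

Because the objective needs no labels or conditions, the same pre-trained model can later be fine-tuned with task-specific conditioning.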
Word-Level Emotional Expression Control in Zero-Shot Text-to-Speech Synthesis
- Most emotional text-to-speech systems struggle with word-level control.
- WeSCon is a self-training framework that controls emotion and speaking rate with a pre-trained zero-shot text-to-speech model.
- It introduces a transition-smoothing strategy and a dynamic speed control mechanism to guide word-level expressive synthesis.
- At inference time, it adopts a dynamic emotional attention bias mechanism.

TaDiCodec: Text-aware Diffusion Speech Tokenizer for Speech Language Modeling
- Existing speech tokenizers suffer from high frame rates, dependence on auxiliary pre-trained models, and complex training processes.
- TaDiCodec uses a diffusion autoencoder to perform end-to-end optimization of quantization and reconstruction.
- It integrates text guidance into the diffusion decoder to achieve optimal compression.
- Paper (NeurIPS 2025): Paper Link
