DiffStyleTTS: Diffusion-based Hierarchical Prosody Modeling for Text-to-Speech with Diverse and Controllable Styles
Rich, flexible prosodic variation requires solving the one-to-many mapping problem of text-to-prosody.
DiffStyleTTS uses a conditional diffusion module and classifier-free guidance to model speech prosodic features hierarchically and to control diverse prosodic styles.
Paper (COLING 2025): Paper Link
1. Introduction
Tex..

Language-Codec: Bridging Discrete Codec Representations and Speech Language Models
Discrete acoustic codecs are used as intermediate representations in speech language models.
Language-Codec introduces Masked Channel Residual Vector Quantization to address the excessive information carried by the initial codebook. It additionally applies a Fourier transform structure, attention blocks, and a refined discriminator.
Paper (ACL 2025): Paper Link
1. Introduction
VALL-E..

SimpleSpeech2: Towards Simple and Efficient Text-to-Speech with Flow-based Scalar Latent Transformer Diffusion Models
Non-autoregressive Text-to-Speech models carry added complexity from duration alignment.
SimpleSpeech2 combines autoregressive and non-autoregressive approaches into a straightforward model that supports simplified data preparation, fast inference, and stable generation.
Paper (TASLP 2025): Paper Link
1. Introduction..

Efficient Speech Language Modeling via Energy Distance in Continuous Latent Space
Speech language models are limited by discretization.
SLED encodes the speech waveform into a sequence of continuous latent representations and performs autoregressive modeling with an energy distance objective.
Paper (NeurIPS 2025): Paper Link
1. Introduction
Speech audio, as a lengthy sequence of sampling points whose values lie in an integer/floating-point range, is re..

Koel-TTS: Enhancing LLM based Speech Generation with Preference Alignment and Classifier Free Guidance
Autoregressive speech token generation models suffer from hallucination and undesired vocalizations.
Koel-TTS uses preference alignment and Classifier Free Guidance to improve the language model's contextual adherence.
In particular, it ranks model outputs using automatic metrics derived from speech recognition models, and conditional, uncondi..

SemantiCodec: An Ultra Low Bitrate Semantic Audio Codec for General Sound
Most neural codecs operate at high bitrates and cover only a narrow domain.
SemantiCodec compresses diverse domains such as speech, general sound, and music into at most 100 tokens per second.
It uses a dual-encoder architecture consisting of a Self-Supervised Pre-Trained Audio Masked AutoEncoder, discretized via $k$-means clustering, and an acoustic encoder.
Paper (JSTSP 2024): Paper Link
1. Intro..
