SiTok: Scaling Speech Tokenizers with Diffusion AutoEncoders
Speech tokenizers face a trade-off between semantic and acoustic encoding and are limited in how well they exploit low bitrates. SiTok jointly learns semantic-rich representations through supervision and supports high-fidelity audio reconstruction through diffusion. The model is additionally scaled to 1.6B parameters and trained on a 2M-hour speech dataset. Paper (ICLR 2026): Paper Link. 1. Introduction: Existing speech tokenizers e..
MambaVoiceCloning: Efficient and Expressive Text-to-Speech via State-Space Modeling and Diffusion Control
State-Space Models can be introduced into diffusion-based Text-to-Speech. MambaVoiceCloning combines a gated bidirectional Mamba text encoder, a temporal Bi-Mamba, and an expressive Mamba to provide linear-time $\mathcal{O}(T)$ conditioning. At inference, under a fixed mel-diffusion-vocoder backbone, attention-based duration, style modu..
Gogo: Group-Wise Granularity-Ordered Codec for Stable and Efficient Speech Generation
Recent speech language models require both high-level cues for autoregressive modeling and acoustic detail for perceptual quality. Gogo introduces group-wise granularity ordering, which quantizes each frame group from coarse to fine. It additionally exploits this granularity-ordering property to build GogoSpeech, a two-stage speech language model. Paper (ICLR 2026): Paper Link..
VibeVoice: Expressive Podcast Generation with Next-Token Diffusion
To generate long-form, multi-speaker conversational audio such as podcasts, a Text-to-Speech system must ensure scalability, speaker consistency, and natural turn-taking. VibeVoice improves long-sequence efficiency by using a continuous speech tokenizer with an ultra-low frame rate of 7.5 Hz. It additionally supports expressive podcast generation through a next-token diffusion framework. Paper ..
FlexiVoice: Enabling Flexible Style Control in Zero-Shot TTS with Natural Language Instructions
Zero-shot Text-to-Speech should support flexible style control. FlexiVoice achieves accurate, flexible style control through Progressive Post-Training, applying in particular Direct Preference Optimization and multi-objective Group Relative Policy Optimization. Paper (ICLR 2026): Paper Link. 1. Introduction: Zero-shot Text-to-Speech (TTS) Cos..
FlexiCodec: A Dynamic Neural Audio Codec for Low Frame Rates
Existing neural audio codecs suffer semantic information loss at low frame rates. FlexiCodec improves semantic preservation by using a dynamic frame rate, and introduces ASR feature-assisted dual-stream encoding and a Transformer bottleneck. Paper (ICLR 2026): Paper Link. 1. Introduction: A neural audio codec compresses raw speech into compact discrete tokens. In particular, most neural audio codecs enc..
