VibeVoice: Expressive Podcast Generation with Next-Token Diffusion
Generating long-form, multi-speaker conversational audio such as podcasts requires a Text-to-Speech system that ensures scalability, speaker consistency, and natural turn-taking.
VibeVoice:
Uses a continuous speech tokenizer with an ultra-low frame rate of 7.5 Hz to improve efficiency on long sequences.
Additionally supports expressive podcast generation through a next-token diffusion framework.
Paper ..

FlexiVoice: Enabling Flexible Style Control in Zero-Shot TTS with Natural Language Instructions
Zero-shot Text-to-Speech should support flexible style control.
FlexiVoice:
Supports accurate, flexible style control through Progressive Post-Training.
In particular, applies Direct Preference Optimization and multi-objective Group Relative Policy Optimization.
Paper (ICLR 2026): Paper Link
1. Introduction
For zero-shot Text-to-Speech (TTS), Cos..

FlexiCodec: A Dynamic Neural Audio Codec for Low Frame Rates
Existing neural audio codecs lose semantic information at low frame rates.
FlexiCodec:
Uses a dynamic frame rate to improve semantic preservation.
Introduces ASR feature-assisted dual-stream encoding and a Transformer bottleneck.
Paper (ICLR 2026): Paper Link
1. Introduction
Neural audio codecs compress raw speech into compact discrete tokens. In particular, most neural audio codecs enc..

ComVo: Toward Complex-Valued Neural Networks for Waveform Generation
iSTFT-based vocoders struggle to capture the inherent structure of complex spectrograms.
ComVo:
Uses native complex arithmetic in both the generator and the discriminator to provide structured feedback on complex-valued representations.
Introduces phase quantization to discretize phase values and regularize the training process.
Additionally, through block-matrix computation, training efficienc..

VoxCPM: Hierarchical Semantic-Acoustic Modeling via Semi-Discrete Residual Representations for Expressive End-to-End Speech Synthesis
Multi-stage speech synthesis built on speech tokenizers suffers from a trade-off caused by the semantic-acoustic divide.
VoxCPM:
Applies hierarchical semantic-acoustic modeling based on semi-discrete residual representations.
Additionally introduces a differentiable quantization bottleneck to encourage natural specialization.
Paper (I..

DMOSpeech2: Reinforcement Learning for Duration Prediction in Metric-Optimized Speech Synthesis
Optimizing the components of diffusion-based Text-to-Speech for perceptual metrics is difficult.
DMOSpeech2:
Applies Group Relative Preference Optimization with speaker similarity and Word Error Rate as rewards.
Additionally improves output diversity through teacher-guided sampling.
Paper (AAAI 2026): Paper Link
1. Introduction
NaturalSpeech, StyleTTS2, ..
