MeanVoiceFlow: One-Step Nonparallel Voice Conversion with Mean Flows
- In voice conversion, flow-matching models are limited by their iterative inference.
- MeanVoiceFlow supports one-step non-parallel conversion based on Mean Flow, without pre-training or distillation.
- It additionally introduces a structural margin reconstruction loss and a zero-input constraint to regularize the model's input-output behavior.
- Paper (ICASSP 2026): Paper Link
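The one-step idea can be sketched numerically: a mean-flow network predicts the average velocity over an interval, so the whole trajectory from noise to data can be integrated in a single jump. Below is a minimal toy sketch, assuming a straight interpolation path and an analytic stand-in for the trained network (`mean_velocity` is illustrative, not the paper's model).

```python
import numpy as np

# Toy stand-in for a trained mean-flow network u(z, r, t): it predicts the
# AVERAGE velocity over the interval [r, t], not the instantaneous one.
# For the straight path z_t = (1 - t) * x + t * eps, the instantaneous
# velocity is eps - x (constant in t), so the average velocity is eps - x too.
def mean_velocity(z, r, t, x, eps):
    return eps - x  # assumption: illustrative closed form, not a real model

rng = np.random.default_rng(0)
x = rng.normal(size=4)    # "data" sample
eps = rng.normal(size=4)  # Gaussian noise
z1 = eps                  # path endpoint at t = 1 is pure noise

# One-step inference: integrate the whole interval [0, 1] in a single jump,
# instead of many small ODE-solver steps as in standard flow matching.
x_hat = z1 - 1.0 * mean_velocity(z1, 0.0, 1.0, x, eps)

print(np.allclose(x_hat, x))  # True: the one-step jump recovers the sample
```

In the toy setup the jump is exact; in practice the network only approximates the average velocity, which is what makes one-step conversion a learning problem rather than a free lunch.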
CodecSlime: Temporal Redundancy Compression of Neural Speech Codec via Dynamic Frame Rate
- Most neural codecs operate at a fixed frame rate, so a temporal mismatch exists between the frame grid and the information content of speech.
- CodecSlime compresses temporal redundancy in neural codecs by leveraging a schedulable Dynamic Frame Rate.
- It introduces Melt-and-Cool training to improve adaptation.
- Paper (ICASSP 2026): Paper Link
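To make the temporal-redundancy intuition concrete, here is a minimal sketch of frame-rate reduction by merging adjacent frames whose features are nearly identical. The similarity threshold and the averaging rule are illustrative assumptions; CodecSlime's actual merging behavior is schedulable and learned, not this heuristic.

```python
import numpy as np

# Sketch: reduce the frame rate by merging adjacent feature frames that are
# almost identical (cosine distance below a threshold). "thresh" and the
# mean-pooling merge rule are illustrative placeholders.
def merge_redundant(frames, thresh=0.05):
    merged = [frames[0]]
    for f in frames[1:]:
        prev = merged[-1]
        dist = 1 - np.dot(prev, f) / (np.linalg.norm(prev) * np.linalg.norm(f))
        if dist < thresh:
            merged[-1] = (prev + f) / 2   # collapse the redundant pair
        else:
            merged.append(f)
    return np.stack(merged)

# Two near-duplicate frames followed by a distinct one: 3 frames become 2.
frames = np.array([[1.0, 0.0], [1.0, 0.01], [0.0, 1.0]])
out = merge_redundant(frames)
print(out.shape)  # (2, 2)
```

Steady vowels produce long runs of near-duplicate frames, so this kind of merging is where a dynamic frame rate saves bitrate relative to a fixed one.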
DAIEN-TTS: Disentangled Audio Infilling for Environment-Aware Text-to-Speech Synthesis
- A model for environment-aware text-to-speech synthesis is needed.
- DAIEN-TTS uses a pre-trained Speech-Environment Separation module to extract mel-spectrograms of clean speech and environment audio, then applies a random span mask to each mel-spectrogram to support the infilling process.
- It applies Dual Classifier-Free Guidance to the speech and environment components for controllability …
WaveNeXt2: ConvNeXt-based Fast Neural Vocoders with Residual Denoising and Sub-Modeling for GAN and Diffusion Models
- Most ConvNeXt-based vocoders use only the Generative Adversarial Network framework.
- WaveNeXt2 introduces residual denoising and sub-modeling to progressively refine the waveform.
- It builds a ConvNeXt-based architecture compatible with both Generative Adversarial Networks and diffusion models.
- Paper (ICASSP 2026): Paper Link
MaskVCT: Masked Voice Codec Transformer for Zero-Shot Voice Conversion with Increased Controllability via Multiple Guidances
- Existing voice conversion models rely on a fixed conditioning scheme.
- MaskVCT leverages continuous/quantized linguistic features to improve intelligibility and speaker similarity, and adopts pitch contour for prosody control.
- In particular, it supports multi-factor control through multiple Classifier-Free Guidances.
- Paper (ICASSP 2026): …
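Multi-factor control via multiple classifier-free guidances typically combines one unconditional prediction with several conditional ones, each pushed by its own guidance scale. Below is a minimal sketch of that combination rule; the predictions and scales are made-up placeholders, not values from the MaskVCT paper.

```python
import numpy as np

# Multiple classifier-free guidance: start from the unconditional prediction
# and add a separately weighted correction toward each condition.
def multi_cfg(eps_uncond, eps_conds, scales):
    out = eps_uncond.copy()
    for eps_c, w in zip(eps_conds, scales):
        out += w * (eps_c - eps_uncond)  # push toward this condition by w
    return out

eps_uncond = np.zeros(3)
eps_speaker = np.array([1.0, 0.0, 0.0])  # prediction conditioned on speaker
eps_pitch = np.array([0.0, 2.0, 0.0])    # prediction conditioned on pitch contour

guided = multi_cfg(eps_uncond, [eps_speaker, eps_pitch], scales=[2.0, 0.5])
print(guided)  # [2. 1. 0.]
```

Because each factor carries its own scale, the strength of speaker similarity versus prosody adherence can be traded off independently at inference time, which is the controllability the entry refers to.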
RRPO: Robust Reward Policy Optimization for LLM-based Emotional TTS
- In nuanced tasks such as emotion control, existing reward optimization methods suffer from reward hacking.
- RRPO uses hybrid regularization so that the reward signal stays reliably aligned.
- In particular, it encourages the policy to abandon detrimental shortcuts and learn the complex features of emotion.
- Paper (ICASSP 2026): Paper Link
