Differentiable Reward Optimization for LLM based TTS System
The performance of neural codec language model-based Text-to-Speech systems can be improved.
DiffRO: Computes the reward directly from neural codec tokens and uses Gumbel-Softmax to make the reward function differentiable. Additionally, a Multi-Task Reward model is introduced to provide feedback from multiple perspectives.
Paper (INTERSPEECH 2025): Paper Link
1. Introduction
Neural codec token Language Modeling ..
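As a rough illustration of the Gumbel-Softmax idea mentioned in the DiffRO summary, the sketch below relaxes a discrete codec-token choice into a soft probability vector and scores it with a reward. It is a minimal sketch under stated assumptions, not the paper's implementation: the function names, the 4-token vocabulary, and the per-token reward table are all hypothetical stand-ins (the summary describes a learned reward model, which is not reproduced here).

```python
import math
import random


def gumbel_softmax(logits, tau=1.0, rng=random):
    """Return a relaxed (soft) one-hot sample over the token vocabulary."""
    noisy = []
    for logit in logits:
        # Gumbel(0, 1) noise: g = -log(-log(u)), u ~ Uniform(0, 1).
        # Clamp u away from 0 and 1 so both logarithms stay finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        g = -math.log(-math.log(u))
        noisy.append((logit + g) / tau)
    # Numerically stable softmax over the perturbed, temperature-scaled logits.
    m = max(noisy)
    exps = [math.exp(x - m) for x in noisy]
    total = sum(exps)
    return [e / total for e in exps]


def expected_reward(soft_token, per_token_reward):
    """Hypothetical reward: a weighted sum over the soft token distribution.

    Because the output is a smooth function of the logits, an autodiff
    framework could backpropagate through it, unlike a hard argmax.
    """
    return sum(p * r for p, r in zip(soft_token, per_token_reward))


random.seed(0)
logits = [2.0, 0.5, -1.0, 0.1]       # assumed logits for a 4-token vocabulary
soft = gumbel_softmax(logits, tau=0.5)
reward = expected_reward(soft, [1.0, 0.2, 0.0, 0.5])  # assumed reward table
```

Lower values of `tau` push the soft sample closer to a one-hot vector, while higher values make it smoother; this plain-Python version only shows the forward computation and does not compute gradients itself.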
ParaNoise-SV: Integrated Approach for Noise-Robust Speaker Verification with Parallel Joint Learning of Speech Enhancement and Noise Extraction
Existing speaker verification models are limited in noise robustness.
ParaNoise-SV: Uses a dual U-Net that combines a Noise Extraction network and a Speech Enhancement network. The Noise Extraction U-Net explicitly models noise, while the Speech Enhancement U-Net, through parallel connections, ..
ZSDEVC: Zero-Shot Diffusion-based Emotional Voice Conversion with Disentangled Mechanism
Emotional Voice Conversion suffers from problems with emotion accuracy and speech distortion.
ZSDEVC: Uses a diffusion framework with a disentangled mechanism and expressive guidance. The model is trained on a large emotional speech dataset.
Paper (INTERSPEECH 2025): Paper Link
1. Introduction
For Emotional Voice Conversion (EVC), linguistic content, speaker id..
LSPNet: An Ultra-Low Bitrate Hybrid Neural Codec
A neural codec that can operate even at ultra-low bitrates is needed.
LSPNet: Builds on the LPCNet framework and combines it with a parametric encoder to incorporate Line Spectral Pairs. Additionally, a Joint Time-Frequency training strategy using an STFT loss and a Cross-Entropy loss is applied.
Paper (INTERSPEECH 2025): Paper Link
1. Introduction
In ultra-low bitrate speech coding at 1.2 kbps, intelligible, natural-sounding speec..
EmotionRankCLAP: Bridging Natural Language Speaking Styles and Ordinal Speech Emotion via Rank-N-Contrast
Contrastive Language Audio Pre-training fails to capture the ordinal nature of emotion and shows insufficient alignment between audio and text embeddings.
EmotionRankCLAP: Uses the dimensional attributes of emotional speech and natural language prompts to jointly capture fine-grained emotion variation. The Rank-N-Contrast objective ..
ControlSpeech: Towards Simultaneous and Independent Zero-Shot Speaker Cloning and Zero-Shot Language Style Control
A Text-to-Speech model that supports speaking style control and adjustment is needed.
ControlSpeech: Takes a speech prompt, a content prompt, and a style prompt as input and captures codec representations through bidirectional attention and mask-based parallel decoding. Through the Style Mixture Semantic Density module, textual style control ..
