CLAP: Learning Audio Concepts from Natural Language Supervision
Audio models trained under restricted supervision have limited flexibility.
CLAP
- Learns audio concepts through natural language supervision
- Uses two encoders and contrastive learning to model audio and text descriptions in a joint multimodal space
Paper (ICASSP 2023): Paper Link
1. Introduction: Most audio models are limited to a specific task's pre-defined categories and audio recording…
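A minimal sketch of the two-encoder contrastive objective described above, assuming CLIP-style symmetric InfoNCE with a fixed temperature (CLAP itself learns the temperature); names and sizes are illustrative, not the paper's exact implementation:

```python
import torch
import torch.nn.functional as F

def clap_contrastive_loss(audio_emb, text_emb, temperature=0.07):
    """Symmetric contrastive (InfoNCE) loss over a batch of paired
    audio/text embeddings already projected into the joint space."""
    # L2-normalize so the dot product is a cosine similarity
    audio_emb = F.normalize(audio_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # (batch, batch) similarity matrix; the diagonal holds true pairs
    logits = audio_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Cross-entropy in both directions: audio->text and text->audio
    loss_a2t = F.cross_entropy(logits, targets)
    loss_t2a = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_a2t + loss_t2a)

# Toy usage: random embeddings standing in for the two encoders' outputs
audio = torch.randn(8, 512)
text = torch.randn(8, 512)
print(clap_contrastive_loss(audio, text))
```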
CosyVoice3: Towards In-the-Wild Speech Generation via Scaling-up and Post-Training
The earlier CosyVoice2 is limited in terms of language coverage, domain diversity, and data volume.
CosyVoice3
- Introduces a speech tokenizer based on supervised multi-task training
- Applies post-training based on a differentiable reward model
- Supports diverse domains and text formats by scaling up data size and model size
Paper (Alibaba 2025): Paper Link
1. Introduction: Zero-shot Text-to-Speech…
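One way to picture post-training against a differentiable reward: relax the discrete speech-token sampling (e.g. with Gumbel-softmax) so a reward model's score can be backpropagated into the language model. This is a hedged sketch under that assumption; the relaxation choice and all names are illustrative, not CosyVoice3's exact recipe:

```python
import torch
import torch.nn.functional as F

def reward_post_training_step(lm_logits, token_embed, reward_model):
    """One hypothetical post-training step.

    lm_logits:    (batch, seq, vocab) speech-token logits from the LM
    token_embed:  (vocab, dim) embedding table the reward model consumes
    reward_model: differentiable module mapping (batch, seq, dim) to a
                  per-utterance reward
    """
    # hard=True keeps discrete forward values while gradients flow
    # through the soft relaxation (straight-through).
    soft_tokens = F.gumbel_softmax(lm_logits, tau=1.0, hard=True)
    # Differentiable "embedding lookup" via matrix product
    token_feats = soft_tokens @ token_embed
    reward = reward_model(token_feats).mean()
    return -reward  # maximize reward => minimize its negation

# Toy usage with a trivial stand-in scorer
lm_logits = torch.randn(2, 20, 4096, requires_grad=True)
token_embed = torch.randn(4096, 128)
reward_model = lambda feats: feats.mean(dim=(1, 2))
reward_post_training_step(lm_logits, token_embed, reward_model).backward()
```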
CosyVoice2: Scalable Streaming Speech Synthesis with Large Language Models
The original CosyVoice can be further improved.
CosyVoice2
- Introduces finite-scalar quantization to improve the codebook utilization of speech tokens
- Streamlines the architecture so a pre-trained large language model can serve as the backbone, and supports streaming/non-streaming synthesis through a chunk-aware causal flow matching model
Paper (Alibaba 2024): Paper Link
1. Introduction: Zero-shot…
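A minimal sketch of finite-scalar quantization, which replaces a learned vector-quantization codebook with per-channel rounding onto a small bounded grid, so every implicit code is reachable by construction; the level choices below are illustrative:

```python
import torch

def fsq(z, levels=(8, 5, 5, 5)):
    """Finite scalar quantization with a straight-through estimator.

    z: (..., len(levels)) latent; channel i is rounded to levels[i]
       values, so the implicit codebook has prod(levels) codes.
    """
    L = torch.tensor(levels, dtype=z.dtype, device=z.device)
    half = (L - 1) / 2
    bounded = torch.tanh(z) * half     # per-channel range [-half, half]
    quantized = torch.round(bounded)
    # Forward uses the rounded value; backward sees identity through
    # `bounded` (straight-through).
    return bounded + (quantized - bounded).detach()

codes = fsq(torch.randn(2, 10, 4))  # e.g. (batch, frames, channels)
```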
ReFlow-VC: Zero-Shot Voice Conversion based on Rectified Flow and Speaker Feature Optimization
Diffusion-based voice conversion models require a large number of sampling steps.
ReFlow-VC
- Uses Rectified Flow to transport a Gaussian distribution to the true mel-spectrogram distribution along a direct path
- Additionally optimizes the speaker feature using content and pitch information
Paper (INTERSPEECH 2025): Paper Link
1. Introduction: Zero-Shot Voice Conversion…
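A hedged sketch of the rectified-flow objective: train a velocity field on straight lines between Gaussian noise and mel-spectrograms, then sample with a few Euler steps. `model`, `cond`, and the shapes are placeholders, not ReFlow-VC's exact network:

```python
import torch

def rectified_flow_loss(model, x1, cond):
    """Draw a point on the straight line between noise x0 and data x1
    (a mel-spectrogram) and regress the constant velocity x1 - x0."""
    x0 = torch.randn_like(x1)                     # Gaussian source
    t = torch.rand(x1.size(0), device=x1.device)  # uniform times
    t_ = t.view(-1, *([1] * (x1.dim() - 1)))
    xt = (1 - t_) * x0 + t_ * x1                  # straight path
    v_pred = model(xt, t, cond)
    return ((v_pred - (x1 - x0)) ** 2).mean()

def sample(model, shape, cond, steps=10):
    """Euler integration of dx/dt = v(x, t) from noise (t=0) to data (t=1)."""
    x = torch.randn(shape)
    for i in range(steps):
        t = torch.full((shape[0],), i / steps)
        x = x + model(x, t, cond) / steps
    return x

# Toy usage with a trivial velocity-net stand-in
net = lambda x, t, cond: x * 0.0
print(rectified_flow_loss(net, torch.randn(4, 80, 100), cond=None))
```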
EE-TTS: Emphatic Expressive TTS with Linguistic Information
Existing text-to-speech models struggle to synthesize expressive speech.
EE-TTS
- Introduces an emphasis predictor that identifies appropriate emphasis positions in text
- Additionally uses a conditional acoustic model to synthesize expressive speech incorporating emphasis and linguistic information
Paper (INTERSPEECH 2023): Paper Link
1. Introduction: Text-to-Speech (TTS) models are still limited in terms of expressiveness…
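A hedged sketch of what a text-side emphasis predictor can look like: a contextual encoder followed by a per-token binary head. The generic Transformer stand-in and all sizes are assumptions, not EE-TTS's exact predictor:

```python
import torch
import torch.nn as nn

class EmphasisPredictor(nn.Module):
    """Maps token ids to a per-token probability of being emphasized."""
    def __init__(self, vocab_size=30000, dim=256, heads=4, layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        enc_layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, layers)
        self.head = nn.Linear(dim, 1)  # binary: emphasized or not

    def forward(self, token_ids):
        h = self.encoder(self.embed(token_ids))          # (batch, seq, dim)
        return torch.sigmoid(self.head(h)).squeeze(-1)   # (batch, seq)

# Toy usage on random token ids
probs = EmphasisPredictor()(torch.randint(0, 30000, (2, 16)))
```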
EME-TTS: Unlocking the Emphasis and Emotion Link in Speech Synthesis
Emotional text-to-speech and emphasis-controllable speech synthesis can be integrated.
EME-TTS
- Uses weakly supervised learning based on emphasis pseudo-labels and variance-based emphasis features
- Additionally improves the interaction between emotion signals and emphasis positions through an Emphasis Perception Enhancement block
Paper (INTERSPEECH 2025): Paper Link
1. Introduction: Existing…
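One way to picture variance-based emphasis pseudo-labeling: score each word by how far its pitch/energy deviates from the utterance average and mark the most deviant words as emphasized. The scoring rule below is an assumption for illustration, not the paper's exact feature:

```python
import numpy as np

def emphasis_pseudo_labels(f0, energy, word_bounds, top_k=1):
    """Mark the top_k most prosodically deviant words as emphasized.

    f0, energy:  (frames,) frame-level pitch and energy contours
    word_bounds: list of (start_frame, end_frame) per word
    """
    # Z-score each contour so the two features are comparable
    zf0 = (f0 - f0.mean()) / (f0.std() + 1e-8)
    zen = (energy - energy.mean()) / (energy.std() + 1e-8)
    scores = [zf0[s:e].mean() + zen[s:e].mean() for s, e in word_bounds]
    order = np.argsort(scores)[::-1][:top_k]   # most deviant words first
    labels = np.zeros(len(word_bounds), dtype=int)
    labels[order] = 1
    return labels

# Toy usage: three words over a 100-frame utterance
f0 = np.abs(np.random.randn(100)); energy = np.abs(np.random.randn(100))
print(emphasis_pseudo_labels(f0, energy, [(0, 30), (30, 70), (70, 100)]))
```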
LinearVC: Linear Transformations of Self-Supervised Features through the Lens of Voice Conversion
A voice conversion method can be built on self-supervised representations.
LinearVC
- Converts voices with a simple linear transformation of self-supervised features
- Constrains the set of allowed transformations and explicitly factorizes content and speaker information via singular value decomposition
Paper (INTERSPEECH 20…
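The core of this approach reduces to ordinary least squares on time-aligned source/target features; a minimal sketch follows (the paper further constrains the allowed maps and factorizes content/speaker via SVD, which is omitted here):

```python
import torch

def fit_linear_map(src_feats, tgt_feats):
    """Fit a single linear map W sending source-speaker SSL features to
    target-speaker SSL features on aligned frame pairs.

    src_feats, tgt_feats: (frames, dim) aligned self-supervised features
    """
    # Solves min_W ||src @ W - tgt||^2 in closed form
    return torch.linalg.lstsq(src_feats, tgt_feats).solution

def convert(feats, W):
    return feats @ W  # apply the learned transformation frame-by-frame

# Toy usage: 1000 aligned frames of 768-dim features (HuBERT-like size)
src = torch.randn(1000, 768)
tgt = torch.randn(1000, 768)
W = fit_linear_map(src, tgt)
converted = convert(torch.randn(50, 768), W)
```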
DS-Codec: Dual-Stage Training with Mirror-to-NonMirror Architecture Switching for Speech Codec
A high-quality speech tokenizer is needed.
DS-Codec
- Introduces a dual-stage training framework that switches between mirror and non-mirror architectures
- Improves the robustness of the learned codebook through the mirrored architecture and balances training through the mirror-to-non-mirror structure
Paper (INTERSPEECH 2025): Paper Link
1. Introduction: Recently, VALL-E, AudioLM, AudioG…
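A hedged sketch of what architecture switching could look like: one encoder and a shared codebook, paired with either a mirrored (transposed) decoder or a separate non-mirror decoder depending on the training stage. All module shapes and the stage logic are illustrative assumptions, not DS-Codec's architecture:

```python
import torch
import torch.nn as nn

class DualStageCodecSketch(nn.Module):
    def __init__(self, dim=128, codebook_size=1024):
        super().__init__()
        self.encoder = nn.Conv1d(80, dim, 3, padding=1)
        self.codebook = nn.Embedding(codebook_size, dim)
        # Stage 1: decoder mirrors the encoder (transposed conv)
        self.mirror_dec = nn.ConvTranspose1d(dim, 80, 3, padding=1)
        # Stage 2: a structurally different, non-mirror decoder
        self.nonmirror_dec = nn.Sequential(
            nn.Conv1d(dim, dim, 3, padding=1), nn.GELU(),
            nn.Conv1d(dim, 80, 3, padding=1))

    def quantize(self, z):
        # Nearest-neighbor codebook lookup + straight-through gradient
        flat = z.transpose(1, 2).reshape(-1, z.size(1))           # (B*T, dim)
        d = torch.cdist(flat.unsqueeze(0),
                        self.codebook.weight.unsqueeze(0))[0]      # (B*T, K)
        q = self.codebook(d.argmin(-1))
        q = q.view(z.size(0), -1, z.size(1)).transpose(1, 2)       # (B, dim, T)
        return z + (q - z).detach()

    def forward(self, mel, stage):
        q = self.quantize(self.encoder(mel))
        dec = self.mirror_dec if stage == 1 else self.nonmirror_dec
        return dec(q)

# Toy usage: same codebook, decoder switched between stages
codec = DualStageCodecSketch()
mel = torch.randn(2, 80, 50)
stage1_out = codec(mel, stage=1)   # mirrored decoder
stage2_out = codec(mel, stage=2)   # non-mirror decoder
```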
RapFlow-TTS: Rapid and High-Fidelity Text-to-Speech with Improved Consistency Flow Matching
Ordinary differential equation (ODE)-based text-to-speech exhibits a trade-off between quality and inference speed.
RapFlow-TTS
- For consistent quality, enforces consistency of the velocity field along the flow-matching-straightened ODE trajectory
- Improves few-step synthesis quality through time interval scheduling and adversarial…
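A hedged sketch of a consistency flow-matching loss in the spirit described above: at two nearby times on the (assumed already-straightened) trajectory, the implied endpoints should agree, with the later-time branch detached as a teacher. The names and the time-interval handling are assumptions, not the paper's exact scheme:

```python
import torch

def consistency_fm_loss(model, x1, cond, delta=0.1):
    """Penalize disagreement between implied endpoints at times t and
    t + delta along the straight path from noise x0 to data x1."""
    x0 = torch.randn_like(x1)
    t = torch.rand(x1.size(0), device=x1.device) * (1 - delta)
    t_ = t.view(-1, *([1] * (x1.dim() - 1)))
    xt = (1 - t_) * x0 + t_ * x1
    xs = (1 - t_ - delta) * x0 + (t_ + delta) * x1

    # Implied endpoint from time t: x_t + (1 - t) * v(x_t, t)
    end_t = xt + (1 - t_) * model(xt, t, cond)
    with torch.no_grad():  # stop-gradient teacher branch
        end_s = xs + (1 - t_ - delta) * model(xs, t + delta, cond)
    return ((end_t - end_s) ** 2).mean()

# Toy usage with a trivial velocity-net stand-in
net = lambda x, t, cond: torch.zeros_like(x)
print(consistency_fm_loss(net, torch.randn(4, 80, 60), cond=None))
```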
