DS-Codec: Dual-Stage Training with Mirror-to-NonMirror Architecture Switching for Speech Codec
- A high-quality speech tokenizer is needed.
- DS-Codec introduces a dual-stage training framework that leverages Mirror-NonMirror architecture switching (a toy sketch follows below).
- The mirrored architecture improves the robustness of the learned codebook, while the Mirror-NonMirror structure balances training.
- Paper (INTERSPEECH 2025): Paper Link
1. Introduction
Recently, VALL-E, AudioLM, AudioG..
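The switching idea can be illustrated with a minimal PyTorch sketch, assuming a toy 1-D convolutional codec: one shared encoder feeds a vector-quantized bottleneck, and training alternates between a mirrored (transposed-copy) decoder and an asymmetric non-mirror decoder. The module sizes, the two-step switching schedule, and the plain reconstruction loss are illustrative assumptions, not the paper's configuration.

```python
# Minimal dual-stage sketch (assumed toy config, not the paper's).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyDSCodec(nn.Module):
    def __init__(self, dim=64, codebook_size=256):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv1d(1, dim, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv1d(dim, dim, 4, stride=2, padding=1),
        )
        # Stage 1: mirror decoder, a transposed copy of the encoder.
        self.mirror = nn.Sequential(
            nn.ConvTranspose1d(dim, dim, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose1d(dim, 1, 4, stride=2, padding=1),
        )
        # Stage 2: non-mirror decoder, deliberately asymmetric.
        self.nonmirror = nn.Sequential(
            nn.ConvTranspose1d(dim, dim, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv1d(dim, dim, 3, padding=1), nn.ReLU(),
            nn.ConvTranspose1d(dim, 1, 4, stride=2, padding=1),
        )
        self.codebook = nn.Embedding(codebook_size, dim)

    def forward(self, x, stage):
        z = self.encoder(x)
        # Nearest-neighbour VQ with a straight-through estimator
        # (codebook/commitment losses omitted for brevity).
        flat = z.transpose(1, 2)
        idx = torch.cdist(flat, self.codebook.weight.unsqueeze(0)).argmin(-1)
        zq = z + (self.codebook(idx).transpose(1, 2) - z).detach()
        decoder = self.mirror if stage == 1 else self.nonmirror
        return decoder(zq)

codec = ToyDSCodec()
opt = torch.optim.Adam(codec.parameters(), lr=1e-4)
wav = torch.randn(2, 1, 1024)
for step in range(4):
    stage = 1 if step < 2 else 2      # switch architectures mid-training
    loss = F.mse_loss(codec(wav, stage), wav)
    opt.zero_grad(); loss.backward(); opt.step()
```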
RapFlow-TTS: Rapid and High-Fidelity Text-to-Speech with Improved Consistency Flow Matching
- Ordinary Differential Equation-based Text-to-Speech exhibits a trade-off between quality and inference speed.
- For consistent quality, RapFlow-TTS enforces consistency of the velocity field along the flow-matching-straightened Ordinary Differential Equation trajectory (see the sketch below).
- To improve few-step synthesis quality, it applies time interval scheduling, adversa..
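A minimal sketch of the consistency idea on toy 2-D data, assuming straight (rectified-flow) trajectories x_t = (1 - t) x0 + t x1: besides the usual flow-matching regression, the velocity field is asked to agree at two nearby times on the same trajectory. The network, the fixed time interval dt, and the equal loss weights are assumptions for illustration.

```python
# Consistency flow matching on toy data (illustrative, not the paper's model).
import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(3, 64), nn.SiLU(), nn.Linear(64, 2))

def velocity(x, t):
    # Velocity field v(x_t, t), conditioned on time via concatenation.
    return net(torch.cat([x, t], dim=-1))

x1 = torch.randn(128, 2)                 # "data" samples
x0 = torch.randn(128, 2)                 # noise samples
dt = 0.05                                # fixed interval; scheduled in the paper
t = torch.rand(128, 1) * (1 - dt)

# Points on the straight trajectory x_t = (1 - t) x0 + t x1.
xt = (1 - t) * x0 + t * x1
xt_next = (1 - (t + dt)) * x0 + (t + dt) * x1

# Standard flow-matching regression to the straight-line velocity.
fm_loss = ((velocity(xt, t) - (x1 - x0)) ** 2).mean()

# Consistency term: velocities at nearby times on the same trajectory
# should agree (stop-gradient on the later point).
with torch.no_grad():
    v_ref = velocity(xt_next, t + dt)
cons_loss = ((velocity(xt, t) - v_ref) ** 2).mean()

(fm_loss + cons_loss).backward()
```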
SpeechT5: Unified-Modal Encoder-Decoder Pre-Training for Spoken Language Processing
- Encoder-decoder pre-training can be leveraged for self-supervised speech/text representation learning.
- SpeechT5 uses a shared encoder-decoder network together with six modal-specific pre/post-nets (a toy sketch follows below).
- The model is pre-trained on large-scale unlabeled speech-text data, and to align textual and speech information in a unified semantic space, cross-modal vec..
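A toy sketch of the shared-backbone layout: one encoder-decoder is reused across tasks, with modality-specific pre-nets mapping speech or text into the shared space and post-nets mapping back out. The feature sizes and the two-nets-per-direction simplification (the real model has six pre/post-nets) are assumptions.

```python
# Shared encoder-decoder with modal-specific pre/post-nets (toy sizes).
import torch
import torch.nn as nn

d = 64
backbone = nn.Transformer(d_model=d, nhead=4, num_encoder_layers=2,
                          num_decoder_layers=2, batch_first=True)

speech_prenet = nn.Linear(80, d)      # e.g. log-mel frames -> shared space
text_prenet = nn.Embedding(100, d)    # token ids -> shared space
speech_postnet = nn.Linear(d, 80)
text_postnet = nn.Linear(d, 100)

mel = torch.randn(2, 50, 80)
tokens = torch.randint(0, 100, (2, 12))

# Speech-to-text: speech pre-net feeds the encoder, text post-net reads out.
logits = text_postnet(backbone(speech_prenet(mel), text_prenet(tokens)))

# Text-to-speech reuses the very same backbone with the nets swapped.
mel_out = speech_postnet(backbone(text_prenet(tokens), speech_prenet(mel)))
```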
LSCodec: Low-Bitrate and Speaker-Decoupled Discrete Speech Codec
- Discrete speech tokens are limited by high bitrate and redundant timbre information.
- LSCodec adopts a multi-stage unsupervised training framework that leverages speaker perturbation (sketched below).
- It establishes a continuous information bottleneck, then performs vector quantization that produces a discrete speaker-decoupled space, and refines acoustic detail with a discrete token vocoder.
- Paper (INTERSPEECH 20..
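A compressed sketch of the multi-stage recipe, assuming a toy speed-based perturbation in place of the paper's speaker perturbation and a two-layer codec: stage 1 learns a continuous bottleneck from perturbed input, stage 2 quantizes that space, and stage 3 (omitted) would train the discrete-token vocoder.

```python
# Multi-stage sketch with a toy speaker perturbation (assumed, simplified).
import torch
import torch.nn as nn
import torch.nn.functional as F

def perturb(wav):
    # Toy stand-in for speaker perturbation: a random time-stretch that
    # shifts timbre cues while largely keeping the content.
    rate = float(torch.empty(1).uniform_(0.9, 1.1))
    out = F.interpolate(wav, size=int(wav.shape[-1] * rate), mode="linear",
                        align_corners=False)
    return F.interpolate(out, size=wav.shape[-1], mode="linear",
                         align_corners=False)

encoder = nn.Conv1d(1, 32, 4, stride=2, padding=1)
decoder = nn.ConvTranspose1d(32, 1, 4, stride=2, padding=1)
codebook = nn.Embedding(128, 32)
wav = torch.randn(2, 1, 1024)

# Stage 1: continuous bottleneck trained on perturbed input, so it
# cannot rely on a consistent speaker identity.
z = encoder(perturb(wav))
stage1_loss = F.mse_loss(decoder(z), wav)

# Stage 2: vector quantization of the learned space (straight-through),
# yielding discrete speaker-decoupled tokens.
idx = torch.cdist(z.transpose(1, 2), codebook.weight.unsqueeze(0)).argmin(-1)
q = codebook(idx).transpose(1, 2)
zq = z + (q - z).detach()
stage2_loss = F.mse_loss(decoder(zq), wav) + F.mse_loss(q, z.detach())

# Stage 3 (not shown): a discrete-token vocoder refines acoustic detail.
```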
MPE-TTS: Customized Emotion Zero-Shot Text-to-Speech Using Multi-Modal Prompt
- Multi-modal prompts can be leveraged for zero-shot Text-to-Speech.
- MPE-TTS introduces a Multi-Modal Prompt Emotion Encoder to extract emotion information from diverse prompts (see the sketch below).
- It additionally applies a prosody predictor and an emotion consistency loss.
- Paper (INTERSPEECH 2025): Paper Link
1. Introduction
Zero-Shot Text-to-Speech (ZS-TTS) aims to generate speech in unseen styles. Speech-b..
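A minimal sketch of the multi-modal emotion encoder plus the consistency loss, assuming toy per-modality encoders over pooled input features; the stand-in tensor for the generated-speech emotion embedding marks where a speech emotion encoder would plug in.

```python
# Multi-modal prompt emotion encoder + consistency loss (toy encoders).
import torch
import torch.nn as nn
import torch.nn.functional as F

d = 64
encoders = {
    "text": nn.Linear(300, d),    # e.g. pooled text-prompt features
    "audio": nn.Linear(80, d),    # e.g. pooled mel features
    "image": nn.Linear(512, d),   # e.g. pooled image features
}

def emotion_embedding(prompt, modality):
    # Any prompt modality maps into one shared emotion space.
    return F.normalize(encoders[modality](prompt), dim=-1)

e_audio = emotion_embedding(torch.randn(2, 80), "audio")
e_text = emotion_embedding(torch.randn(2, 300), "text")

# Emotion consistency loss: the emotion read back from the generated
# speech (stand-in tensor here) should match the prompt embedding.
e_generated = F.normalize(torch.randn(2, d), dim=-1)
consistency = (1 - F.cosine_similarity(e_audio, e_generated, dim=-1)).mean()
```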
ClapFM-EVC: High-Fidelity and Flexible Emotional Voice Conversion with Dual Control from Natural Language and Speech
- Flexible, interpretable control remains limited in Emotional Voice Conversion.
- ClapFM-EVC introduces an emotional contrastive language-audio pre-training model guided by natural-language prompts and categorical labels (a toy sketch of the contrastive objective follows below).
- It seamlessly fuses the Phonetic PosteriorGram of a pre-trained Automatic Speech Recognition model..
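The contrastive pre-training can be sketched with a standard CLAP-style symmetric InfoNCE objective, assuming toy text and audio encoders and a fixed temperature; the categorical-label guidance and the downstream PPG fusion are not shown.

```python
# CLAP-style symmetric InfoNCE between emotion prompts and speech (toy).
import torch
import torch.nn as nn
import torch.nn.functional as F

d = 64
text_enc = nn.Linear(300, d)   # prompt like "a slightly angry voice"
audio_enc = nn.Linear(80, d)   # pooled speech features

t = F.normalize(text_enc(torch.randn(8, 300)), dim=-1)
a = F.normalize(audio_enc(torch.randn(8, 80)), dim=-1)

logits = t @ a.T / 0.07                      # temperature-scaled similarities
labels = torch.arange(8)                     # matched pairs on the diagonal
clap_loss = (F.cross_entropy(logits, labels) +
             F.cross_entropy(logits.T, labels)) / 2
```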
Spotlight-TTS: Spotlighting the Style via Voiced-Aware Style Extraction and Style Direction Adjustment for Expressive Text-to-Speech
- Expressive Text-to-Speech still has limitations.
- Spotlight-TTS introduces Voiced-Aware Style Extraction, which maintains continuity across different speech regions (sketched below).
- It additionally adjusts the direction of the extracted style to improve speech quality.
- Paper (INTERSPEECH 2025): Paper Link
1. Introduction
Text-to-Speech (TTS) synthesizes speech from the input text..
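A toy sketch of both ideas, assuming a precomputed per-frame voicing mask: style is pooled only over voiced frames, and a learned linear map then re-orients (but does not re-scale) the style vector.

```python
# Voiced-aware pooling and style direction adjustment (toy modules).
import torch
import torch.nn as nn
import torch.nn.functional as F

frames = torch.randn(2, 100, 64)                  # frame-level style features
voiced = (torch.rand(2, 100) > 0.4).float()       # 1 where the frame is voiced

# Voiced-aware extraction: pool style only over voiced regions.
w = voiced / voiced.sum(dim=1, keepdim=True).clamp(min=1e-6)
style = (frames * w.unsqueeze(-1)).sum(dim=1)     # (2, 64)

# Direction adjustment: re-orient the style vector with a learned map,
# then restore its norm so only the direction changes.
adjust = nn.Linear(64, 64, bias=False)
style = F.normalize(adjust(style), dim=-1) * style.norm(dim=-1, keepdim=True)
```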
LM-VC: Zero-Shot Voice Conversion via Speech Generation based on Language Models
- Language models can be leveraged for zero-shot voice conversion.
- LM-VC uses coarse tokens that recover the source linguistic content and target speaker timbre, and fine tokens that reconstruct the acoustic detail of the converted speech (a toy sketch follows below).
- It applies a masked prefix Language Model for content preservation and disentanglement.
- Additionally, to alleviate sampling errors, local a..
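A rough sketch of the coarse-then-fine token pipeline, assuming toy Transformer stacks; the masking ratio, the vocabularies, and the omission of the causal mask for the autoregressive stage are all simplifications.

```python
# Coarse-to-fine token generation with a masked prefix (toy stacks).
import torch
import torch.nn as nn

V, d = 256, 64
emb = nn.Embedding(V, d)
coarse_lm = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d, nhead=4, batch_first=True), num_layers=2)
fine_lm = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d, nhead=4, batch_first=True), num_layers=2)
head = nn.Linear(d, V)

content = torch.randint(0, V, (2, 40))   # source linguistic/content tokens
prompt = torch.randint(0, V, (2, 20))    # target-speaker acoustic prompt

# Masked prefix: randomly mask prompt positions so the model cannot just
# copy content from the prefix, which aids disentanglement.
mask = torch.rand(prompt.shape) < 0.15
prompt = prompt.masked_fill(mask, 0)     # 0 stands in for a [MASK] id

# Stage 1 (autoregressive in the paper; the causal mask is omitted here):
# predict coarse acoustic tokens from [masked prompt; content].
h = coarse_lm(emb(torch.cat([prompt, content], dim=1)))
coarse = head(h[:, -content.shape[1]:]).argmax(-1)

# Stage 2 (non-autoregressive): reconstruct fine tokens from coarse ones.
fine_logits = head(fine_lm(emb(coarse)))
```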
DiEmo-TTS: Disentangled Emotion Representations via Self-Supervised Distillation for Cross-Speaker Emotion Transfer in Text-to-Speech
- Existing emotional Text-to-Speech models cannot fully separate speaker and emotion characteristics.
- DiEmo-TTS introduces emotion clustering that uses emotional attribute prediction and speaker embeddings (see the sketch below).
- It employs a dual conditioning Transformer that integrates style features.
- Paper (INTERSPEECH 2025): Paper Link
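A toy sketch of the two ingredients, assuming plain k-means over emotion-attribute predictions for the clustering and simple additive injection for the dual conditioning; both stand in for the paper's actual modules.

```python
# Emotion clustering + dual conditioning (toy k-means, additive injection).
import torch
import torch.nn as nn

emo = torch.randn(100, 16)      # per-utterance emotion-attribute predictions
k = 5
centroids = emo[torch.randperm(100)[:k]].clone()
for _ in range(10):             # plain k-means to get cluster-level emotions
    assign = torch.cdist(emo, centroids).argmin(-1)
    for c in range(k):
        if (assign == c).any():
            centroids[c] = emo[assign == c].mean(0)

d = 64
emo_proj = nn.Linear(16, d)
spk_proj = nn.Linear(32, d)
layer = nn.TransformerEncoderLayer(d, nhead=4, batch_first=True)

text_h = torch.randn(2, 30, d)                 # phoneme encoder output
emotion = emo_proj(centroids[assign[:2]])      # cluster-level emotion condition
speaker = spk_proj(torch.randn(2, 32))         # speaker condition

# Dual conditioning: inject both style streams into the encoder input.
h = layer(text_h + emotion.unsqueeze(1) + speaker.unsqueeze(1))
```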
