
SpeechT5: Unified-Modal Encoder-Decoder Pre-Training for Spoken Language Processing
Encoder-decoder pre-training can be leveraged for self-supervised speech/text representation learning.
SpeechT5
- Uses a shared encoder-decoder network with six modal-specific pre/post-nets
- Pre-trains the model on large-scale unlabeled speech-text data and, to align textual and speech information in a unified semantic space, applies cross-modal vec..
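Below is a minimal PyTorch sketch of the routing idea described above: modal-specific pre-nets map text or speech into a shared encoder-decoder backbone, and modal-specific post-nets map its output back to the target modality. Only two of the six pre/post-nets are shown, and all module names and sizes are illustrative assumptions, not the released SpeechT5 implementation.

```python
# Sketch of SpeechT5-style routing: shared backbone, modal-specific pre/post-nets.
# Sizes and names are illustrative assumptions.
import torch
import torch.nn as nn

class SpeechT5Sketch(nn.Module):
    def __init__(self, vocab=100, n_mels=80, d_model=256):
        super().__init__()
        # Shared encoder-decoder backbone used by every modality pair.
        self.backbone = nn.Transformer(
            d_model=d_model, nhead=4,
            num_encoder_layers=2, num_decoder_layers=2, batch_first=True)
        # Modal-specific pre-nets map each modality into the shared space.
        self.text_pre = nn.Embedding(vocab, d_model)
        self.speech_pre = nn.Linear(n_mels, d_model)
        # Modal-specific post-nets map shared states back to each modality.
        self.text_post = nn.Linear(d_model, vocab)
        self.speech_post = nn.Linear(d_model, n_mels)

    def forward(self, src, tgt, src_modal, tgt_modal):
        pre = {"text": self.text_pre, "speech": self.speech_pre}
        post = {"text": self.text_post, "speech": self.speech_post}
        h = self.backbone(pre[src_modal](src), pre[tgt_modal](tgt))
        return post[tgt_modal](h)

model = SpeechT5Sketch()
text = torch.randint(0, 100, (2, 12))       # token ids
mel = torch.randn(2, 40, 80)                # mel-spectrogram frames
out = model(text, mel, "text", "speech")    # text-to-speech direction
print(out.shape)                            # torch.Size([2, 40, 80])
```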

LSCodec: Low-Bitrate and Speaker-Decoupled Discrete Speech Codec
Discrete speech tokens are limited by high bitrates and redundant timbre information.
LSCodec
- Adopts a multi-stage unsupervised training framework based on speaker perturbation
- Establishes a continuous information bottleneck, performs vector quantization to produce a discrete speaker-decoupled space, then refines acoustic detail with a discrete token vocoder
Paper (INTERSPEECH 20..
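The vector-quantization step can be sketched as follows: continuous bottleneck features are snapped to their nearest codebook entry, yielding one discrete token per frame, with a straight-through estimator so gradients still reach the encoder. Codebook size and dimensions are illustrative assumptions, not LSCodec's actual configuration.

```python
# Minimal VQ bottleneck sketch; codebook size and dim are assumptions.
import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    def __init__(self, num_codes=300, dim=64):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)

    def forward(self, z):  # z: (batch, frames, dim) continuous bottleneck
        # Squared L2 distance from every frame to every codebook entry.
        d = (z.unsqueeze(-2) - self.codebook.weight).pow(2).sum(-1)
        ids = d.argmin(dim=-1)          # discrete speech tokens per frame
        q = self.codebook(ids)
        # Straight-through estimator: copy gradients around the argmin.
        q = z + (q - z).detach()
        return q, ids

vq = VectorQuantizer()
bottleneck = torch.randn(2, 50, 64)     # continuous bottleneck features
quantized, tokens = vq(bottleneck)
print(tokens.shape)                     # torch.Size([2, 50])
```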

MPE-TTS: Customized Emotion Zero-Shot Text-to-Speech Using Multi-Modal Prompt
Multi-modal prompts can be leveraged for zero-shot Text-to-Speech.
MPE-TTS
- Introduces a Multi-Modal Prompt Emotion Encoder to extract emotion information from diverse prompts
- Additionally applies a prosody predictor and an emotion consistency loss
Paper (INTERSPEECH 2025): Paper Link
1. Introduction
Zero-Shot Text-to-Speech (ZS-TTS) aims to generate speech in unseen styles. Speech-b..
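A rough sketch of how a multi-modal prompt emotion encoder could look: one lightweight encoder per prompt modality, all projecting into a shared emotion-embedding space, with a cosine-based consistency term pulling paired prompts together. The modalities, dimensions, and pooling here are assumptions for illustration, not MPE-TTS's actual modules.

```python
# Sketch of a multi-modal prompt emotion encoder; all sizes are assumptions.
import torch
import torch.nn as nn

class MultiModalPromptEmotionEncoder(nn.Module):
    def __init__(self, d_emo=128, n_mels=80, text_dim=300, img_dim=512):
        super().__init__()
        # One encoder per prompt modality, each projecting into the
        # shared emotion-embedding space.
        self.enc = nn.ModuleDict({
            "speech": nn.Sequential(nn.Linear(n_mels, d_emo), nn.ReLU()),
            "text": nn.Sequential(nn.Linear(text_dim, d_emo), nn.ReLU()),
            "image": nn.Sequential(nn.Linear(img_dim, d_emo), nn.ReLU()),
        })

    def forward(self, prompt, modality):
        h = self.enc[modality](prompt)
        # Mean-pool the sequence axis to one emotion vector per prompt.
        return h.mean(dim=1)

encoder = MultiModalPromptEmotionEncoder()
emo_speech = encoder(torch.randn(2, 40, 80), "speech")
emo_text = encoder(torch.randn(2, 10, 300), "text")
# An emotion consistency loss can align embeddings of prompts that share
# the same emotion label, e.g. 1 - cosine similarity of paired prompts.
loss = 1 - nn.functional.cosine_similarity(emo_speech, emo_text).mean()
print(emo_speech.shape, loss.item())    # torch.Size([2, 128]) <scalar>
```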

ClapFM-EVC: High-Fidelity and Flexible Emotional Voice Conversion with Dual Control from Natural Language and Speech
Flexible and interpretable control remains limited in Emotional Voice Conversion.
ClapFM-EVC
- Introduces an emotional contrastive language-audio pre-training model guided by natural language prompts and categorical labels
- Seamlessly fuses the Phonetic PosteriorGram of a pre-trained Automatic Speech Recognition model..
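Contrastive language-audio pre-training (CLAP) can be sketched as a symmetric InfoNCE loss over paired audio/text embeddings: matched pairs sit on the diagonal of a cosine-similarity matrix. The embedding size and temperature below are illustrative assumptions, not ClapFM-EVC's settings.

```python
# Symmetric InfoNCE sketch for CLAP-style pre-training; sizes are assumptions.
import torch
import torch.nn.functional as F

def clap_contrastive_loss(audio_emb, text_emb, temperature=0.07):
    # Normalize embeddings so logits are scaled cosine similarities
    # for every (audio, text) pair in the batch.
    a = F.normalize(audio_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = a @ t.T / temperature
    targets = torch.arange(len(a))      # matched pairs on the diagonal
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.T, targets)) / 2

audio_emb = torch.randn(8, 256)   # emotional speech embeddings
text_emb = torch.randn(8, 256)    # natural-language prompt embeddings
print(clap_contrastive_loss(audio_emb, text_emb).item())
```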

Spotlight-TTS: Spotlighting the Style via Voiced-Aware Style Extraction and Style Direction Adjustment for Expressive Text-to-Speech
Expressive Text-to-Speech still faces limitations.
Spotlight-TTS
- Introduces Voiced-Aware Style Extraction, which maintains continuity across different speech regions
- Additionally adjusts the direction of the extracted style to improve speech quality
Paper (INTERSPEECH 2025): Paper Link
1. Introduction
Text-to-Speech (TTS) generates speech from input text..
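One plausible reading of voiced-aware style extraction is to pool style features only over voiced frames, where pitch and style cues are concentrated. The sketch below implements that reading with assumed module sizes; it is not the paper's exact architecture.

```python
# Voiced-aware pooling sketch; module choice and sizes are assumptions.
import torch
import torch.nn as nn

class VoicedAwareStyleExtractor(nn.Module):
    def __init__(self, n_mels=80, d_style=128):
        super().__init__()
        self.rnn = nn.GRU(n_mels, d_style, batch_first=True)

    def forward(self, mel, voiced_mask):
        # voiced_mask: (batch, frames), 1.0 where the frame is voiced.
        h, _ = self.rnn(mel)                    # (batch, frames, d_style)
        m = voiced_mask.unsqueeze(-1)
        # Average the style features over voiced frames only.
        return (h * m).sum(dim=1) / m.sum(dim=1).clamp(min=1)

extractor = VoicedAwareStyleExtractor()
mel = torch.randn(2, 60, 80)
voiced = (torch.rand(2, 60) > 0.4).float()      # toy voiced/unvoiced flags
style = extractor(mel, voiced)
print(style.shape)                              # torch.Size([2, 128])
```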

LM-VC: Zero-Shot Voice Conversion via Speech Generation based on Language Models
Language models can be leveraged for zero-shot voice conversion.
LM-VC
- Uses coarse tokens that recover the source linguistic content and target speaker timbre, and fine tokens that reconstruct the acoustic details of the converted speech
- Applies a masked prefix Language Model for content preservation and disentanglement
- Additionally, to alleviate sampling errors, applies local a..
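The coarse-to-fine generation can be sketched with two tiny causal LMs: the first generates coarse acoustic tokens from a prefix of target-speaker prompt tokens and source content tokens, the second generates fine tokens conditioned on the coarse ones. Greedy decoding and all sizes are simplifications; LM-VC's actual models and masked-prefix training are not reproduced here.

```python
# Two-stage coarse/fine token generation sketch; sizes are assumptions.
import torch
import torch.nn as nn

class TokenLM(nn.Module):
    # A tiny causal LM standing in for the coarse/fine acoustic-token models.
    def __init__(self, vocab=1024, d_model=256):
        super().__init__()
        self.emb = nn.Embedding(vocab, d_model)
        layer = nn.TransformerEncoderLayer(d_model, 4, batch_first=True)
        self.lm = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, vocab)

    def forward(self, tokens):
        mask = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
        return self.head(self.lm(self.emb(tokens), mask=mask))

@torch.no_grad()
def generate(lm, prefix, steps):
    seq = prefix
    for _ in range(steps):
        nxt = lm(seq)[:, -1].argmax(-1, keepdim=True)   # greedy for brevity
        seq = torch.cat([seq, nxt], dim=1)
    return seq[:, prefix.size(1):]

coarse_lm, fine_lm = TokenLM(), TokenLM()
content = torch.randint(0, 1024, (1, 20))   # source linguistic tokens
speaker = torch.randint(0, 1024, (1, 10))   # target speaker prompt tokens
# Stage 1: coarse tokens recover content and target timbre from the prefix.
coarse = generate(coarse_lm, torch.cat([speaker, content], dim=1), 20)
# Stage 2: fine tokens reconstruct acoustic detail given the coarse tokens.
fine = generate(fine_lm, coarse, 20)
print(coarse.shape, fine.shape)             # (1, 20) (1, 20)
```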