EME-TTS: Unlocking the Emphasis and Emotion Link in Speech Synthesis
Emotional text-to-speech and emphasis-controllable speech synthesis can be integrated.
EME-TTS
- Leverages weakly supervised learning based on emphasis pseudo-labels and variance-based emphasis features (see the sketch after this entry).
- Additionally strengthens the interaction between emotion signals and emphasis positions through an Emphasis Perception Enhancement block.
Paper (INTERSPEECH 2025) : Paper Link
1. Introduction
Existing …
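A minimal sketch of what a variance-based emphasis pseudo-label could look like, assuming word-level energy statistics as the emphasis cue; the function name, threshold rule, and inputs are illustrative, not the paper's exact formulation:

```python
import numpy as np

def emphasis_pseudo_labels(frame_energy, word_bounds, z_thresh=1.0):
    """Hypothetical variance-based emphasis pseudo-labeling.

    frame_energy: (T,) frame-level energy of one utterance
    word_bounds:  list of (start_frame, end_frame) spans, one per word
    Returns a 0/1 emphasis pseudo-label per word.
    """
    word_means = np.array([frame_energy[s:e].mean() for s, e in word_bounds])
    # Standardize word-level energy against the utterance's own statistics,
    # so "emphasized" means "deviates from this speaker's local baseline".
    z = (word_means - word_means.mean()) / (word_means.std() + 1e-8)
    return (z > z_thresh).astype(int)
```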
LinearVC: Linear Transformations of Self-Supervised Features through the Lens of Voice Conversion
Self-supervised representations can be used to build a voice conversion method.
LinearVC
- Converts voice through a simple linear transformation of self-supervised features (see the sketch after this entry).
- Constrains the set of allowed transformations and explicitly factorizes content and speaker information via singular value decomposition.
Paper (INTERSPEECH 2025) : Paper Link
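The stated mechanism is concrete enough to sketch: fit a single linear map between time-aligned source and target self-supervised features, then restrict its rank via SVD. The shapes, the alignment assumption, and the rank rule below are my assumptions, not the paper's exact recipe:

```python
import numpy as np

def fit_linear_map(X, Y, rank=None):
    """Fit Y ≈ X @ W by least squares over time-aligned feature matrices.

    X, Y: (T, D) self-supervised features of source / target utterances.
    `rank` optionally truncates W via SVD; treating the retained low-rank
    part as the speaker-bearing component is an assumption of this sketch.
    """
    W, *_ = np.linalg.lstsq(X, Y, rcond=None)             # (D, D) linear map
    if rank is not None:
        U, s, Vt = np.linalg.svd(W, full_matrices=False)
        W = (U[:, :rank] * s[:rank]) @ Vt[:rank]          # rank-constrained map
    return W

# Converting a new source utterance is then just `X_new @ W`,
# followed by a feature-to-waveform vocoder.
```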
DS-Codec: Dual-Stage Training with Mirror-to-NonMirror Architecture Switching for Speech Codec
A high-quality speech tokenizer is needed.
DS-Codec
- Introduces a dual-stage training framework with Mirror-to-NonMirror architecture switching.
- Improves the robustness of the learned codebook through the mirrored architecture and balances training through the Mirror-to-NonMirror structure.
Paper (INTERSPEECH 2025) : Paper Link
1. Introduction
Recently, VALL-E, AudioLM, AudioG…
RapFlow-TTS: Rapid and High-Fidelity Text-to-Speech with Improved Consistency Flow Matching
Ordinary differential equation-based text-to-speech suffers a trade-off between quality and inference speed.
RapFlow-TTS
- For consistent quality, enforces consistency of the velocity field along the flow-matching-straightened ordinary differential equation trajectory (see the sketch after this entry).
- To improve few-step synthesis quality, applies time-interval scheduling and adversarial learning.
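A toy rendering of a consistency objective on a straight flow-matching trajectory, assuming the common pattern where a student velocity network matches an EMA teacher slightly later in time; the loss form, `delta`, and network signatures are placeholders rather than RapFlow-TTS's actual objective:

```python
import torch

def consistency_fm_loss(v_theta, v_ema, x0, x1, delta=0.1):
    """Toy consistency flow-matching loss (assumed form, not the paper's).

    v_theta: student velocity network v(x, t); v_ema: frozen EMA teacher.
    x0: (B, D) noise samples, x1: (B, D) data samples.
    The straight trajectory is x_t = (1 - t) * x0 + t * x1, whose
    ground-truth velocity is the constant x1 - x0.
    """
    t = torch.rand(x0.shape[0], 1) * (1.0 - delta)
    xt = (1 - t) * x0 + t * x1
    xs = (1 - (t + delta)) * x0 + (t + delta) * x1
    # Flow-matching term: regress the straight-line velocity ...
    fm = ((v_theta(xt, t) - (x1 - x0)) ** 2).mean()
    # ... plus a consistency term: the student's velocity at time t should
    # agree with the teacher's velocity a small step later on the same line.
    with torch.no_grad():
        target = v_ema(xs, t + delta)
    cons = ((v_theta(xt, t) - target) ** 2).mean()
    return fm + cons
```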
SpeechT5: Unified-Modal Encoder-Decoder Pre-Training for Spoken Language Processing
Encoder-decoder pre-training can be used for self-supervised speech/text representation learning.
SpeechT5
- Uses a shared encoder-decoder network and six modal-specific pre-/post-nets.
- Pre-trains the model on large-scale unlabeled speech-text data and, to align textual and speech information in a unified semantic space, applies cross-modal vector quantization (see the sketch after this entry).
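One simple way to read the "unified semantic space" idea is to quantize both speech and text encoder states against a single shared codebook; the snippet below sketches that nearest-neighbor step with a straight-through gradient, with all details assumed:

```python
import torch

def quantize_to_shared_codebook(h, codebook):
    """Nearest-neighbor quantization into a codebook shared by both modalities.

    h:        (B, T, D) speech OR text encoder states
    codebook: (K, D) latent units
    Routing both modalities through the same discrete units is one simple
    reading of the cross-modal alignment; the exact mixing is assumed.
    """
    flat = h.reshape(-1, h.shape[-1])
    idx = torch.cdist(flat, codebook).argmin(dim=-1)   # (B*T,) nearest units
    q = codebook[idx].reshape(h.shape)
    # Straight-through estimator: forward pass uses q, gradients reach h.
    return h + (q - h).detach()
```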
LSCodec: Low-Bitrate and Speaker-Decoupled Discrete Speech Codec
Discrete speech tokens are limited by high bitrate and redundant timbre information.
LSCodec
- Adopts a multi-stage unsupervised training framework with speaker perturbation (see the sketch after this entry).
- Establishes a continuous information bottleneck, then performs vector quantization to produce a discrete speaker-decoupled space, and refines acoustic detail with a discrete-token vocoder.
Paper (INTERSPEECH 2025) : Paper Link
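An intuition-level sketch of how speaker perturbation can decouple timbre from discrete tokens, assuming the common "encode a perturbed copy, reconstruct the original" wiring; the helper names and the perturbation are hypothetical:

```python
import torch

def vq(z, codebook):
    """Plain nearest-neighbor vector quantization with a straight-through pass."""
    flat = z.reshape(-1, z.shape[-1])
    q = codebook[torch.cdist(flat, codebook).argmin(dim=-1)].reshape(z.shape)
    return z + (q - z).detach()

def speaker_decoupled_tokens(wav, perturb, encoder, codebook):
    """Assumed wiring of the speaker-perturbation idea: encode a
    timbre-perturbed copy of the waveform and quantize it, while the decoder
    (not shown) is trained to reconstruct the ORIGINAL audio. Because timbre
    differs between input and target, the discrete tokens are pushed to keep
    content and drop speaker identity.
    """
    z = encoder(perturb(wav))   # perturb: e.g., random pitch/formant shifting
    return vq(z, codebook)      # decoder(tokens) is supervised against `wav`
```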
MPE-TTS: Customized Emotion Zero-Shot Text-to-Speech Using Multi-Modal Prompt
Multi-modal prompts can be used for zero-shot text-to-speech.
MPE-TTS
- Introduces a Multi-Modal Prompt Emotion Encoder to extract emotion information from diverse prompts.
- Additionally applies a prosody predictor and an emotion consistency loss (see the sketch after this entry).
Paper (INTERSPEECH 2025) : Paper Link
1. Introduction
Zero-Shot Text-to-Speech (ZS-TTS) aims to generate speech in unseen styles. Speech-b…
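A minimal sketch of an emotion consistency loss, assuming a cosine-similarity form between prompt and synthesized-speech emotion embeddings; the exact objective in the paper may differ:

```python
import torch.nn.functional as F

def emotion_consistency_loss(e_prompt, e_synth):
    """Toy emotion-consistency objective (assumed form): pull the emotion
    embedding of the synthesized speech toward the embedding extracted from
    the multi-modal prompt, so the prompted emotion survives synthesis.

    e_prompt, e_synth: (B, D) embeddings from an emotion encoder.
    """
    return 1.0 - F.cosine_similarity(e_prompt, e_synth, dim=-1).mean()
```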
ClapFM-EVC: High-Fidelity and Flexible Emotional Voice Conversion with Dual Control from Natural Language and Speech
Flexible, interpretable control remains a limitation in emotional voice conversion.
ClapFM-EVC
- Introduces an emotional contrastive language-audio pre-training model guided by natural language prompts and categorical labels.
- Seamlessly fuses the Phonetic PosteriorGram of a pre-trained automatic speech recognition model …
Spotlight-TTS: Spotlighting the Style via Voiced-Aware Style Extraction and Style Direction Adjustment for Expressive Text-to-Speech
Expressive text-to-speech still has limitations.
Spotlight-TTS
- Introduces Voiced-Aware Style Extraction, which maintains continuity across different speech regions (see the sketch after this entry).
- Additionally adjusts the direction of the extracted style to improve speech quality.
Paper (INTERSPEECH 2025) : Paper Link
1. Introduction
Text-to-Speech (TTS) generates speech from input text …
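One plausible reading of voiced-aware style extraction is to pool frame-level style features only over voiced regions; the masking rule and shapes below are my assumptions, not the paper's mechanism:

```python
import torch

def voiced_aware_pool(style_frames, f0):
    """Sketch of voiced-aware style pooling (assumed mechanics): average
    frame-level style features only over voiced frames (f0 > 0), the regions
    carrying most expressive prosody, rather than over the whole utterance.

    style_frames: (B, T, D) frame-level style features; f0: (B, T) pitch track.
    """
    mask = (f0 > 0).unsqueeze(-1).float()                 # 1.0 on voiced frames
    pooled = (style_frames * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1.0)
    return pooled                                         # (B, D) style vector
```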
