M2R-Whisper: Multi-Stage and Multi-Scale Retrieval Augmentation for Enhancing Whisper
- Whisper has limitations in accurately recognizing various subdialects
- M2R-Whisper introduces In-Context Learning and Retrieval-Augmented techniques into Whisper
- Applies sentence-level in-context learning at the pre-processing stage and token-level $k$-Nearest Neighbor retrieval at the post-processing stage
- Paper (ICASSP 2025): Paper Link
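The token-level $k$-Nearest Neighbor step can be sketched as a kNN-LM style interpolation at decoding time: retrieve the nearest (hidden state, token) datastore entries and mix their induced token distribution with the model's own. This is a minimal illustrative sketch, not the paper's code; the function names, toy datastore, and hyperparameters below are assumptions.

```python
import numpy as np

def knn_augmented_probs(hidden, model_probs, keys, values, vocab_size,
                        k=2, temperature=1.0, lam=0.5):
    """Interpolate the model's token distribution with a kNN distribution
    built from (hidden-state key, next-token value) datastore pairs.
    Hypothetical sketch of token-level kNN augmentation."""
    dists = np.linalg.norm(keys - hidden, axis=1)      # L2 distance to each key
    nn = np.argsort(dists)[:k]                         # indices of k nearest keys
    weights = np.exp(-dists[nn] / temperature)         # closer keys weigh more
    weights /= weights.sum()
    knn_probs = np.zeros(vocab_size)
    for w, idx in zip(weights, nn):
        knn_probs[values[idx]] += w                    # scatter weight onto tokens
    return lam * knn_probs + (1 - lam) * model_probs   # linear interpolation

# Toy example: 4-token vocabulary, 3 datastore entries
keys = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
values = np.array([2, 2, 3])                           # stored next-token ids
model_probs = np.array([0.25, 0.25, 0.25, 0.25])
probs = knn_augmented_probs(np.array([1.0, 0.0]), model_probs,
                            keys, values, vocab_size=4)
print(probs.argmax())  # token 2 gains mass from its two nearby neighbors
```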
DecoupledSynth: Enhancing Zero-Shot Text-to-Speech via Factors Decoupling
- Existing Zero-Shot Text-to-Speech models struggle to balance the linguistic, para-linguistic, and non-linguistic information in their intermediate representations
- DecoupledSynth combines various self-supervised models to extract comprehensive, decoupled representations
- Leverages decoupled processing stages to support nuanced synthesis
- Paper (ICASSP 2025): Paper Link
ECAPA-TDNN: Emphasized Channel Attention, Propagation and Aggregation in TDNN based Speaker Verification
- Speaker verification relies on neural networks that extract speaker representations
- ECAPA-TDNN reconstructs the initial frame layers into 1-dimensional Res2Net modules and introduces Squeeze-and-Excitation blocks to explicitly model channel interdependencies
- Aggregates and propagates features from different hierarchical levels, and channe..
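The Squeeze-and-Excitation idea above can be sketched for 1-D speech features: squeeze the time axis into per-channel statistics, pass them through a small bottleneck network, and rescale each channel by the resulting weight. This is a hedged minimal sketch; the bottleneck size, activations, and weight shapes are illustrative assumptions, not ECAPA-TDNN's exact configuration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def se_block_1d(x, w1, w2):
    """x: (channels, time) feature map. Squeeze over time, excite channels."""
    s = x.mean(axis=1)                 # squeeze: global temporal average, (C,)
    e = sigmoid(w2 @ np.tanh(w1 @ s))  # excitation: bottleneck MLP, weights in (0, 1)
    return x * e[:, None]              # rescale each channel by its weight

rng = np.random.default_rng(0)
C, T, bottleneck = 8, 20, 4            # toy sizes, not the paper's values
x = rng.standard_normal((C, T))
w1 = rng.standard_normal((bottleneck, C)) * 0.1
w2 = rng.standard_normal((C, bottleneck)) * 0.1
out = se_block_1d(x, w1, w2)
print(out.shape)  # same (channels, time) shape, channels rescaled
```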
ProsodyFM: Unsupervised Phrasing and Intonation Control for Intelligible Speech Synthesis
- Existing text-to-speech models have limitations in terms of phrasing and intonation
- ProsodyFM uses a Flow Matching backbone to improve phrasing and intonation on the prosody side, introducing a phrase break encoder, a duration predictor, and a terminal intonation encoder
- Trained without explicit prosodic labels, it uncovers a broad spectrum of break durations and intonation patterns
UniWav: Towards Unified Pre-Training for Speech Representation Learning and Generation
- Pre-training and representation learning rely on different foundation models
- UniWav is a unified encoder-decoder framework for pre-training and representation learning
- The representation encoder and generative decoder are trained jointly
- Paper (ICLR 2025): Paper Link
ExpressiveSinger: Multilingual and Multi-Style Score-based Singing Voice Synthesis with Expressive Performance Control
- Singing Voice Synthesis lacks controllability in terms of timing, dynamics, and pitch
- ExpressiveSinger generates expressive performance control signals covering phoneme timing, the $F0$ curve, and the amplitude envelope
- Generates mel-spectrograms from the performance control signals using style guidance and singer timbre embeddings
- Paper ..
Balanced-Wav2Vec: Enhancing Stability and Robustness of Representation Learning through Sample Reweighting Techniques
- Self-Supervised Learning models lose expressiveness due to mode collapse and dimension collapse
- Balanced-Wav2Vec introduces a balanced-infoNCE loss that suppresses the emergence of over-represented modes
- Prevents the highly-skewed codebook distribution of Wav2Vec 2.0 and supports stable convergence
- Paper (INTERSPEECH 2024): Paper Link
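The reweighting idea behind a balanced InfoNCE objective can be sketched as a standard InfoNCE loss whose per-sample terms are down-weighted for over-represented codes (e.g. by inverse code frequency). The exact weighting in Balanced-Wav2Vec may differ; this is only an illustration of the general mechanism, with all names and the toy data assumed.

```python
import numpy as np

def info_nce(sim, pos_idx, weights=None):
    """sim: (N, K) similarities of N anchors against K candidates;
    pos_idx: index of the positive candidate for each anchor;
    weights: optional per-anchor weights, e.g. inverse code frequency."""
    logits = sim - sim.max(axis=1, keepdims=True)       # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    losses = -log_probs[np.arange(len(sim)), pos_idx]   # standard InfoNCE terms
    if weights is None:
        return losses.mean()
    return (weights * losses).sum() / weights.sum()     # reweighted average

rng = np.random.default_rng(1)
sim = rng.standard_normal((4, 5))                       # toy similarity matrix
pos = np.array([0, 1, 2, 3])
code_freq = np.array([0.7, 0.1, 0.1, 0.1])              # one dominant code
balanced = info_nce(sim, pos, weights=1.0 / code_freq)  # rare codes count more
plain = info_nce(sim, pos)
print(balanced > 0 and plain > 0)                       # both losses are positive
```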
CCC-Wav2Vec 2.0: Clustering Aided Cross Contrastive Self-Supervised Learning of Speech Representations
- Self-Supervised Learning can exploit unlabeled data effectively
- CCC-Wav2Vec 2.0 uses clustering and an augmentation-based cross-contrastive loss as the self-supervised objective
- This improves the robustness of pre-training
- Paper (SLT 2023): Paper Link
FACTSpeech: Speaking a Foreign Language Pronunciation Using Only Your Native Characters
- Most text-to-speech models do not account for transliterated text
- FACTSpeech introduces a language shift embedding that switches the pronunciation of the input text between native and literal pronunciations
- Applies conditional instance normalization to improve pronunciation while preserving speaker identity
- Paper (INTERSPEECH 2023): Paper Link
