
SD-HuBERT: Sentence-Level Self-Distillation Induces Syllabic Organization in HuBERT

Learning a sentence-level representation of speech can make syllabic organization emerge.

SD-HuBERT
- Fine-tunes pre-trained HuBERT with an aggregator token that summarizes the entire speech input
- Draws out salient syllabic structure through a self-distillation objective, without any supervision (sketched below)
- Additionally uses the Spoken Speech ABX benchmark for sentence-level representati..
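A minimal sketch of the aggregator-token idea, assuming HuBERT-style frame features of dimension 768; `AggregatorDistiller`, `self_distill_loss`, and the DINO-style distillation head are illustrative simplifications, not the paper's exact objective:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AggregatorDistiller(nn.Module):
    """Summarize a frame sequence with a learnable aggregator token (illustrative sketch)."""
    def __init__(self, dim=768):
        super().__init__()
        self.agg_token = nn.Parameter(torch.randn(1, 1, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, frames):                       # frames: (B, T, dim) HuBERT-like features
        agg = self.agg_token.expand(frames.size(0), -1, -1)
        out = self.encoder(torch.cat([agg, frames], dim=1))
        return out[:, 0]                             # sentence-level summary vector

def self_distill_loss(student_vec, teacher_vec, temp=0.1):
    # sentence-level self-distillation: match the student's soft assignment to the teacher's
    t = F.softmax(teacher_vec.detach() / temp, dim=-1)
    s = F.log_softmax(student_vec / temp, dim=-1)
    return -(t * s).sum(dim=-1).mean()
```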

M2R-Whisper: Multi-Stage and Multi-Scale Retrieval Augmentation for Enhancing Whisper

Whisper has limitations in accurately recognizing diverse subdialects.

M2R-Whisper
- Introduces In-Context Learning and Retrieval-Augmented techniques into Whisper
- Applies sentence-level in-context learning in the pre-processing stage and token-level $k$-Nearest Neighbor retrieval in the post-processing stage (sketched below)

Paper (ICASSP 2025) : Paper Link

1. Introduction

Whisper is an Autom..
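A minimal sketch of token-level $k$-Nearest Neighbor post-processing, assuming a pre-built datastore of decoder hidden states (`keys`, float tensor) and the token ids they emitted (`values`, long tensor); `knn_augmented_probs` and all parameter names are hypothetical, not Whisper's API:

```python
import torch
import torch.nn.functional as F

def knn_augmented_probs(model_logits, query, keys, values, vocab_size,
                        k=8, temp=10.0, lam=0.3):
    """Blend the model's next-token distribution with a kNN distribution built
    from a (hidden state -> emitted token) datastore. Illustrative sketch."""
    dists = torch.cdist(query.unsqueeze(0), keys).squeeze(0)   # (N,) L2 distances to datastore keys
    knn_d, knn_idx = dists.topk(k, largest=False)              # k nearest neighbors
    weights = F.softmax(-knn_d / temp, dim=-1)                 # closer neighbors get more weight
    knn_probs = torch.zeros(vocab_size)
    knn_probs.scatter_add_(0, values[knn_idx], weights)        # accumulate weight per token id
    model_probs = F.softmax(model_logits, dim=-1)
    return lam * knn_probs + (1.0 - lam) * model_probs         # interpolated distribution
```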

DecoupledSynth: Enhancing Zero-Shot Text-to-Speech via Factors Decoupling

Existing Zero-Shot Text-to-Speech models struggle to balance the linguistic, para-linguistic, and non-linguistic information in their intermediate representations.

DecoupledSynth
- Combines various self-supervised models to extract comprehensive, decoupled representations (sketched below)
- Supports nuanced synthesis through decoupled processing stages

Paper (ICASSP 2025) : Paper Link

1. I..
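A minimal sketch of factor decoupling, assuming frozen self-supervised extractors already provide separate content, prosody, and speaker features; `FactorProjector` and the dimensions are illustrative placeholders, not the paper's architecture:

```python
import torch
import torch.nn as nn

class FactorProjector(nn.Module):
    """Project features from separate (frozen) self-supervised models into
    decoupled content / prosody / speaker streams, then fuse them (sketch)."""
    def __init__(self, content_dim=768, prosody_dim=512, speaker_dim=256, out_dim=256):
        super().__init__()
        self.content_proj = nn.Linear(content_dim, out_dim)   # e.g. HuBERT-like linguistic features
        self.prosody_proj = nn.Linear(prosody_dim, out_dim)   # e.g. pitch/energy-aware features
        self.speaker_proj = nn.Linear(speaker_dim, out_dim)   # e.g. speaker-verification embedding

    def forward(self, content, prosody, speaker):
        # content, prosody: (B, T, dim); speaker: (B, dim), broadcast over time
        spk = self.speaker_proj(speaker).unsqueeze(1).expand(-1, content.size(1), -1)
        return torch.cat([self.content_proj(content),
                          self.prosody_proj(prosody), spk], dim=-1)
```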

ECAPA-TDNN: Emphasized Channel Attention, Propagation and Aggregation in TDNN based Speaker Verification

Speaker verification depends on neural networks that extract speaker representations.

ECAPA-TDNN
- Restructures the initial frame layers into 1-dimensional Res2Net modules and introduces Squeeze-and-Excitation blocks to explicitly model channel interdependencies (sketched below)
- Aggregates and propagates features from different hierarchical levels and channe..
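A minimal sketch of the Squeeze-and-Excitation idea on 1D frame-level features; `SEBlock1d` and the bottleneck size are illustrative, and the full ECAPA-TDNN block also wraps this in Res2Net convolutions and residual connections omitted here:

```python
import torch
import torch.nn as nn

class SEBlock1d(nn.Module):
    """Squeeze-and-Excitation on frame-level features: squeeze time into
    per-channel statistics, then rescale channels to model their interdependencies."""
    def __init__(self, channels, bottleneck=128):
        super().__init__()
        self.fc1 = nn.Linear(channels, bottleneck)
        self.fc2 = nn.Linear(bottleneck, channels)

    def forward(self, x):                       # x: (B, C, T) frame-level features
        s = x.mean(dim=-1)                      # squeeze: global average over time
        s = torch.sigmoid(self.fc2(torch.relu(self.fc1(s))))
        return x * s.unsqueeze(-1)              # excitation: per-channel rescaling
```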

ProsodyFM: Unsupervised Phrasing and Intonation Control for Intelligible Speech Synthesis

Existing text-to-speech models fall short in phrasing and intonation.

ProsodyFM
- Adopts a Flow Matching backbone (sketched below) and introduces a Phrase break encoder, Duration predictor, and Terminal intonation encoder to improve phrasing and intonation
- Trained without explicit prosodic labels, uncovering a broad spectrum of break durations and intonation patterns..
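A minimal sketch of a conditional flow-matching training objective of the kind such a backbone uses; `vector_field` stands in for the model and `cond` for the phrase-break / intonation conditioning, both hypothetical names:

```python
import torch
import torch.nn.functional as F

def flow_matching_loss(vector_field, x1, cond, sigma_min=1e-4):
    """Conditional flow matching (sketch): regress the velocity that moves
    noise x0 toward data x1 along a straight path, given conditioning."""
    x0 = torch.randn_like(x1)                                   # noise sample
    t = torch.rand(x1.size(0), 1, 1, device=x1.device)          # per-example time in [0, 1)
    xt = (1.0 - (1.0 - sigma_min) * t) * x0 + t * x1            # point on the probability path
    target = x1 - (1.0 - sigma_min) * x0                        # target velocity
    pred = vector_field(xt, t.view(-1), cond)                   # model's predicted velocity
    return F.mse_loss(pred, target)
```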

UniWav: Towards Unified Pre-Training for Speech Representation Learning and Generation

Speech representation learning and generation currently rely on separate foundation models.

UniWav
- A unified encoder-decoder pre-training framework for representation learning and generation
- Jointly learns a representation encoder and a generative decoder (sketched below)

Paper (ICLR 2025) : Paper Link

1. Introduction

Speech representations are used to excel at specific tasks; in particular, HuBE..
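A minimal sketch of jointly training a representation encoder with a generative decoder on the same input; the GRU stand-ins and reconstruction loss are illustrative simplifications, not UniWav's actual encoder-decoder design:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class UnifiedEncoderDecoder(nn.Module):
    """Joint representation encoder + generative decoder (stand-ins, sketch only)."""
    def __init__(self, in_dim=80, hid=256):
        super().__init__()
        self.encoder = nn.GRU(in_dim, hid, batch_first=True)   # representation encoder stand-in
        self.decoder = nn.GRU(hid, hid, batch_first=True)      # generative decoder stand-in
        self.out = nn.Linear(hid, in_dim)

    def forward(self, feats):                    # feats: (B, T, in_dim), e.g. mel frames
        reps, _ = self.encoder(feats)            # features usable for downstream tasks
        recon = self.out(self.decoder(reps)[0])  # generative reconstruction of the input
        loss = F.mse_loss(recon, feats)          # joint training signal (simplified)
        return reps, recon, loss
```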