
SpeechT5: Unified-Modal Encoder-Decoder Pre-Training for Spoken Language Processing
Encoder-decoder pre-training can be leveraged for self-supervised speech/text representation learning
SpeechT5
Uses a shared encoder-decoder network together with six modal-specific pre/post-nets
Pre-trains the model on large-scale unlabeled speech-text data and, to align textual and speech information in a unified semantic space, applies cross-modal vec..
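A minimal sketch of this layout, assuming a toy Transformer backbone and illustrative pre/post-net designs (the `SpeechT5Sketch` class, the dimensions, and the ASR-style routing shown here are hypothetical choices, not the paper's configuration):

```python
# Sketch: one shared encoder-decoder plus six modal-specific pre/post-nets
# that route speech or text through the same backbone. Sizes are illustrative.
import torch
import torch.nn as nn

class SpeechT5Sketch(nn.Module):
    def __init__(self, d_model=256, n_mels=80, vocab=100):
        super().__init__()
        # shared encoder-decoder backbone
        self.backbone = nn.Transformer(d_model=d_model, nhead=4,
                                       num_encoder_layers=2, num_decoder_layers=2,
                                       batch_first=True)
        # six modal-specific nets: speech/text pre-nets on both sides, speech/text post-nets
        self.speech_enc_prenet = nn.Linear(n_mels, d_model)
        self.text_enc_prenet = nn.Embedding(vocab, d_model)
        self.speech_dec_prenet = nn.Linear(n_mels, d_model)
        self.text_dec_prenet = nn.Embedding(vocab, d_model)
        self.speech_dec_postnet = nn.Linear(d_model, n_mels)
        self.text_dec_postnet = nn.Linear(d_model, vocab)

    def forward(self, src, tgt, src_modal="speech", tgt_modal="text"):
        enc_in = self.speech_enc_prenet(src) if src_modal == "speech" else self.text_enc_prenet(src)
        dec_in = self.speech_dec_prenet(tgt) if tgt_modal == "speech" else self.text_dec_prenet(tgt)
        hidden = self.backbone(enc_in, dec_in)
        post = self.speech_dec_postnet if tgt_modal == "speech" else self.text_dec_postnet
        return post(hidden)

model = SpeechT5Sketch()
mel = torch.randn(2, 120, 80)                   # speech input (batch, frames, mels)
tokens = torch.randint(0, 100, (2, 20))         # text target (batch, tokens)
logits = model(mel, tokens, "speech", "text")   # e.g. an ASR-style direction
print(logits.shape)                             # (2, 20, 100)
```

The same backbone weights serve every speech/text direction; only the pre/post-nets change with the input and output modality.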

BEATs: Audio Pre-Training with Acoustic Tokenizers
A Self-Supervised Learning framework is needed for general audio representation pre-training
BEATs
Uses a discrete label prediction task on the labels produced by a semantic-rich acoustic tokenizer
Builds an iterative pipeline between the tokenizer and the pre-trained model
Paper (ICML 2023) : Paper Link
1. Introduction
Speech Self-Supervised Learning (SSL) methods such as Wav2Vec 2.0, HuBERT, WavLM, and Data2Vec..
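A rough sketch of the iterative tokenizer-model loop, with placeholder stages (the random-projection tokenizer, nearest-code assignment, and per-code mean "model" below are stand-ins for the real training steps, not the paper's procedures):

```python
# Sketch: tokenizer -> discrete labels -> masked label prediction -> refined
# tokenizer, repeated so the two components improve each other.
import numpy as np

def init_random_tokenizer(num_codes=32, dim=40):
    # iteration 0: a random-projection tokenizer (placeholder)
    return np.random.randn(num_codes, dim)

def tokenize(codebook, features):
    # assign each frame to its nearest code -> discrete labels
    dists = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return dists.argmin(-1)

def train_ssl_model(features, labels):
    # placeholder for masked discrete-label prediction pre-training;
    # here per-code mean features stand in for the model's learned semantics
    num_codes = int(labels.max()) + 1
    return np.stack([features[labels == c].mean(0) if (labels == c).any()
                     else np.zeros(features.shape[1]) for c in range(num_codes)])

def refine_tokenizer(model_semantics):
    # placeholder for re-training the tokenizer to distill the pre-trained
    # model's semantic space into new discrete codes
    return model_semantics

features = np.random.randn(1000, 40)   # fake frame-level acoustic features
codebook = init_random_tokenizer()
for it in range(3):                    # tokenizer <-> model iterations
    labels = tokenize(codebook, features)
    model = train_ssl_model(features, labels)
    codebook = refine_tokenizer(model)
    print(f"iteration {it}: {len(np.unique(labels))} codes in use")
```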

Eta-WavLM: Efficient Speaker Identity Removal in Self-Supervised Speech Representations Using a Simple Linear Equation
Existing Self-Supervised Learning models do not fully disentangle speaker identity
Eta-WavLM
Linearly decomposes the Self-Supervised Learning representation into speaker-specific and speaker-independent components
Then produces a speaker-disentangled representation from the linearly decomposed features
Paper (ACL 2025)..
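One plausible reading of the linear decomposition, sketched as an ordinary least-squares fit of SSL features on speaker embeddings (the dimensions, the ECAPA-style embeddings, and the bias-augmented design matrix are illustrative assumptions):

```python
# Sketch: decompose SSL features into a speaker-dependent part predicted from
# a speaker embedding and a residual used as the speaker-independent feature.
import numpy as np

rng = np.random.default_rng(0)
S = rng.standard_normal((5000, 768))   # frame-level SSL features (e.g. WavLM)
E = rng.standard_normal((5000, 192))   # speaker embeddings (e.g. ECAPA), repeated per utterance

# Fit S ~= E_aug @ W, where E_aug carries a bias column, via least squares.
E_aug = np.concatenate([E, np.ones((E.shape[0], 1))], axis=1)
W, *_ = np.linalg.lstsq(E_aug, S, rcond=None)

speaker_part = E_aug @ W               # speaker-specific component
eta = S - speaker_part                 # speaker-independent (disentangled) residual
print(eta.shape)                       # (5000, 768)
```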

SD-HuBERT: Sentence-Level Self-Distillation Induces Syllabic Organization in HuBERT
Learning a sentence-level representation of speech can make syllabic organization emerge
SD-HuBERT
Fine-tunes pre-trained HuBERT with an aggregator token that summarizes the entire speech
Draws out salient syllabic structure with a self-distillation objective, without supervision
Additionally leverages the Spoken Speech ABX benchmark for sentence-level representati..
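A minimal sketch of the aggregator-token plus self-distillation setup, assuming a tiny Transformer stand-in for HuBERT and a simplified MSE distillation target (the `AggEncoder` class, the EMA rate, and the loss choice are illustrative, not the paper's exact objective):

```python
# Sketch: prepend a learnable aggregator token, train a student encoder to
# match an EMA teacher's utterance-level output (self-distillation).
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

class AggEncoder(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.agg_token = nn.Parameter(torch.zeros(1, 1, dim))
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, dim)

    def forward(self, frames):
        agg = self.agg_token.expand(frames.size(0), -1, -1)
        out = self.encoder(torch.cat([agg, frames], dim=1))
        return self.head(out[:, 0])    # sentence-level summary from the aggregator position

student = AggEncoder()
teacher = copy.deepcopy(student)       # EMA teacher, no gradients
for p in teacher.parameters():
    p.requires_grad_(False)

frames = torch.randn(4, 100, 256)      # stand-in for HuBERT frame features
with torch.no_grad():
    target = teacher(frames)
loss = F.mse_loss(student(frames), target)   # simplified self-distillation loss
loss.backward()

# EMA update of the teacher after each optimizer step
with torch.no_grad():
    for ps, pt in zip(student.parameters(), teacher.parameters()):
        pt.mul_(0.999).add_(ps, alpha=0.001)
```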

UniWav: Towards Unified Pre-Training for Speech Representation Learning and Generation
Representation learning and generation have relied on different foundation models
UniWav
A unified encoder-decoder pre-training framework for representation learning and generation
Jointly learns a representation encoder and a generative decoder
Paper (ICLR 2025) : Paper Link
1. Introduction
Speech representations are used to excel at specific tasks
In particular, HuBE..
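A minimal sketch of joint encoder-decoder training, where a masked-feature prediction term and a reconstruction term stand in for the paper's actual representation and generation objectives (the `UniWavSketch` class and all shapes are illustrative assumptions):

```python
# Sketch: one model, two heads -- a representation encoder trained with a
# masked-prediction-style loss and a generative decoder trained to
# reconstruct the input, optimized with a single combined objective.
import torch
import torch.nn as nn
import torch.nn.functional as F

class UniWavSketch(nn.Module):
    def __init__(self, n_mels=80, dim=256):
        super().__init__()
        enc_layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        dec_layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.in_proj = nn.Linear(n_mels, dim)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=2)   # representation encoder
        self.decoder = nn.TransformerEncoder(dec_layer, num_layers=2)   # generative decoder (simplified)
        self.pred_head = nn.Linear(dim, dim)       # masked-prediction head (representation loss)
        self.recon_head = nn.Linear(dim, n_mels)   # reconstruction head (generation loss)

    def forward(self, mel, mask):
        x = self.in_proj(mel)
        h = self.encoder(x * (~mask).unsqueeze(-1))            # encoder sees masked frames zeroed out
        rep_loss = F.mse_loss(self.pred_head(h)[mask], x.detach()[mask])   # predict clean features at masked frames
        gen_loss = F.mse_loss(self.recon_head(self.decoder(h)), mel)       # decoder regenerates the input
        return rep_loss + gen_loss

model = UniWavSketch()
mel = torch.randn(2, 120, 80)
mask = torch.rand(2, 120) < 0.3        # random frame mask
loss = model(mel, mask)
loss.backward()
```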

Balanced-Wav2Vec: Enhancing Stability and Robustness of Representation Learning through Sample Reweighting Techniques
Self-Supervised Learning models lose expressiveness due to mode collapse and dimension collapse
Balanced-Wav2Vec
Introduces a balanced-infoNCE loss that suppresses the emergence of over-represented modes
Prevents Wav2Vec 2.0's highly-skewed codebook distribution and supports stable convergence
Paper (INTERSPEECH 2024) : Pape..
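One illustrative way to realize such a reweighting, sketched as an InfoNCE term whose per-sample contribution is scaled by the inverse frequency of its positive's codebook entry (the paper's exact balanced-infoNCE formulation may differ; the weighting below is an assumption):

```python
# Sketch: contrastive loss where samples whose quantized targets come from
# over-represented codes are down-weighted, discouraging codebook collapse.
import torch
import torch.nn.functional as F

def balanced_info_nce(context, quantized, code_ids, temperature=0.1):
    # context:   (N, D) encoder outputs at masked steps
    # quantized: (N, D) quantized targets; the matching row is the positive,
    #            all other rows act as distractors
    # code_ids:  (N,) codebook index chosen for each target
    sim = F.cosine_similarity(context.unsqueeze(1), quantized.unsqueeze(0), dim=-1)
    per_sample = F.cross_entropy(sim / temperature,
                                 torch.arange(context.size(0)), reduction="none")

    # inverse-frequency weights over the codes used in this batch
    counts = torch.bincount(code_ids, minlength=int(code_ids.max()) + 1).float()
    weights = 1.0 / counts[code_ids]
    weights = weights / weights.sum() * len(weights)   # keep the loss scale comparable
    return (weights * per_sample).mean()

context = torch.randn(64, 256, requires_grad=True)
quantized = torch.randn(64, 256)
code_ids = torch.randint(0, 8, (64,))   # heavily shared codes get down-weighted
loss = balanced_info_nce(context, quantized, code_ids)
loss.backward()
```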