ParaMETA: Towards Learning Disentangled Paralinguistic Speaking Styles Representations from Speech
- A speech model should be able to learn representations for diverse speaking styles such as emotion, gender, and age.
- ParaMETA projects speech into a dedicated sub-space for each style, obtaining disentangled, task-specific embeddings.
- By mitigating inter-task interference and negative transfer, a single model can handle multiple paralinguistic tasks.
- Paper (AAAI 2026): Paper Link
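The per-style sub-space idea can be illustrated with a minimal sketch: a shared utterance embedding mapped through one dedicated projection per paralinguistic task. All dimensions, style names, and the linear-projection choice here are illustrative assumptions, not details from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: a 256-d shared speech embedding projected into
# smaller per-style sub-spaces (names and dims are illustrative).
SHARED_DIM = 256
STYLE_DIMS = {"emotion": 64, "gender": 16, "age": 32}

# One dedicated projection matrix per paralinguistic style.
projections = {
    style: rng.standard_normal((SHARED_DIM, dim)) / np.sqrt(SHARED_DIM)
    for style, dim in STYLE_DIMS.items()
}

def project_styles(shared_embedding: np.ndarray) -> dict:
    """Map one shared utterance embedding into task-specific sub-spaces."""
    return {style: shared_embedding @ W for style, W in projections.items()}

utterance = rng.standard_normal(SHARED_DIM)
embeddings = project_styles(utterance)
```

Because each task reads only its own sub-space, gradients from one style task do not directly overwrite features another task depends on, which is the intuition behind reduced inter-task interference.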
UniSpeech-SAT: Universal Speech Representation Learning with Speaker Aware Pre-Training
- Self-Supervised Learning can be leveraged to model speaker characteristics.
- UniSpeech-SAT introduces multi-task learning that integrates an utterance-wise contrastive loss with the Self-Supervised Learning objective.
- It performs data augmentation based on an utterance mixing strategy.
- Paper (ICASSP 2022): Paper Link
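A minimal sketch of the utterance mixing idea: overlay a random chunk of a second utterance onto the main one at reduced energy, so the model must keep tracking the main speaker. The chunk-length range and mixing scale below are assumptions for illustration, not the paper's exact settings.

```python
import numpy as np

rng = np.random.default_rng(0)

def mix_utterances(main: np.ndarray, other: np.ndarray,
                   max_ratio: float = 0.5, scale: float = 0.3) -> np.ndarray:
    """Overlay a random chunk of another utterance onto the main one at low energy."""
    # Pick a chunk covering 10%..max_ratio of the main utterance (illustrative range).
    chunk_len = int(len(main) * rng.uniform(0.1, max_ratio))
    src = rng.integers(0, len(other) - chunk_len + 1)   # where to cut from the other utterance
    dst = rng.integers(0, len(main) - chunk_len + 1)    # where to paste into the main utterance
    mixed = main.copy()
    mixed[dst:dst + chunk_len] += scale * other[src:src + chunk_len]
    return mixed

main = rng.standard_normal(16000)   # 1 s of 16 kHz audio (toy data)
other = rng.standard_normal(16000)
mixed = mix_utterances(main, other)
```

The training target stays the main utterance, so the interfering chunk acts purely as augmentation that encourages speaker-discriminative features.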
UniSpeech: Unified Speech Representation Learning with Labeled and Unlabeled Data
- Both labeled and unlabeled data can be exploited for speech representation learning.
- UniSpeech combines supervised phonetic CTC learning with phonetically-aware contrastive self-supervised learning.
- The resulting representation captures phonetic structure, improving generalization across languages and domains.
- Paper (ICML 2021): Paper Link
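The combination of a supervised CTC term and a contrastive SSL term can be sketched as a weighted multi-task objective. The InfoNCE-style contrastive loss and the simple weighting scheme below are illustrative assumptions; the paper's exact formulation may differ.

```python
import numpy as np

def contrastive_loss(context: np.ndarray, targets: np.ndarray,
                     temperature: float = 0.1) -> float:
    """InfoNCE-style loss: each context frame should match its own target
    among all frames. context, targets: (T, D)."""
    c = context / np.linalg.norm(context, axis=1, keepdims=True)
    q = targets / np.linalg.norm(targets, axis=1, keepdims=True)
    logits = (c @ q.T) / temperature
    logits -= logits.max(axis=1, keepdims=True)          # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))

def multitask_loss(ctc_loss: float, contrastive: float, alpha: float = 0.5) -> float:
    """Weighted sum of supervised CTC and self-supervised contrastive terms
    (the weighting here is illustrative)."""
    return alpha * ctc_loss + (1 - alpha) * contrastive

rng = np.random.default_rng(0)
context = rng.standard_normal((20, 64))
loss = multitask_loss(ctc_loss=1.2, contrastive=contrastive_loss(context, context))
```

Sharing one encoder under both terms is what pushes the SSL features toward phonetically meaningful structure.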
DQ-Data2Vec: Decoupling Quantization for Multilingual Speech Recognition
- Data2Vec's masked representation generation depends on multi-layer averaging.
- DQ-Data2Vec uses $K$-means quantizers to decouple language and phoneme information for masked prediction.
- In particular, quantization is applied to both shallow and middle layers to explicitly decouple irrelevant features.
- Paper (TASLP 2025): Paper Link
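The $K$-means quantization step itself can be sketched in a few lines: fit centroids on continuous layer features, then replace each frame with its nearest-centroid index to get discrete targets. This is a generic Lloyd-style k-means, not the paper's training pipeline; layer choice and $K$ here are placeholders.

```python
import numpy as np

def kmeans_assign(features: np.ndarray, centroids: np.ndarray) -> np.ndarray:
    """Quantize continuous features (N, D) to nearest-centroid indices (N,)."""
    d = ((features[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    return d.argmin(axis=1)

def fit_kmeans(features: np.ndarray, k: int = 8, iters: int = 10,
               seed: int = 0) -> np.ndarray:
    """Plain Lloyd iterations: assign frames to centroids, then re-average."""
    rng = np.random.default_rng(seed)
    centroids = features[rng.choice(len(features), k, replace=False)].copy()
    for _ in range(iters):
        codes = kmeans_assign(features, centroids)
        for j in range(k):
            members = features[codes == j]
            if len(members):
                centroids[j] = members.mean(axis=0)
    return centroids

# Toy "layer features": two well-separated clusters standing in for two phone classes.
rng = np.random.default_rng(0)
feats = np.vstack([rng.standard_normal((50, 4)) + 10,
                   rng.standard_normal((50, 4)) - 10])
centroids = fit_kmeans(feats, k=2)
codes = kmeans_assign(feats, centroids)
```

Running separate quantizers on shallow vs. middle layers yields separate codebooks, which is what lets language-level and phoneme-level information be decoupled as distinct prediction targets.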
Metis: A Foundation Speech Generation Model with Masked Generative Pre-training
- Masked Generative Modeling can be used to build a speech foundation model that is fine-tuned for diverse speech generation tasks.
- Metis employs two discrete speech representations: Self-Supervised Learning tokens and acoustic tokens.
- It performs masked generative pre-training on 300K hours of speech data without additional conditions.
- Paper (NeurIPS 2025): Paper Link
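The masked generative setup can be sketched as span masking over a discrete token sequence: hide random contiguous spans and train the model to predict them from the unmasked context. Span length, masking probability, and the mask-token id below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
MASK_ID = -1  # placeholder mask-token id (illustrative)

def mask_spans(tokens: np.ndarray, span_len: int = 5,
               mask_prob: float = 0.5):
    """Mask random contiguous spans of discrete speech tokens.
    Returns (masked_tokens, target_mask) where target_mask marks
    the positions the model must reconstruct."""
    masked = tokens.copy()
    target_mask = np.zeros(len(tokens), dtype=bool)
    for start in range(0, len(tokens), span_len):
        if rng.random() < mask_prob:
            masked[start:start + span_len] = MASK_ID
            target_mask[start:start + span_len] = True
    return masked, target_mask

# Toy discrete speech tokens from a 1024-entry codebook.
tokens = rng.integers(0, 1024, size=50)
masked, target_mask = mask_spans(tokens)
```

Because the loss is computed only on masked positions, the same pre-trained model can later be adapted to any generation task that can be framed as "fill in the missing tokens".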
SSAMBA: Self-Supervised Audio Representation Learning with Mamba State Space Model
- Transformer architectures for audio representation learning incur quadratic complexity in memory and inference time.
- SSAMBA introduces Mamba, a State Space Model, to self-supervised audio representation learning.
- Bidirectional Mamba captures complex audio patterns and learns robust audio representations from unlabeled datasets.
- Paper (SLT 2024): Paper Link
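The bidirectional idea can be sketched without the Mamba library by using a toy linear recurrence as a stand-in for the selective SSM scan: run it forward, run it again on the time-reversed sequence, and concatenate, so every frame sees both past and future context in linear time. The recurrence and its decay value are illustrative, not Mamba's actual parameterization.

```python
import numpy as np

def linear_scan(x: np.ndarray, decay: float = 0.9) -> np.ndarray:
    """Causal linear recurrence h_t = decay * h_{t-1} + x_t
    (a toy stand-in for an SSM scan). x: (T, D)."""
    h = np.zeros_like(x[0])
    out = np.empty_like(x)
    for t, xt in enumerate(x):
        h = decay * h + xt
        out[t] = h
    return out

def bidirectional_scan(x: np.ndarray, decay: float = 0.9) -> np.ndarray:
    """Forward scan + time-reversed scan, concatenated per frame (T, 2D)."""
    fwd = linear_scan(x, decay)
    bwd = linear_scan(x[::-1], decay)[::-1]
    return np.concatenate([fwd, bwd], axis=-1)

rng = np.random.default_rng(0)
frames = rng.standard_normal((12, 8))  # toy (time, feature) audio features
features = bidirectional_scan(frames)
```

Both directions cost O(T), which is the complexity advantage over a Transformer's O(T^2) attention that motivates the architecture.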
