
AxLSTMs: Learning Self-Supervised Audio Representations with xLSTMs
xLSTM achieves performance comparable to the Transformer
AxLSTM
Learns general-purpose audio representations from masked spectrogram patches by applying xLSTM in a self-supervised setting
Pre-trained on the AudioSet dataset to handle a variety of downstream tasks
Paper (INTERSPEECH 2025) : Paper Link
1. Introduction
The Transformer has excellent generalization ability and a data-agnostic nature, but its scaled dot-pr..
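A minimal sketch of the masked-spectrogram-patch setup described above: split a log-mel spectrogram into non-overlapping patches and randomly mask a fraction of them before the sequence is fed to the encoder. The patch size, mask ratio, and zero-fill corruption are illustrative assumptions, not AxLSTM's actual configuration.

```python
import torch

def mask_spectrogram_patches(spec, patch_size=16, mask_ratio=0.5):
    """Split a (batch, freq, time) log-mel spectrogram into non-overlapping
    patches and randomly mask a fraction of them (hypothetical ratio).

    Returns the flattened patches, a boolean mask (True = masked), and the
    patches with masked positions zeroed out, which a sequence encoder such
    as an xLSTM could consume."""
    b, f, t = spec.shape
    # drop trailing bins so both axes divide evenly into patches
    f, t = f - f % patch_size, t - t % patch_size
    spec = spec[:, :f, :t]
    patches = (
        spec.unfold(1, patch_size, patch_size)   # (b, f/p, t, p)
            .unfold(2, patch_size, patch_size)   # (b, f/p, t/p, p, p)
            .reshape(b, -1, patch_size * patch_size)
    )
    n = patches.shape[1]
    num_masked = int(mask_ratio * n)
    # per-example random permutation; the first `num_masked` indices are masked
    rand = torch.rand(b, n, device=spec.device).argsort(dim=1)
    mask = torch.zeros(b, n, dtype=torch.bool, device=spec.device)
    mask.scatter_(1, rand[:, :num_masked], True)
    corrupted = patches.masked_fill(mask.unsqueeze(-1), 0.0)
    return patches, mask, corrupted
```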

EmotionRankCLAP: Bridging Natural Language Speaking Styles and Ordinal Speech Emotion via Rank-N-Contrast
Contrastive Language-Audio Pre-training fails to capture the ordinal nature of emotion and shows insufficient alignment between audio and text embeddings
EmotionRankCLAP
Jointly captures fine-grained emotion variation by leveraging the dimensional attributes of emotional speech and natural language prompts
Uses a Rank-N-Contrast objective to ..
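Rank-N-Contrast orders samples by label distance so that embedding similarity follows the ordinal structure of a continuous attribute. A simplified, loop-based sketch over a single dimensional attribute such as arousal; the cosine similarity, temperature, and overall form are assumptions for illustration, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def rank_n_contrast_loss(emb, labels, temperature=0.1):
    """Simplified Rank-N-Contrast-style loss.

    emb:    (batch, dim) embeddings (e.g., audio or text)
    labels: (batch,) continuous ordinal attribute (e.g., arousal)

    For an anchor i and a candidate j, the denominator keeps only samples k
    whose label distance to i is >= that of j, so embeddings are pushed to be
    ordered consistently with label distances."""
    emb = F.normalize(emb, dim=-1)
    sim = emb @ emb.T / temperature                       # (b, b)
    label_dist = (labels[:, None] - labels[None, :]).abs()
    b = emb.shape[0]
    eye = torch.eye(b, dtype=torch.bool, device=emb.device)
    loss, count = 0.0, 0
    for i in range(b):
        for j in range(b):
            if i == j:
                continue
            # candidates at least as far from the anchor as j (anchor excluded)
            keep = (label_dist[i] >= label_dist[i, j]) & ~eye[i]
            denom = torch.logsumexp(sim[i][keep], dim=0)
            loss = loss + (denom - sim[i, j])
            count += 1
    return loss / count
```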

Audio Mamba: Selective State Spaces for Self-Supervised Audio Representations
Selective state space models have recently been attracting attention
Audio Mamba
Applies self-supervised learning to a selective state space model for audio representation learning
Learns general-purpose audio representations from randomly masked spectrogram patches
Paper (INTERSPEECH 2024) : Paper Link
1. Introduction
The Transformer, across multiple domains and data modalities, repr..
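The selective state space model behind Mamba replaces attention with an input-dependent recurrence. A minimal, unoptimized sketch of that selective scan follows; real implementations use a fused parallel kernel, and the shapes and names here are generic rather than Audio Mamba's code.

```python
import torch

def selective_scan(x, A, B, C, delta):
    """Minimal (unoptimized) selective state-space recurrence, the core of
    Mamba-style models: input-dependent B, C, and step size `delta` make the
    state update "selective" about what to keep from each token.

    x:     (batch, seq, d_model)   input sequence (e.g., spectrogram patches)
    A:     (d_model, d_state)      state transition (negative values -> decay)
    B, C:  (batch, seq, d_state)   input-dependent projections
    delta: (batch, seq, d_model)   input-dependent step sizes (> 0)"""
    batch, seq, d_model = x.shape
    d_state = A.shape[-1]
    h = x.new_zeros(batch, d_model, d_state)
    outputs = []
    for t in range(seq):
        dt = delta[:, t].unsqueeze(-1)                  # (batch, d_model, 1)
        dA = torch.exp(dt * A)                          # discretized transition
        dBx = dt * B[:, t].unsqueeze(1) * x[:, t].unsqueeze(-1)
        h = dA * h + dBx                                # recurrent state update
        y = (h * C[:, t].unsqueeze(1)).sum(dim=-1)      # read out (batch, d_model)
        outputs.append(y)
    return torch.stack(outputs, dim=1)                  # (batch, seq, d_model)
```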

HuBERT-VIC: Improving Noise-Robust Automatic Speech Recognition of Speech Foundation Model via Variance-Invariance-Covariance Regularization
Speech foundation models are limited in terms of noise-robustness
HuBERT-VIC
Trains the model with variance, invariance, and covariance regularization objectives
Adjusts the statistics of noisy speech representations to improve generalization to various noise types
Paper (INTERSPEECH 2025..
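The variance-invariance-covariance idea follows VICReg (Bardes et al.): pull two views of the same utterance together while keeping per-dimension variance up and cross-dimension covariance down. A generic sketch with the loss weights commonly used for VICReg, which are assumptions rather than values from HuBERT-VIC.

```python
import torch
import torch.nn.functional as F

def vic_regularization(z_a, z_b, sim_w=25.0, var_w=25.0, cov_w=1.0, eps=1e-4):
    """VICReg-style variance-invariance-covariance objective.

    z_a, z_b: (batch, dim) representations of two views of the same speech,
              e.g., a clean and a noise-augmented utterance."""
    # invariance: the two views should map to the same point
    inv = F.mse_loss(z_a, z_b)

    # variance: keep every dimension's std above 1 to prevent collapse
    std_a = torch.sqrt(z_a.var(dim=0) + eps)
    std_b = torch.sqrt(z_b.var(dim=0) + eps)
    var = torch.relu(1.0 - std_a).mean() + torch.relu(1.0 - std_b).mean()

    # covariance: decorrelate dimensions by penalizing off-diagonal covariance
    def off_diag_cov(z):
        z = z - z.mean(dim=0)
        cov = (z.T @ z) / (z.shape[0] - 1)
        off = cov - torch.diag(torch.diag(cov))
        return off.pow(2).sum() / z.shape[1]

    cov = off_diag_cov(z_a) + off_diag_cov(z_b)
    return sim_w * inv + var_w * var + cov_w * cov
```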

HuBERT-AGG: Aggregated Representation Distillation of Hidden-Unit BERT for Robust Speech Recognition
Self-Supervised Learning for Automatic Speech Recognition is limited in terms of noise robustness
HuBERT-AGG
Learns noise-invariant SSL representations by distilling an aggregated layer-wise representation
In particular, it uses a small portion of labeled data to compute a weighted sum over all hidden states of a pre-trained vanilla HuBERT via an aggre..
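The truncated sentence describes a weighted sum over all hidden states of a pre-trained vanilla HuBERT. A sketch of such a learnable layer aggregator in the SUPERB style; whether HuBERT-AGG's aggregator module looks exactly like this is not shown in the excerpt. With HuggingFace Transformers, `model(..., output_hidden_states=True).hidden_states` returns a tuple of per-layer states that this module could consume.

```python
import torch
import torch.nn as nn

class LayerAggregator(nn.Module):
    """Learnable weighted sum over all hidden states of a frozen, pre-trained
    model (e.g., vanilla HuBERT). A small labeled set could be used to fit the
    per-layer weights; this module only shows the aggregation itself."""

    def __init__(self, num_layers: int):
        super().__init__()
        self.weights = nn.Parameter(torch.zeros(num_layers))

    def forward(self, hidden_states):
        # hidden_states: list/tuple of (batch, time, dim), one per layer
        stacked = torch.stack(list(hidden_states), dim=0)  # (layers, b, t, d)
        w = torch.softmax(self.weights, dim=0)              # normalized weights
        return (w.view(-1, 1, 1, 1) * stacked).sum(dim=0)   # (b, t, d)
```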

DinoSR: Self-Distillation and Online Clustering for Self-Supervised Speech Representation Learning
A strong representation learning model for speech is needed
DinoSR
Combines masked language modeling, self-distillation, and online clustering
Uses a teacher network to extract contextualized embeddings from the input audio, applies online clustering to the embeddings, and guides the student network with the discretized tokens
Paper (NeurIPS 2023) : Pap..
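A minimal sketch of the online-clustering step that turns teacher embeddings into discrete targets for the student: nearest-centroid assignment plus an EMA-style codebook update. This is a generic approximation for illustration, not DinoSR's exact procedure, and the decay value is an assumption.

```python
import torch
import torch.nn.functional as F

def online_cluster_targets(teacher_emb, codebook, decay=0.9):
    """Assign each teacher embedding to its nearest codeword and move the used
    codewords toward the embeddings they captured (EMA-style update). The
    resulting discrete codes can serve as classification targets for the
    student network on masked positions.

    teacher_emb: (num_frames, dim)  contextualized teacher embeddings
    codebook:    (num_codes, dim)   cluster centroids, updated in place"""
    with torch.no_grad():
        # nearest-centroid assignment by Euclidean distance
        dists = torch.cdist(teacher_emb, codebook)              # (frames, codes)
        codes = dists.argmin(dim=1)                             # (frames,)

        # per-code mean of the embeddings assigned to it
        one_hot = F.one_hot(codes, codebook.shape[0]).float()   # (frames, codes)
        counts = one_hot.sum(dim=0).clamp(min=1.0)              # (codes,)
        means = one_hot.T @ teacher_emb / counts.unsqueeze(1)   # (codes, dim)

        # EMA update, only for codewords actually used in this batch
        used = one_hot.sum(dim=0) > 0
        codebook[used] = decay * codebook[used] + (1 - decay) * means[used]
    return codes

def student_loss(student_logits, codes):
    """Cross-entropy between the student's predictions at masked positions
    and the teacher-derived discrete codes."""
    return F.cross_entropy(student_logits, codes)
```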