
CAM: Context-Aware Masking for Robust Speaker Verification
Speaker Verification suffers from performance degradation caused by noise.
CAM:
- Builds a speaker embedding network that focuses on the speaker of interest and blurs unrelated noise
- Dynamically controls the masking threshold through an auxiliary context embedding that captures speaker and noise characteristics
Paper (ICASSP 2021): Paper Link
1. Introduction
Speaker Verification compares a test utterance with the enrollment..
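The idea of a context embedding that dynamically sets a masking threshold can be illustrated with a toy soft mask. This is only a minimal sketch under assumed shapes: the projection `proj`, the per-channel thresholds, and the sigmoid gating are illustrative choices, not the mechanism from the CAM paper.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def context_aware_mask(features, context, proj):
    """Toy context-aware masking: an utterance-level context embedding is
    projected to per-channel thresholds, and feature bins below the
    threshold are softly suppressed (blurred).

    features: (time, dim) frame-level features
    context:  (ctx_dim,) auxiliary context embedding
    proj:     (dim, ctx_dim) projection producing dynamic thresholds
    """
    thresholds = proj @ context            # (dim,) context-dependent thresholds
    mask = sigmoid(features - thresholds)  # soft mask in (0, 1)
    return features * mask
```

With a low threshold the mask passes the features nearly unchanged; a high threshold (e.g. for a noise-dominated context) drives the mask toward zero.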

SpecTokenizer: A Lightweight Streaming Codec in the Compressed Spectrum Domain
A lightweight neural audio codec is needed.
SpecTokenizer:
- A lightweight streaming codec that operates in the compressed spectral domain
- Performs multi-scale modeling in the compressed spectrum domain by alternating CNN and RNN layers
Paper (INTERSPEECH 2025): Paper Link
1. Introduction
A Neural Audio Codec (NAC) compresses an audio signal into a discrete code sequence. BUT, En..
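The alternating CNN/RNN pattern can be sketched in miniature: strided convolutions capture local structure at progressively coarser time scales, while recurrent passes carry streaming temporal context between them. Everything below (kernel values, recurrence weights, the two-stage depth) is an illustrative assumption, not SpecTokenizer's actual architecture.

```python
import numpy as np

def strided_conv(x, kernel, stride=2):
    """Strided 1-D convolution: local smoothing plus temporal downsampling."""
    k = len(kernel)
    steps = (len(x) - k) // stride + 1
    return np.array([np.dot(x[i * stride:i * stride + k], kernel)
                     for i in range(steps)])

def rnn_pass(x, w_in=0.5, w_rec=0.5):
    """Minimal tanh RNN: a causal, streaming recurrence over the sequence."""
    h, out = 0.0, []
    for v in x:
        h = np.tanh(w_in * v + w_rec * h)
        out.append(h)
    return np.array(out)

def encode(spectrum_frames):
    """Alternate conv stages (multi-scale) with RNN stages (streaming context)."""
    x = strided_conv(spectrum_frames, kernel=np.array([0.25, 0.5, 0.25]))
    x = rnn_pass(x)
    x = strided_conv(x, kernel=np.array([0.5, 0.5]))
    return rnn_pass(x)
```

Each conv stage halves the frame rate, which is what makes the stack multi-scale; the causal RNN keeps the pipeline streamable.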

DiffEmotionVC: A Dual-Granularity Disentangled Diffusion Framework for Any-to-Any Emotional Voice Conversion
Emotional Voice Conversion is difficult due to the entanglement between content and speaker characteristics.
DiffEmotionVC:
- Introduces a dual-granularity emotion encoder that captures both utterance-level emotional context and frame-level acoustic detail
- Disentangles emotion features through gated cross-attention with an orthogonality-constr..
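A common way to impose an orthogonality constraint for disentanglement is to penalize the cross-correlation between the two embedding sets. The summary above is truncated, so the following is a generic sketch of such a loss, not DiffEmotionVC's exact formulation.

```python
import numpy as np

def orthogonality_loss(emotion, content, eps=1e-8):
    """Generic orthogonality penalty between two embedding sets.

    emotion, content: (batch, dim) embedding matrices.
    Rows are L2-normalized, then the squared pairwise cosine similarities
    are averaged; minimizing this pushes the emotion and content
    subspaces toward orthogonality.
    """
    e = emotion / (np.linalg.norm(emotion, axis=1, keepdims=True) + eps)
    c = content / (np.linalg.norm(content, axis=1, keepdims=True) + eps)
    return float(np.sum((e @ c.T) ** 2) / len(e))
```

The loss is zero when every emotion embedding is orthogonal to every content embedding and approaches one per pair when they are parallel.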

HuBERT-VIC: Improving Noise-Robust Automatic Speech Recognition of Speech Foundation Model via Variance-Invariance-Covariance Regularization
Speech foundation models are limited in terms of noise-robustness.
HuBERT-VIC:
- Trains the model with Variance, Invariance, and Covariance regularization objectives
- Adjusts the statistics of noisy speech representations to improve generalization to diverse noise types
Paper (INTERSPEECH 2025..
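The three regularization terms follow the standard VICReg recipe, which can be written compactly. This sketch applies the generic VICReg formulation to (clean, noisy) representation pairs; HuBERT-VIC's exact term weighting and where in the model it is applied are assumptions here.

```python
import numpy as np

def vic_regularization(z_clean, z_noisy, gamma=1.0, eps=1e-4):
    """VICReg-style terms on paired (clean, noisy) speech representations.

    z_clean, z_noisy: (batch, dim) representation matrices.
    Returns (variance, invariance, covariance) penalties.
    """
    # Invariance: clean and noisy views of the same utterance should match.
    inv = float(np.mean((z_clean - z_noisy) ** 2))

    var_terms, cov_terms = [], []
    for z in (z_clean, z_noisy):
        # Variance: hinge keeps each dimension's std above the margin gamma,
        # preventing representation collapse.
        std = np.sqrt(z.var(axis=0) + eps)
        var_terms.append(np.mean(np.maximum(0.0, gamma - std)))
        # Covariance: push off-diagonal covariance entries toward zero
        # so dimensions stay decorrelated.
        zc = z - z.mean(axis=0)
        cov = (zc.T @ zc) / (len(z) - 1)
        off_diag = cov - np.diag(np.diag(cov))
        cov_terms.append((off_diag ** 2).sum() / z.shape[1])

    return float(np.mean(var_terms)), inv, float(np.mean(cov_terms))
```

The invariance term pulls noisy representations toward clean ones, while the variance and covariance terms adjust the representation statistics, matching the summary's description.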

HuBERT-AGG: Aggregated Representation Distillation of Hidden-Unit BERT for Robust Speech Recognition
Self-Supervised Learning for Automatic Speech Recognition is limited in terms of noise robustness.
HuBERT-AGG:
- Learns noise-invariant SSL representations by distilling aggregated layer-wise representations
- In particular, uses a small portion of labeled data to compute a weighted sum over all hidden states of a pre-trained vanilla HuBERT via an aggre..
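The aggregation step itself, a learned weighted sum over all hidden layers, is straightforward to sketch. Only the aggregation is shown; how the weights are trained on the small labeled portion is not, and the softmax parameterization below is an assumption.

```python
import numpy as np

def aggregate_layers(hidden_states, weights):
    """Weighted sum over all HuBERT hidden-layer outputs.

    hidden_states: (num_layers, time, dim) stack of layer representations
    weights:       (num_layers,) unnormalized learnable scalars
    Returns the (time, dim) aggregated representation used as the
    distillation target.
    """
    w = np.exp(weights - weights.max())
    w = w / w.sum()                                # softmax over layers
    return np.tensordot(w, hidden_states, axes=1)  # (time, dim)
```

With zero (i.e. uniform) weights this reduces to a plain layer average; training the scalars lets the aggregator emphasize the layers most useful for the downstream task.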

PEFT-TTS: Parameter-Efficient Fine-Tuning for Low-Resource Text-to-Speech via Cross-Lingual Continual Learning
A model for low-resource Text-to-Speech is needed.
PEFT-TTS:
- Introduces three adapters for Parameter-Efficient Fine-Tuning
- Uses a Condition Adapter to improve text embeddings, a Prompt Adapter to refine input representations, and a DiT LoRA Adapter to improve generation efficiency
Paper (INTERSPEECH 2025): Paper Link
1. Introducti..
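Of the three adapters, the LoRA component follows a well-known pattern: a frozen base weight plus a trainable low-rank update. The sketch below is the standard LoRA formulation; its placement inside PEFT-TTS's DiT blocks and the rank/alpha values are assumptions.

```python
import numpy as np

class LoRALinear:
    """Frozen linear layer with a trainable low-rank (LoRA) update.

    y = x W^T + (alpha / r) * x A^T B^T
    Only A and B are trained; B is zero-initialized so fine-tuning
    starts exactly at the pre-trained behavior.
    """
    def __init__(self, weight, rank=4, alpha=8, rng=None):
        if rng is None:
            rng = np.random.default_rng(0)
        out_dim, in_dim = weight.shape
        self.weight = weight                                   # frozen
        self.A = rng.normal(scale=0.01, size=(rank, in_dim))   # trainable
        self.B = np.zeros((out_dim, rank))                     # trainable
        self.scale = alpha / rank

    def __call__(self, x):
        base = x @ self.weight.T
        delta = (x @ self.A.T) @ self.B.T * self.scale
        return base + delta
```

Because only `A` and `B` (rank × dim each) are updated, the trainable parameter count stays tiny, which is the point of parameter-efficient fine-tuning in the low-resource setting.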