CAM: Context-Aware Masking for Robust Speaker Verification
Speaker verification suffers from performance degradation caused by noise.
CAM
Builds a speaker embedding network that focuses on the speaker of interest and blurs unrelated noise.
Dynamically controls the masking threshold through an auxiliary context embedding that captures speaker and noise characteristics.
Paper (ICASSP 2021): Paper Link
1. Introduction
Speaker verification compares a test utterance with an enrollment..
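The context-controlled masking idea can be sketched as a soft mask whose threshold is predicted from the context embedding. This is a minimal illustration under stated assumptions, not the paper's architecture: the linear threshold head (`w_ctx`, `b_ctx`), the magnitude-based scoring, and the sharpness `beta` are all hypothetical choices.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def context_aware_mask(features, context, w_ctx, b_ctx, beta=5.0):
    """Soft mask whose threshold depends on an auxiliary context embedding.

    features: (T, D) frame-level features
    context:  (C,)  context embedding (speaker + noise characteristics)
    w_ctx, b_ctx: hypothetical linear head mapping the context embedding
                  to a per-dimension masking threshold
    """
    # Context-dependent threshold, one value per feature dimension
    threshold = w_ctx @ context + b_ctx                    # (D,)
    # Entries above the threshold are kept; the rest are attenuated ("blurred")
    mask = sigmoid(beta * (np.abs(features) - threshold))  # (T, D)
    return features * mask, mask

rng = np.random.default_rng(0)
T, D, C = 10, 8, 4
feats = rng.normal(size=(T, D))
ctx = rng.normal(size=C)
w = rng.normal(size=(D, C)) * 0.1
b = np.zeros(D)
masked, mask = context_aware_mask(feats, ctx, w, b)
```

Because the threshold is a function of the context embedding rather than a fixed constant, the same network can mask aggressively in noisy conditions and lightly in clean ones.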
SpecTokenizer: A Lightweight Streaming Codec in the Compressed Spectrum Domain
A lightweight neural audio codec is needed.
SpecTokenizer
A lightweight streaming codec that operates in the compressed spectral domain.
Performs multi-scale modeling in the compressed spectrum domain by alternating CNN and RNN layers.
Paper (INTERSPEECH 2025): Paper Link
1. Introduction
A Neural Audio Codec (NAC) compresses an audio signal into a discrete code sequence. However, En..
DiffEmotionVC: A Dual-Granularity Disentangled Diffusion Framework for Any-to-Any Emotional Voice Conversion
Emotional voice conversion is difficult due to the entanglement between content and speaker characteristics.
DiffEmotionVC
Introduces a dual-granularity emotion encoder that captures both utterance-level emotional context and frame-level acoustic detail.
Disentangles emotion features through gated cross-attention with an orthogonality-constr..
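Gated cross-attention, as named above, can be sketched in a few lines: content frames attend to emotion features, and a sigmoid gate decides how much of the attended emotion is injected back. A minimal single-head sketch with hypothetical weight names (`Wq`, `Wk`, `Wv`, `Wg`), not the paper's exact layer:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def gated_cross_attention(content, emotion, Wq, Wk, Wv, Wg, bg):
    """Cross-attention from content frames (queries) to emotion
    features (keys/values), modulated by a learned sigmoid gate.

    content: (T, D), emotion: (S, D)
    """
    Q, K, V = content @ Wq, emotion @ Wk, emotion @ Wv
    attn = softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V          # (T, D)
    # Gate computed from the content and the attended emotion jointly
    gate = 1.0 / (1.0 + np.exp(-(np.concatenate([content, attn], -1) @ Wg + bg)))
    return content + gate * attn                                 # residual injection

rng = np.random.default_rng(0)
D, T, S = 8, 6, 4
content = rng.normal(size=(T, D))
emotion = rng.normal(size=(S, D))
Wq, Wk, Wv = (rng.normal(size=(D, D)) * 0.1 for _ in range(3))
Wg = rng.normal(size=(2 * D, D)) * 0.1
out = gated_cross_attention(content, emotion, Wq, Wk, Wv, Wg, np.zeros(D))
```

When the gate saturates toward zero, the content passes through unchanged, which is what lets the gate act as a disentangling valve on the emotion pathway.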
HuBERT-VIC: Improving Noise-Robust Automatic Speech Recognition of Speech Foundation Model via Variance-Invariance-Covariance Regularization
Speech foundation models have limitations in noise robustness.
HuBERT-VIC
Trains the model with variance, invariance, and covariance regularization objectives.
Adjusts the statistics of noisy speech representations to improve generalization across diverse noise types.
Paper (INTERSPEECH 2025..
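The three regularization terms named in the title follow the general VICReg recipe: a variance hinge that keeps each representation dimension informative, an invariance (MSE) term between clean and noisy views, and a covariance penalty that decorrelates dimensions. A minimal NumPy sketch of these terms (the weighting and batch details are assumptions, not the paper's exact setup):

```python
import numpy as np

def vic_regularization(z, eps=1e-4, gamma=1.0):
    """Variance and covariance terms of a VICReg-style objective.

    z: (N, D) batch of representations.
    """
    n, d = z.shape
    # Variance term: hinge keeping the per-dimension std above gamma
    std = np.sqrt(z.var(axis=0) + eps)
    var_loss = np.mean(np.maximum(0.0, gamma - std))
    # Covariance term: penalize off-diagonal covariance entries
    zc = z - z.mean(axis=0)
    cov = (zc.T @ zc) / (n - 1)
    off_diag = cov - np.diag(np.diag(cov))
    cov_loss = (off_diag ** 2).sum() / d
    return var_loss, cov_loss

def invariance_loss(z_clean, z_noisy):
    """Invariance term: MSE between clean and noisy representations."""
    return np.mean((z_clean - z_noisy) ** 2)

rng = np.random.default_rng(0)
z = rng.normal(size=(32, 16))
v, c = vic_regularization(z)
inv = invariance_loss(z, z + 0.1 * rng.normal(size=z.shape))
```

Minimizing the invariance term pulls noisy representations toward their clean counterparts, while the variance and covariance terms prevent the collapse that pure invariance training would cause.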
HuBERT-AGG: Aggregated Representation Distillation of Hidden-Unit BERT for Robust Speech Recognition
Self-supervised learning for automatic speech recognition has limitations in noise robustness.
HuBERT-AGG
Learns noise-invariant SSL representations by distilling aggregated layer-wise representations.
In particular, uses a small portion of labeled data to compute a weighted sum over all hidden states of pre-trained vanilla HuBERT with an aggre..
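The weighted sum over all hidden states can be sketched directly: softmax-normalized per-layer scores combine the stack of layer outputs into one aggregated representation. A minimal sketch, assuming HuBERT-Base-like dimensions; the function and variable names are illustrative:

```python
import numpy as np

def softmax(w):
    e = np.exp(w - w.max())
    return e / e.sum()

def aggregate_hidden_states(hidden_states, layer_logits):
    """Weighted sum over all layer-wise hidden states.

    hidden_states: (L, T, D) — one (T, D) representation per layer
    layer_logits:  (L,) learnable scores, normalized with softmax
    """
    weights = softmax(layer_logits)                       # (L,)
    # Contract the layer axis: sum_l weights[l] * hidden_states[l]
    return np.tensordot(weights, hidden_states, axes=1)   # (T, D)

rng = np.random.default_rng(0)
L, T, D = 13, 50, 768   # e.g. 12 transformer layers + the CNN output
states = rng.normal(size=(L, T, D))
logits = np.zeros(L)    # uniform weights before any training
agg = aggregate_hidden_states(states, logits)
```

With zero logits the aggregation is just the mean over layers; training the logits on the small labeled portion lets the aggregator emphasize the layers most useful as a distillation target.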
PEFT-TTS: Parameter-Efficient Fine-Tuning for Low-Resource Text-to-Speech via Cross-Lingual Continual Learning
A model for low-resource text-to-speech is needed.
PEFT-TTS
Introduces three adapters for parameter-efficient fine-tuning.
Employs a Condition Adapter to improve text embeddings, a Prompt Adapter to refine input representations, and a DiT LoRA Adapter to improve generation efficiency.
Paper (INTERSPEECH 2025): Paper Link
1. Introducti..
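The LoRA adapter mentioned above follows the standard low-rank update: the frozen weight `W` is augmented with a trainable product `B @ A` scaled by `alpha / r`. A generic sketch of that mechanism (not the paper's DiT-specific placement):

```python
import numpy as np

def lora_forward(x, W, A, B, alpha=8.0):
    """Frozen linear layer plus a trainable low-rank (LoRA) update.

    x: (N, d_in), W: (d_out, d_in) frozen weight,
    A: (r, d_in), B: (d_out, r) trainable low-rank factors.
    Effective weight: W + (alpha / r) * B @ A
    """
    r = A.shape[0]
    return x @ W.T + (alpha / r) * (x @ A.T) @ B.T

rng = np.random.default_rng(0)
d_in, d_out, r = 16, 32, 4
x = rng.normal(size=(8, d_in))
W = rng.normal(size=(d_out, d_in))
A = rng.normal(size=(r, d_in)) * 0.01  # A initialized small
B = np.zeros((d_out, r))               # B starts at zero: update is a no-op at init
y = lora_forward(x, W, A, B)
```

Only `A` and `B` (2 * r * d parameters instead of d_out * d_in) are trained, which is what makes the fine-tuning parameter-efficient; initializing `B` to zero guarantees the adapted model starts exactly at the pre-trained behavior.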
SSPS: Self-Supervised Positive Sampling for Robust Self-Supervised Speaker Verification
Self-supervised learning for speaker verification uses only anchor-positive pairs from the same speaker.
SSPS
For a given anchor, applies clustering assignments and a memory queue in the latent space.
Finds appropriate positives that come from the same speaker but different recording conditions.
Paper (INTERSPEECH 2025): Paper Link
1. Introduction
Speaker Verification (SV), given a..
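The sampling step described above can be sketched as: assign the anchor to its nearest cluster centroid, then draw a positive from the memory-queue entries that share that cluster. A simplified sketch with hypothetical names; the actual clustering algorithm and queue update policy are not specified here:

```python
import numpy as np

def sample_positive(anchor, queue_embs, queue_clusters, centroids, rng):
    """Pick a positive for `anchor` from a memory queue.

    anchor:         (D,)   anchor embedding
    queue_embs:     (Q, D) embeddings stored in the memory queue
    queue_clusters: (Q,)   cluster assignment of each queue entry
    centroids:      (K, D) current clustering centroids
    """
    # Assign the anchor to its nearest centroid
    anchor_cluster = np.argmin(np.linalg.norm(centroids - anchor, axis=1))
    # Candidates: queue entries from the same cluster — presumed same
    # speaker, but recorded under different conditions than the anchor
    candidates = np.where(queue_clusters == anchor_cluster)[0]
    if candidates.size == 0:
        return anchor  # fall back to the anchor itself
    return queue_embs[rng.choice(candidates)]

rng = np.random.default_rng(0)
centroids = np.array([[0.0, 0.0], [10.0, 10.0]])
queue_embs = np.vstack([rng.normal(c, 0.1, size=(5, 2)) for c in centroids])
queue_clusters = np.repeat([0, 1], 5)
anchor = np.array([0.1, -0.1])  # clearly in cluster 0
pos = sample_positive(anchor, queue_embs, queue_clusters, centroids, rng)
```

Because the positive comes from a different recording than the anchor, the contrastive objective is pushed to encode speaker identity rather than channel or session artifacts.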
FreeCodec: A Disentangled Neural Speech Codec with Fewer Tokens
Neural speech codecs show performance degradation when using fewer tokens.
FreeCodec
Decomposes intrinsic speech properties using distinct frame-level encoders.
Improves encoding efficiency by quantizing the different frame-level information with dedicated quantizers.
Paper (INTERSPEECH 2025): Paper Link
1. Introduction
A Neural Speech Codec compresses speech signals with a limited number of bits while minimizing distortion..
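"Dedicated quantizers" means each decomposed stream gets its own codebook, sized to the information it carries. A toy sketch with nearest-neighbor vector quantization; the stream split (content vs. prosody) and the codebook sizes are illustrative assumptions:

```python
import numpy as np

def vq(frames, codebook):
    """Nearest-neighbor vector quantization of (T, D) frames."""
    # (T, K) squared distances to every codebook entry
    d2 = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    codes = d2.argmin(axis=1)           # discrete token per frame
    return codes, codebook[codes]       # tokens and their reconstructions

rng = np.random.default_rng(0)
T = 20
# Hypothetical decomposition into two frame-level streams
content, prosody = rng.normal(size=(T, 8)), rng.normal(size=(T, 4))
content_cb = rng.normal(size=(256, 8))  # larger dedicated codebook for content
prosody_cb = rng.normal(size=(32, 4))   # smaller dedicated codebook for prosody
c_codes, _ = vq(content, content_cb)
p_codes, _ = vq(prosody, prosody_cb)
```

Giving each property its own quantizer avoids spending the shared token budget on redundant information, which is why the decomposed design holds up with fewer tokens.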
Training-Free Voice Conversion with Factorized Optimal Transport
$k$NN-VC can be modified into a training-free pipeline.
MKL-VC
Replaces $k$NN regression with a factorized optimal transport map in the WavLM embedding subspace, derived from the Monge-Kantorovich Linear solution.
Handles non-uniform variance across dimensions to ensure effective feature transformation.
Paper (INTERSPEECH 2025): Paper Link
1. Introduction
Any-to-Any Voice Conversion ..
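Under a Gaussian model with diagonal (factorized) covariances, the Monge-Kantorovich linear map reduces to a per-dimension affine transform: shift by the means and rescale by the ratio of standard deviations, so each embedding dimension's variance is matched individually. A sketch under that diagonal assumption (the paper's exact factorization of the WavLM subspace may differ):

```python
import numpy as np

def factorized_ot_map(x, src_feats, tgt_feats, eps=1e-8):
    """Per-dimension Monge-Kantorovich linear map between two Gaussians
    with diagonal covariances:  T(x) = mu_t + (sigma_t / sigma_s) * (x - mu_s)

    Rescaling each dimension separately handles non-uniform variance
    across embedding dimensions.
    """
    mu_s, mu_t = src_feats.mean(axis=0), tgt_feats.mean(axis=0)
    sd_s = src_feats.std(axis=0) + eps
    sd_t = tgt_feats.std(axis=0) + eps
    return mu_t + (sd_t / sd_s) * (x - mu_s)

rng = np.random.default_rng(0)
# Toy source/target feature sets with deliberately non-uniform per-dim variance
src = rng.normal(loc=0.0, scale=1.0, size=(500, 6)) * np.linspace(0.5, 2.0, 6)
tgt = rng.normal(loc=3.0, scale=1.0, size=(500, 6)) * np.linspace(2.0, 0.5, 6)
mapped = factorized_ot_map(src, src, tgt)
```

The map is closed-form from the two feature sets' statistics, which is what makes the pipeline training-free: no regression model is fitted, yet the mapped features match the target distribution dimension by dimension.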
