
Discl-VC: Disentangled Discrete Tokens and In-Context Learning for Controllable Zero-Shot Voice Conversion
- Zero-shot voice conversion has difficulty accurately replicating the source speaker's speaking style
- Discl-VC
  - Disentangles content and prosody information from self-supervised speech representations
  - Synthesizes the target speaker's voice through a Flow Matching Transformer and in-context learning
- Paper (INTERSPEECH 2025): Paper Link
1..
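The excerpt does not show how the Flow Matching Transformer is trained. Below is a minimal sketch of the conditional flow matching objective under a linear interpolation path (the common OT-CFM setup, not necessarily Discl-VC's exact formulation; `cfm_training_target` is a hypothetical helper name):

```python
import numpy as np

rng = np.random.default_rng(0)

def cfm_training_target(x0, x1, t):
    """Linear-interpolation probability path x_t = (1-t)*x0 + t*x1
    with target velocity u = x1 - x0 (OT-style conditional flow matching)."""
    xt = (1.0 - t) * x0 + t * x1
    u = x1 - x0
    return xt, u

# toy example: noise sample -> target acoustic features (e.g. mel frames)
x0 = rng.standard_normal((4, 80))
x1 = rng.standard_normal((4, 80))
t = 0.3
xt, u = cfm_training_target(x0, x1, t)
# a model v_theta(xt, t, condition) would be regressed onto u with an MSE loss;
# at t=0 the path starts at x0, at t=1 it reaches x1
```

In-context learning would enter through the `condition` input (e.g. a prompt of target-speaker tokens), which this sketch omits.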

Audio Mamba: Selective State Spaces for Self-Supervised Audio Representations
- Selective state space models have recently been attracting attention
- Audio Mamba
  - Applies self-supervised learning to a selective state space model for audio representation learning
  - Learns general-purpose audio representations from randomly masked spectrogram patches
- Paper (INTERSPEECH 2024): Paper Link
1. Introduction
Transformers, across multiple domains and data modalities, repr..
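A sketch of the random spectrogram-patch masking used for this kind of masked self-supervised pretraining (the patch size and mask ratio here are illustrative assumptions, not Audio Mamba's actual hyperparameters):

```python
import numpy as np

rng = np.random.default_rng(0)

def mask_spectrogram_patches(spec, patch=(16, 16), mask_ratio=0.5, rng=rng):
    """Zero out a random subset of non-overlapping spectrogram patches.
    Returns the masked spectrogram and the boolean patch mask."""
    F, T = spec.shape
    pf, pt = patch
    n_cols = T // pt
    n_patches = (F // pf) * n_cols
    n_mask = int(round(mask_ratio * n_patches))
    idx = rng.permutation(n_patches)[:n_mask]
    masked = spec.copy()
    mask = np.zeros(n_patches, dtype=bool)
    mask[idx] = True
    for k in idx:
        i, j = divmod(k, n_cols)
        masked[i * pf:(i + 1) * pf, j * pt:(j + 1) * pt] = 0.0
    return masked, mask

spec = rng.standard_normal((128, 64))   # 128 mel bins x 64 frames
masked, mask = mask_spectrogram_patches(spec)
# 8 x 4 = 32 patches, half of them masked
```

The SSL model is then trained to reconstruct the masked patches from the visible ones.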

CAM++: A Fast and Efficient Network for Speaker Verification Using Context-Aware Masking
- ECAPA-TDNN suffers from high complexity and slow inference speed
- CAM++
  - Applies Context-Aware Masking to a densely-connected Time Delay Neural Network backbone
  - Applies multi-granularity pooling to capture contextual information at different levels
- Paper (INTERSPEECH 2023): Paper Link
1. Introduction
Speaker Verification (SV) ... voice characteristic..
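One possible reading of multi-granularity pooling is statistics pooling computed over temporal segments of different lengths and then combined. A toy sketch under that assumption (not CAM++'s exact design; `multi_granularity_pool` is a hypothetical name):

```python
import numpy as np

def stats_pool(x):
    """Mean + std pooling over the time axis: (D, T) -> (2D,)."""
    return np.concatenate([x.mean(axis=1), x.std(axis=1)])

def multi_granularity_pool(x, n_segments=(1, 4)):
    """Pool statistics at several temporal granularities and concatenate.
    n_segments lists how many equal chunks the time axis is split into."""
    feats = []
    T = x.shape[1]
    for n_seg in n_segments:
        bounds = np.linspace(0, T, n_seg + 1, dtype=int)
        seg_stats = [stats_pool(x[:, a:b]) for a, b in zip(bounds[:-1], bounds[1:])]
        feats.append(np.mean(seg_stats, axis=0))  # aggregate segment-level stats
    return np.concatenate(feats)

x = np.random.default_rng(0).standard_normal((16, 100))  # (channels, frames)
e = multi_granularity_pool(x)  # 2 granularities x (2 * 16) dims = 64
```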

CAM: Context-Aware Masking for Robust Speaker Verification
- Speaker verification suffers performance degradation due to noise
- CAM
  - Builds a speaker embedding network that focuses on the speaker of interest and blurs unrelated noise
  - Dynamically controls the masking threshold via an auxiliary context embedding that captures speaker and noise characteristics
- Paper (ICASSP 2021): Paper Link
1. Introduction
Speaker verification compares a test utterance with the enrollment..
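A toy sketch of context-aware masking as described above: a soft mask predicted from an utterance-level context embedding is applied to frame-level features, so speaker-relevant channels are kept and noise-dominated ones are attenuated (the linear projection `W` and all dimensions are illustrative assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def context_aware_mask(features, W, context):
    """Predict a per-channel soft ratio mask in (0, 1) from an auxiliary
    context embedding and apply it to frame-level features."""
    mask = sigmoid(W @ context)          # (D,) mask derived from context
    return features * mask[:, None], mask

rng = np.random.default_rng(0)
feats = rng.standard_normal((32, 50))    # (feature dim, frames)
ctx = rng.standard_normal(8)             # auxiliary context embedding
W = rng.standard_normal((32, 8)) * 0.1   # hypothetical mask predictor
masked, mask = context_aware_mask(feats, W, ctx)
```

Because the mask depends on the context embedding, its effective threshold shifts with the speaker and noise conditions of each utterance.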

SpecTokenizer: A Lightweight Streaming Codec in the Compressed Spectrum Domain
- A lightweight neural audio codec is needed
- SpecTokenizer
  - A lightweight streaming codec that operates in the compressed spectral domain
  - Performs multi-scale modeling in the compressed spectrum domain by alternating CNN and RNN layers
- Paper (INTERSPEECH 2025): Paper Link
1. Introduction
A Neural Audio Codec (NAC) compresses an audio signal into a discrete code sequence
BUT, En..
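A common way to obtain a "compressed spectrum" front-end is power-law compression of the STFT magnitude. A sketch under that assumption (the exponent and frame size are illustrative, not SpecTokenizer's actual settings):

```python
import numpy as np

def compress_spectrum(frame, n_fft=320, power=0.3):
    """Power-law compressed magnitude spectrum of a single windowed frame.
    The exponent < 1 compresses the dynamic range of the spectrum."""
    spec = np.fft.rfft(frame * np.hanning(len(frame)), n=n_fft)
    return np.abs(spec) ** power

frame = np.random.default_rng(0).standard_normal(320)  # one 20 ms frame @ 16 kHz
c = compress_spectrum(frame)  # (n_fft // 2 + 1,) non-negative features
```

The codec's CNN/RNN stack would then operate on sequences of such frames rather than on the raw waveform, which keeps per-frame computation low for streaming.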

DiffEmotionVC: A Dual-Granularity Disentangled Diffusion Framework for Any-to-Any Emotional Voice Conversion
- Emotional voice conversion is challenging due to the entanglement between content and speaker characteristics
- DiffEmotionVC
  - Introduces a dual-granularity emotion encoder that captures both utterance-level emotional context and frame-level acoustic detail
  - Disentangles emotion features through gated cross-attention, with an orthogonality-constr..
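An orthogonality constraint between two embedding sets is often written as the squared Frobenius norm of their cross-correlation; driving it to zero pushes the emotion and speaker subspaces apart. A sketch under that assumption (not necessarily DiffEmotionVC's exact loss):

```python
import numpy as np

def orthogonality_loss(E_emo, E_spk):
    """|| E_emo^T E_spk ||_F^2 : zero when every emotion direction is
    orthogonal to every speaker direction."""
    return np.linalg.norm(E_emo.T @ E_spk, ord='fro') ** 2

rng = np.random.default_rng(0)
E_emo = rng.standard_normal((16, 8))   # emotion embeddings (dim x batch)
E_spk = rng.standard_normal((16, 8))   # speaker embeddings (dim x batch)
loss = orthogonality_loss(E_emo, E_spk)
# disjoint standard-basis columns are exactly orthogonal -> loss 0
```

Added to the diffusion training objective, this term penalizes leakage of speaker information into the emotion features.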