
EmoReg: Directional Latent Vector Modeling for Emotional Intensity Regularization in Diffusion-based Voice Conversion
Emotional Voice Conversion aims to convert the source emotion to a given target emotion while preserving the linguistic content.
EmoReg:
Uses Self-Supervised Learning-based feature representations to control emotion intensity.
In addition, within the emotional embedding space, Unsupervised Directional Latent Vector Mod..

TTS-Transducer: End-to-End Speech Synthesis with Neural Transducer
A neural transducer can be used for Text-to-Speech.
TTS-Transducer:
Uses a transducer architecture to learn a monotonic alignment between tokenized text and speech codec tokens for the first codebook.
Uses a non-autoregressive Transformer to predict the remaining codes from the alignment extracted from the transducer loss.
Paper (ICASSP 2025): Paper Link
1. Introduction: Text-to-Speech (TTS)..

SD-HuBERT: Sentence-Level Self-Distillation Induces Syllabic Organization in HuBERT
Learning a sentence-level representation of speech can cause syllabic organization to emerge.
SD-HuBERT:
Fine-tunes a pre-trained HuBERT with an aggregator token that summarizes the entire utterance.
Uses a self-distillation objective, without supervision, to draw out salient syllabic structure.
In addition, with the Spoken Speech ABX benchmark, sentence-level representati..

M2R-Whisper: Multi-Stage and Multi-Scale Retrieval Augmentation for Enhancing Whisper
Whisper is limited in its ability to accurately recognize diverse subdialects.
M2R-Whisper:
Introduces In-Context Learning and Retrieval-Augmented techniques into Whisper.
Applies sentence-level in-context learning in the pre-processing stage and token-level $k$-Nearest Neighbor in the post-processing stage.
Paper (ICASSP 2025): Paper Link
1. Introduction: As for Whisper, Autom..
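The token-level $k$-Nearest Neighbor step mentioned in the M2R-Whisper summary can be pictured with a small sketch. It assumes a kNN-LM-style formulation, where a distribution built from the retrieved neighbors is linearly interpolated with the model's own token distribution. The excerpt does not confirm that this is the paper's exact method, and the function names, the datastore layout, and the `lam` and `temperature` parameters are all hypothetical.

```python
import math


def knn_token_distribution(query, datastore, k, temperature=1.0):
    """Build a token distribution from the k nearest datastore entries.

    datastore is a list of (key_vector, token) pairs (hypothetical layout).
    """
    # Squared L2 distance from the query to every stored key.
    scored = [
        (sum((q - x) ** 2 for q, x in zip(query, key)), token)
        for key, token in datastore
    ]
    nearest = sorted(scored, key=lambda pair: pair[0])[:k]

    # Closer neighbors get larger weights (softmax over negative distances).
    weights = [math.exp(-dist / temperature) for dist, _ in nearest]
    total = sum(weights)

    probs = {}
    for weight, (_, token) in zip(weights, nearest):
        probs[token] = probs.get(token, 0.0) + weight / total
    return probs


def interpolate(model_probs, knn_probs, lam=0.5):
    """Mix model and kNN distributions; lam is an assumed mixing weight."""
    tokens = set(model_probs) | set(knn_probs)
    return {
        t: (1.0 - lam) * model_probs.get(t, 0.0) + lam * knn_probs.get(t, 0.0)
        for t in tokens
    }


# Toy usage: two nearby keys vote for "a", one distant key votes for "b".
datastore = [([0.0, 0.0], "a"), ([0.0, 1.0], "a"), ([5.0, 5.0], "b")]
knn_probs = knn_token_distribution([0.0, 0.0], datastore, k=2)
mixed = interpolate({"a": 0.4, "b": 0.6}, knn_probs, lam=0.5)
```

In this toy case the retrieved neighbors shift probability mass toward "a", which is the intended effect of retrieval when the base model is unsure about a dialect-specific token.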

DecoupledSynth: Enhancing Zero-Shot Text-to-Speech via Factors Decoupling
Existing Zero-Shot Text-to-Speech models struggle to balance the linguistic, para-linguistic, and non-linguistic information in their intermediate representations.
DecoupledSynth:
Combines multiple self-supervised models to extract comprehensive, decoupled representations.
Uses a decoupled processing stage to support nuanced synthesis.
Paper (ICASSP 2025): Paper Link
1. I..

ECAPA-TDNN: Emphasized Channel Attention, Propagation and Aggregation in TDNN based Speaker Verification
Speaker verification relies on neural networks that extract speaker representations.
ECAPA-TDNN:
Restructures the initial frame layers into 1-dimensional Res2Net modules and introduces Squeeze-and-Excitation blocks to explicitly model channel interdependencies.
Aggregates and propagates features from different hierarchical levels and channe..
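The Squeeze-and-Excitation block named in the ECAPA-TDNN summary can be illustrated with a minimal, dependency-free sketch. It follows the generic 1-dimensional formulation: average each channel over time, pass the result through a bottleneck with ReLU and then a sigmoid, and rescale the channels. The function name, the list-of-lists layout, and the toy weights are assumptions made for illustration, not the paper's implementation.

```python
import math


def squeeze_excitation(features, w1, b1, w2, b2):
    """Apply a 1-D Squeeze-and-Excitation rescaling.

    features: C x T nested list (channels by time frames).
    w1: R x C bottleneck weights, b1: R biases.
    w2: C x R expansion weights, b2: C biases.
    """
    # Squeeze: one descriptor per channel, the mean over time.
    z = [sum(channel) / len(channel) for channel in features]

    # Excitation, part 1: bottleneck linear layer followed by ReLU.
    h = [
        max(0.0, sum(w * v for w, v in zip(row, z)) + b)
        for row, b in zip(w1, b1)
    ]

    # Excitation, part 2: expand back to C channels, sigmoid gives gates in (0, 1).
    s = [
        1.0 / (1.0 + math.exp(-(sum(w * v for w, v in zip(row, h)) + b)))
        for row, b in zip(w2, b2)
    ]

    # Scale: multiply every frame of a channel by that channel's gate.
    return [[gate * x for x in channel] for gate, channel in zip(s, features)]


# Toy usage: zero expansion weights give a gate of 0.5 for every channel.
features = [[1.0, 3.0], [2.0, 2.0]]
scaled = squeeze_excitation(
    features,
    w1=[[1.0, 1.0]], b1=[0.0],
    w2=[[0.0], [0.0]], b2=[0.0, 0.0],
)
```

Because the gates depend on statistics pooled over the whole utterance, each channel is reweighted using global context rather than only its local frames.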