MeanVC: Lightweight and Streaming Zero-Shot Voice Conversion via Mean Flows
Existing zero-shot voice conversion models require large parameter sizes
MeanVC
Supports streaming processing via a diffusion Transformer based on chunk-wise autoregressive denoising
Improves zero-shot voice conversion performance with only a single sampling step via mean flows
Paper (ICASSP 2026): Paper Link
1. Introduction
ACE-VC, SEF-VC, AdaptVC, and other zero-shot Voice Co..
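The single-step sampling mentioned above follows from the mean-flow idea: a network trained to predict the average velocity u(z, r, t) over an interval lets the sampler jump from noise at t = 1 to data at t = 0 in one step, z_0 = z_1 - u(z_1, 0, 1). A minimal sketch of that jump on a toy linear flow (the `avg_velocity` callable is a hypothetical stand-in, not MeanVC's model):

```python
import numpy as np

def one_step_sample(avg_velocity, z1):
    """One-step mean-flow sampling: z0 = z1 - (1 - 0) * u(z1, r=0, t=1).

    avg_velocity(z, r, t) is assumed to return the average velocity of the
    flow between times r and t; with a perfectly trained model this single
    step recovers the data endpoint exactly.
    """
    return z1 - avg_velocity(z1, 0.0, 1.0)

# Toy check with the linear flow z_t = (1 - t) * x + t * e, whose average
# velocity between any r and t is simply e - x (constant in time).
x = np.array([2.0, -1.0])          # "data"
e = np.array([0.5, 0.5])           # "noise"
u = lambda z, r, t: e - x          # exact average velocity for this flow
z1 = e                             # start from pure noise at t = 1
z0 = one_step_sample(u, z1)
print(np.allclose(z0, x))          # True: one step lands on the data
```

The same jump with a learned, approximate u would land near the data rather than exactly on it; the paper's contribution is making that approximation good enough for one-step voice conversion.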
MF-Speech: Achieving Fine-Grained and Compositional Control in Speech Generation via Factor Disentanglement
Generating expressive, controllable speech requires resolving the entanglement of speech factors and the coarse granularity of control mechanisms
MF-Speech
Decomposes the original speech signal into independent representations through multi-objective optimization, based on the MF-SpeechEncoder used as a factor purifier
As a conductor, the MF-Spee..
REWIND: Speech Time Reversal for Enhancing Speaker Representations in Diffusion-based Voice Conversion
Time-reversed speech retains the tonal patterns needed for speaker identification
REWIND
Introduces an augmentation strategy that leverages speaker representations learned from time-reversed speech
Applies it to a diffusion-based voice conversion model, preserving the speaker's unique vocal traits while minimizing interference from linguistic content
Paper (INTERSP..
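The augmentation described above reduces to flipping the waveform along the time axis before extracting speaker embeddings: reversed audio is unintelligible as language but keeps speaker-dependent tonal and timbral cues. A minimal sketch (the mono-waveform convention is an assumption, not from the paper):

```python
import numpy as np

def time_reverse(wav: np.ndarray) -> np.ndarray:
    """Reverse a mono waveform along the time axis.

    The reversed signal scrambles linguistic content while preserving
    speaker-dependent cues, so it can augment speaker-encoder training.
    """
    return wav[::-1].copy()

wav = np.array([0.1, -0.2, 0.3, 0.0])
rev = time_reverse(wav)
print(rev)                                   # samples in reverse order
print(np.allclose(time_reverse(rev), wav))   # True: reversal is an involution
```

In practice the reversal would be applied to full utterances (or chunks) before the speaker encoder, leaving the content pathway of the conversion model untouched.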
ZSDEVC: Zero-Shot Diffusion-based Emotional Voice Conversion with Disentangled Mechanism
Emotional voice conversion suffers from emotion accuracy and speech distortion problems
ZSDEVC
Leverages a diffusion framework with a disentangled mechanism and expressive guidance
Trains the model on a large emotional speech dataset
Paper (INTERSPEECH 2025): Paper Link
1. Introduction
Emotional Voice Conversion (EVC): linguistic content, speaker id..
Discl-VC: Disentangled Discrete Tokens and In-Context Learning for Controllable Zero-Shot Voice Conversion
Zero-shot voice conversion struggles to accurately replicate the source speaker's speaking style
Discl-VC
Disentangles content and prosody information from self-supervised speech representations
Synthesizes the target speaker's voice through a Flow Matching Transformer and in-context learning
Paper (INTERSPEECH 2025): Paper Link
1..
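The Flow Matching objective behind such a Transformer is conditional flow matching: regress a velocity model evaluated at z_t = (1 - t)·x0 + t·x1 onto the straight-line target x1 - x0. A minimal sketch on a toy flow where the true velocity is a known constant (nothing here reflects Discl-VC's actual architecture):

```python
import numpy as np

rng = np.random.default_rng(0)

def cfm_loss(v_model, x0, x1, t):
    """Conditional flow matching loss: predicted velocity at
    z_t = (1-t)*x0 + t*x1 regressed onto the target x1 - x0."""
    zt = (1 - t) * x0 + t * x1
    return np.mean((v_model(zt, t) - (x1 - x0)) ** 2)

x0 = rng.standard_normal(1000)        # "noise" samples
x1 = x0 + 3.0                         # "data": a constant shift of the noise
t = rng.uniform(size=1000)

good = lambda z, t: np.full_like(z, 3.0)   # true velocity of this toy flow
bad = lambda z, t: np.zeros_like(z)
print(cfm_loss(good, x0, x1, t))      # ~0: matches the target velocity
print(cfm_loss(bad, x0, x1, t))       # ~9: squared error of missing it by 3
```

A real model would condition the velocity on content/prosody tokens and the target-speaker prompt; the regression target stays the same.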
DiffEmotionVC: A Dual-Granularity Disentangled Diffusion Framework for Any-to-Any Emotional Voice Conversion
Emotional voice conversion is difficult due to the entanglement between content and speaker characteristics
DiffEmotionVC
Introduces a dual-granularity emotion encoder that captures both utterance-level emotional context and frame-level acoustic detail
Disentangles emotion features through gated cross-attention with an orthogonality-constr..
