
CycleFlow: Leveraging Cycle Consistency in Flow Matching for Speaker Style Adaptation
Existing voice conversion systems suffer from inaccurate pitch and low speaker adaptation quality.
CycleFlow
- Introduces Cycle Consistency into Conditional Flow Matching to train speaker timbre adaptation
- Improves speaker pitch adaptation quality through a Dual-CFM built on VoiceCFM and PitchCFM
Paper (ICASSP 2025): Paper Link
1. Introduction
Voice Conve..
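
For intuition, here is a minimal PyTorch sketch of how a cycle-consistency term can sit on top of a conditional flow matching (CFM) objective for voice conversion. The `vector_field`, `content_encoder`, and the short Euler sampler are illustrative assumptions, not CycleFlow's actual modules.

```python
import torch
import torch.nn.functional as F

def cfm_loss(vector_field, mel, content, spk):
    """Standard CFM regression: match the straight-line velocity mel - x0."""
    x0 = torch.randn_like(mel)                       # prior (noise) sample
    t = torch.rand(mel.size(0), 1, 1, device=mel.device)
    xt = (1 - t) * x0 + t * mel                      # linear probability path
    return F.mse_loss(vector_field(xt, t, content, spk), mel - x0)

def euler_sample(vector_field, content, spk, shape, steps=8, device="cpu"):
    """Crude Euler ODE integration from noise to a converted mel."""
    x = torch.randn(shape, device=device)
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((shape[0], 1, 1), i * dt, device=device)
        x = x + dt * vector_field(x, t, content, spk)
    return x

def cycle_consistency_loss(vector_field, content_encoder, mel_a, spk_a, spk_b):
    """Convert A -> B, then B -> A, and penalize the reconstruction error.
    Gradients flow through the short Euler rollout; the paper's exact
    training scheme may differ."""
    mel_b = euler_sample(vector_field, content_encoder(mel_a), spk_b,
                         mel_a.shape, device=mel_a.device)
    mel_a_rec = euler_sample(vector_field, content_encoder(mel_b), spk_a,
                             mel_a.shape, device=mel_a.device)
    return F.mse_loss(mel_a_rec, mel_a)
```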

Stable-TTS: Stable Speaker-Adaptive Text-to-Speech Synthesis via Prosody Prompting
Speaker-adaptive text-to-speech models are sensitive to the given target speech samples.
Stable-TTS
- Effectively captures the target speaker's timbre by leveraging the prosody of prior samples, a subset of a high-quality pre-training dataset
- Prevents overfitting to the target samples through a prior-preservation loss during fine-tuning
Paper (ICASSP 2025): Paper Link
1. Introduction
YourTTS, VA..
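
The prior-preservation idea is easy to show in code. Below is a minimal sketch of one fine-tuning step, assuming a generic placeholder `tts_loss`; the names and the simple L1 loss are illustrative, not Stable-TTS's API.

```python
import torch
import torch.nn.functional as F

def tts_loss(model, batch):
    # Placeholder reconstruction loss; a real TTS objective (duration,
    # mel regression, etc.) would go here.
    mel_pred = model(batch["text"], batch["speaker"])
    return F.l1_loss(mel_pred, batch["mel"])

def fine_tune_step(model, optimizer, target_batch, prior_batch, lam=1.0):
    """One update: fit the few target-speaker samples while also keeping
    the model's behavior on high-quality prior samples (regularization)."""
    loss_target = tts_loss(model, target_batch)   # few-shot target speaker
    loss_prior = tts_loss(model, prior_batch)     # pre-training prior subset
    loss = loss_target + lam * loss_prior         # prior-preservation term
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The prior term anchors the model to data it already fits well, so the few target samples cannot pull it into degenerate overfitting.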

Multilingual DistilWhisper: Efficient Distillation of Multi-Task Speech Models via Language-Specific Experts
Whisper still shows low performance on under-represented languages.
Multilingual DistilWhisper
- Applies knowledge distillation from Whisper-Large-V2
- Performs lightweight modular ASR fine-tuning through language-specific experts
Paper (ICASSP 2024): Paper Link
1. Introduction
In the Automatic Speech Recognition (ASR) task, Whisper shows strong per..
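
As a rough sketch of the two ingredients, the code below shows (i) a small language-specific expert added as a gated residual branch next to a frozen backbone layer and (ii) a temperature-scaled KL distillation loss against teacher logits. Module names, the soft gate, and hyperparameters are my assumptions, not the paper's code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LanguageExpert(nn.Module):
    """Small bottleneck FFN; one instance is trained per target language
    while the shared Whisper backbone stays frozen."""
    def __init__(self, d_model=768, d_hidden=256):
        super().__init__()
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_model)
        )
        self.gate = nn.Linear(d_model, 1)   # learned per-position gate

    def forward(self, hidden):
        # Gated residual branch, so the expert can be smoothly ignored.
        return hidden + torch.sigmoid(self.gate(hidden)) * self.ffn(hidden)

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """CE on labels plus temperature-scaled KL to the teacher distribution."""
    kd = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    ce = F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)), labels.view(-1)
    )
    return alpha * kd + (1 - alpha) * ce
```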

HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units
Self-supervised speech representation learning must cope with the following problems:
- Each input utterance contains multiple sound units
- No lexicon of the input sound units is available during the pre-training phase
- Sound units have variable lengths with no explicit segmentation
HuBERT
- To provide aligned target labels for a BERT-like prediction loss, uses an offline clus..
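
A minimal sketch of the recipe: offline k-means over acoustic features yields frame-level pseudo-labels ("hidden units"), and a BERT-like model is trained to predict those labels only at masked positions. The feature choice, zero-filling of masked frames, and span parameters are simplifications of HuBERT's actual setup.

```python
import numpy as np
import torch
import torch.nn.functional as F
from sklearn.cluster import KMeans

# 1) Offline clustering: assign each frame a discrete unit label.
frames = np.random.randn(10_000, 39).astype(np.float32)   # e.g. MFCC frames
units = KMeans(n_clusters=100, n_init=10).fit_predict(frames)

# 2) Masked prediction: cross-entropy only over the masked frames.
def masked_prediction_loss(model, feats, unit_labels, mask_prob=0.08, span=10):
    B, T, _ = feats.shape
    mask = torch.zeros(B, T, dtype=torch.bool, device=feats.device)
    starts = torch.rand(B, T, device=feats.device) < mask_prob
    for b, t in starts.nonzero().tolist():
        mask[b, t : t + span] = True                  # mask a span of frames
    masked = feats.clone()
    masked[mask] = 0.0          # stand-in for HuBERT's learned mask embedding
    logits = model(masked)      # [B, T, n_clusters]
    return F.cross_entropy(logits[mask], unit_labels[mask])
```

Because the loss is computed only where the input was masked, the model must infer the hidden units from context, which is what forces it to learn long-range acoustic-linguistic structure.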

FlowDec: A Flow-Based Full-Band General Audio Codec with High Perceptual Quality
A general full-band audio codec that also works at lower bitrates is needed.
FlowDec
- Uses non-adversarial codec training with a stochastic postfilter based on conditional flow matching
- Reduces the number of required postfilter evaluations without fine-tuning or distillation
Paper (ICLR 2025): Paper Link
1. Introduction
An audio codec converts an audio waveform into a compact, quantized representatio..
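
Below is a minimal sketch of training a conditional-flow-matching postfilter that refines the codec's decoded output toward the clean signal. Starting the probability path at the coded signal plus Gaussian noise (rather than at pure noise) is my assumption for why few ODE evaluations suffice; the actual FlowDec formulation may differ, and `postfilter` is a hypothetical network.

```python
import torch
import torch.nn.functional as F

def postfilter_cfm_loss(postfilter, clean, coded, sigma=0.5):
    """One CFM training step for a postfilter conditioned on the (degraded)
    codec output; clean/coded are [B, C, T] feature tensors."""
    x0 = coded + sigma * torch.randn_like(coded)   # start near the codec output
    t = torch.rand(clean.size(0), 1, 1, device=clean.device)
    xt = (1 - t) * x0 + t * clean                  # short path: coded -> clean
    v_target = clean - x0                          # straight-line velocity
    v_pred = postfilter(xt, t, coded)
    return F.mse_loss(v_pred, v_target)
```

Since the path begins close to the target, the ODE the sampler integrates is short, which is one way to cut the number of postfilter evaluations without distillation.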

Robust Data2Vec: Noise-Robust Speech Representation Learning for ASR by Combining Regression and Improved Contrastive Learning
A self-supervised pre-training method based on contrastive learning and a regression task can improve Automatic Speech Recognition performance.
Robust Data2Vec
- Jointly optimizes contrastive learning and the regression task in the pre-training stage
- Additionally uses patch-based non-semantic negative samples and positiv..
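
As a rough illustration of the joint objective, the sketch below combines a data2vec-style regression term (a student seeing noisy speech matches an EMA teacher's representation of the clean speech) with an InfoNCE-style contrastive term. All names, the frame-level positive pairing, and the loss weighting are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def joint_ssl_loss(student, teacher, clean, noisy, temperature=0.1):
    with torch.no_grad():
        target = teacher(clean)            # teacher sees clean speech
    pred = student(noisy)                  # student sees the noisy version

    # Regression term: frame-wise match to the teacher representation.
    reg = F.mse_loss(pred, target)

    # Contrastive term: each frame's positive is the teacher frame at the
    # same position; all other frames in the batch act as negatives.
    B, T, D = pred.shape
    p = F.normalize(pred.reshape(B * T, D), dim=-1)
    z = F.normalize(target.reshape(B * T, D), dim=-1)
    logits = p @ z.t() / temperature       # [B*T, B*T] similarity matrix
    labels = torch.arange(B * T, device=pred.device)
    con = F.cross_entropy(logits, labels)

    return reg + con

@torch.no_grad()
def ema_update(teacher, student, decay=0.999):
    """Teacher parameters track the student via exponential moving average."""
    for pt, ps in zip(teacher.parameters(), student.parameters()):
        pt.mul_(decay).add_(ps, alpha=1 - decay)
```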