
Robust Data2Vec: Noise-Robust Speech Representation Learning for ASR by Combining Regression and Improved Contrastive Learning
A self-supervised pre-training method based on contrastive learning and a regression task can improve Automatic Speech Recognition performance.
Robust Data2Vec
- Jointly optimizes the contrastive learning and regression tasks in the pre-training stage
- Additionally, patch-based non-semantic negative samples and positiv..
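To make the joint objective concrete, here is a minimal PyTorch sketch of combining a Data2Vec-style regression loss with a contrastive loss over sampled negatives; the function name, tensor shapes, and the simple weighted sum are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def joint_pretraining_loss(student, teacher, negatives, temperature=0.1, alpha=0.5):
    # student/teacher: (B, T, D) frame representations; negatives: (B, T, K, D).
    # Regression term: the student regresses the teacher's target representation.
    reg_loss = F.smooth_l1_loss(student, teacher)

    # Contrastive term: the teacher target is the positive; K sampled negatives.
    pos = F.cosine_similarity(student, teacher, dim=-1) / temperature                  # (B, T)
    neg = F.cosine_similarity(student.unsqueeze(2), negatives, dim=-1) / temperature   # (B, T, K)
    logits = torch.cat([pos.unsqueeze(-1), neg], dim=-1)   # positive sits at index 0
    labels = torch.zeros(logits.shape[:-1], dtype=torch.long)
    con_loss = F.cross_entropy(logits.flatten(0, 1), labels.flatten())

    # Joint objective: weighted sum of the regression and contrastive losses.
    return alpha * reg_loss + (1 - alpha) * con_loss
```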

Data2Vec-AQC: Search for the Right Teaching Assistant in the Teacher-Student Training Setup
Self-Supervised Learning can be used to obtain speech representations from unlabeled speech data.
Data2Vec-AQC
- Builds on Data2Vec by introducing data augmentation, quantized representations, and clustering
- Solves an additional self-supervised objective, a cross-contrastive loss, through the interaction of these modules
Paper (ICASSP 2023): Paper Link
1. Introduct..
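A cross-contrastive loss can be sketched as contrasting the student output of one augmented view against the teacher target of the other view. The following is a minimal sketch under assumed shapes, with in-batch negatives; it is not the paper's exact objective.

```python
import torch
import torch.nn.functional as F

def info_nce(anchor, positive, temperature=0.1):
    # In-batch InfoNCE over time steps: each anchor's positive is the matching
    # time step of the other view; all other steps serve as negatives.
    a = F.normalize(anchor.flatten(0, 1), dim=-1)    # (B*T, D)
    p = F.normalize(positive.flatten(0, 1), dim=-1)  # (B*T, D)
    logits = a @ p.t() / temperature                 # (B*T, B*T) similarity matrix
    targets = torch.arange(logits.size(0))
    return F.cross_entropy(logits, targets)

def cross_contrastive_loss(student_v1, student_v2, teacher_v1, teacher_v2):
    # Hypothetical sketch: student output of one augmented view is contrasted
    # against the (detached) teacher target of the *other* view, symmetrically.
    return 0.5 * (info_nce(student_v1, teacher_v2.detach())
                  + info_nce(student_v2, teacher_v1.detach()))
```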

DetailTTS: Learning Residual Detail Information for Zero-Shot Text-to-Speech
Existing text-to-speech systems often omit linguistic and acoustic detail.
DetailTTS
- A zero-shot text-to-speech model based on a Conditional Variational AutoEncoder
- Introduces a Prior Detail module and a Duration Detail module that capture residual detail information missed during alignment
Paper (ICASSP 2025): Paper Link
1. Introduction
Zero-shot Te..
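As a rough illustration of the CVAE idea with a residual detail path, here is a hypothetical conditional prior whose statistics are refined by a separate detail branch; the class name, layer choices, and the simple additive refinement are assumptions, not the paper's modules.

```python
import torch
import torch.nn as nn

class DetailPrior(nn.Module):
    # Hypothetical: a conditional prior over latent z given text features, with a
    # residual "detail" branch refining the prior statistics (illustrative only).
    def __init__(self, text_dim, latent_dim):
        super().__init__()
        self.base = nn.Linear(text_dim, 2 * latent_dim)
        self.detail = nn.Linear(text_dim, 2 * latent_dim)  # residual detail path

    def forward(self, text_feat):
        mu, logvar = (self.base(text_feat) + self.detail(text_feat)).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization trick
        return z, mu, logvar
```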

FunCodec: A Fundamental, Reproducible and Integrable Open-Source Toolkit for Neural Speech Codec
An open-source toolkit is needed for neural codecs such as SoundStream and EnCodec.
FunCodec
- An open-source codec that can be easily integrated into downstream tasks
- Supports a frequency-domain codec with lower computation and parameter complexity
Paper (ICASSP 2024): Paper Link
1. Introduction
A speech codec encodes speech into a compact representation ..
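Codecs in this family typically rely on residual vector quantization (RVQ) to produce compact discrete codes. Below is a generic RVQ sketch for illustration only; it is not FunCodec's API, and the shapes and names are assumptions.

```python
import torch

def residual_vector_quantize(z, codebooks):
    # Generic RVQ sketch: each codebook quantizes the residual left by the
    # previous stage. z: (T, D) latent vectors; codebooks: list of (K, D) tensors.
    residual, quantized, codes = z, torch.zeros_like(z), []
    for cb in codebooks:
        dist = torch.cdist(residual, cb)   # (T, K) distances to codewords
        idx = dist.argmin(dim=-1)          # nearest codeword per vector
        q = cb[idx]                        # (T, D) selected codewords
        quantized = quantized + q          # accumulate over stages
        residual = residual - q            # pass the residual to the next stage
        codes.append(idx)
    return quantized, torch.stack(codes, dim=0)  # quantized latent, per-stage codes
```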

VoicePrompter: Robust Zero-Shot Voice Conversion with Voice Prompt and Conditional Flow Matching
Zero-shot voice conversion still has limitations in terms of speaker similarity.
VoicePrompter
- Uses a factorization method that disentangles the speech components
- Introduces a DiT-based Conditional Flow Matching decoder that conditions on the factorized features and a voice prompt
- Combines various speaker features via Latent Mixup for in-context l..
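Latent Mixup amounts to convexly combining speaker latents from different utterances to synthesize new speaker conditions. A minimal sketch, assuming (B, D) speaker embeddings and a Beta-sampled mixing coefficient (both assumptions for illustration):

```python
import torch

def latent_mixup(speaker_latents, alpha=0.5):
    # Convexly mix each speaker latent with a randomly paired one in the batch.
    lam = torch.distributions.Beta(alpha, alpha).sample()   # mixing coefficient
    perm = torch.randperm(speaker_latents.size(0))          # random pairing
    return lam * speaker_latents + (1 - lam) * speaker_latents[perm]
```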

Data2Vec 2.0: Efficient Self-Supervised Learning with Contextualized Target Representations for Vision, Speech and Language
Self-supervised learning requires substantial computational resources.
Data2Vec 2.0
- Builds on Data2Vec to obtain rich contextualized target representations
- Amortizes the effort needed to build teacher representations through a fast convolutional decoder
Paper (ICML 2023): Paper Link
1. Introduction
Self-superv..
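The amortization idea can be sketched as running the expensive teacher once and reusing its targets across several differently-masked student views. The teacher/student callables, their signatures, and the masking scheme below are assumptions for illustration.

```python
import torch

def amortized_targets(teacher, student, x, num_masks=8, mask_ratio=0.5):
    # One (expensive) teacher pass produces contextualized targets ...
    with torch.no_grad():
        targets = teacher(x)                       # (B, T, D)
    # ... which are reused for several (cheap) masked student passes.
    losses = []
    for _ in range(num_masks):
        mask = torch.rand(x.shape[:2]) < mask_ratio   # random (B, T) mask
        pred = student(x, mask)                       # student sees a masked view
        losses.append(((pred - targets) ** 2)[mask].mean())
    return torch.stack(losses).mean()
```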

Data2Vec: A General Framework for Self-Supervised Learning in Speech, Vision and Language
Self-supervised learning has focused on a single modality.
Data2Vec
- A self-supervised framework that applies the same learning method to speech, NLP, and vision
- Uses a standard Transformer architecture and, in a self-distillation setup, predicts latent representations of the full input data based on a masked view of the input
- Instead of modality-specific targets, entire i..
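The self-distillation setup pairs an EMA teacher with a student that regresses the teacher's latents at masked positions. A minimal sketch, assuming teacher/student callables and a boolean time mask (the signatures are assumptions):

```python
import torch

@torch.no_grad()
def ema_update(teacher, student, decay=0.999):
    # Teacher weights track an exponential moving average of the student's.
    for t_param, s_param in zip(teacher.parameters(), student.parameters()):
        t_param.mul_(decay).add_(s_param, alpha=1 - decay)

def data2vec_step(teacher, student, x, mask):
    # Teacher sees the *full* input; student sees a masked view and regresses
    # the teacher's latent representations at the masked positions.
    with torch.no_grad():
        targets = teacher(x)          # (B, T, D) latents of the full input
    preds = student(x, mask)          # predictions from the masked view
    return torch.nn.functional.smooth_l1_loss(preds[mask], targets[mask])
```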

XLSR: Unsupervised Cross-Lingual Representation Learning for Speech Recognition
Cross-lingual speech representations can be obtained by pre-training a single model on multiple languages.
XLSR
- Builds on Wav2Vec 2.0 and jointly learns a quantization of latents shared across languages
- Additionally performs fine-tuning on labeled data
Paper (INTERSPEECH 2021): Paper Link
1. Introduction
Cross-lingual learning leverages other languages to improve model perfor..
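The shared quantization can be illustrated with a Wav2Vec 2.0-style Gumbel-softmax quantizer whose single codebook is used for every language; the class name and sizes below are assumptions, and real implementations use multiple codebook groups.

```python
import torch
import torch.nn.functional as F

class SharedQuantizer(torch.nn.Module):
    # Minimal sketch: one codebook shared across all languages, with
    # differentiable (Gumbel-softmax) code selection.
    def __init__(self, dim, num_codes=320):
        super().__init__()
        self.logits = torch.nn.Linear(dim, num_codes)
        self.codebook = torch.nn.Embedding(num_codes, dim)  # shared across languages

    def forward(self, z, tau=2.0):
        probs = F.gumbel_softmax(self.logits(z), tau=tau, hard=True)  # one-hot choice
        return probs @ self.codebook.weight                           # quantized latent
```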

UMETTS: A Unified Framework for Emotional Text-to-Speech Synthesis with Multimodal Prompts
Emotional Text-to-Speech (TTS) relies on oversimplified emotion labels or single-modality inputs, so it fails to effectively reflect human emotion.
UMETTS
- Uses an Emotion Prompt Alignment module and an Emotion Embedding-Induced TTS module to reflect emotional cues from multiple modalities
- The Emotion Prompt Alignment module uses contrastive learning to align text, audi..
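Contrastive alignment of emotion embeddings across modalities can be sketched CLIP-style: matched pairs are pulled together and mismatched in-batch pairs pushed apart. The function name and symmetric two-modality form are assumptions for illustration (the module also handles further modalities).

```python
import torch
import torch.nn.functional as F

def emotion_alignment_loss(text_emb, audio_emb, temperature=0.07):
    # CLIP-style symmetric contrastive loss between per-utterance emotion
    # embeddings from two modalities; diagonal entries are the matched pairs.
    t = F.normalize(text_emb, dim=-1)      # (B, D)
    a = F.normalize(audio_emb, dim=-1)     # (B, D)
    logits = t @ a.t() / temperature       # (B, B) similarity matrix
    labels = torch.arange(logits.size(0))
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))
```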