
XLSR: Unsupervised Cross-Lingual Representation Learning for Speech RecognitionMultiple language에서 single model을 pre-training 하여 cross-lingual speech representation을 얻을 수 있음XLSRWav2Vec 2.0을 기반으로 language 간에 share 되는 latent의 quantization을 jointly learning 함추가적으로 labeled data에서 fine-tuning을 수행논문 (INTERSPEECH 2021) : Paper Link1. IntroductionCross-Lingual learning은 other language를 활용하여 model perfor..

UMETTS: A Unified Framework for Emotional Text-to-Speech Synthesis with Multimodal PromptsEmotional Text-to-Speech (TTS)는 oversimplified emotional label이나 single-modality input에 의존하므로 human emotion을 효과적으로 반영하지 못함UMETTSEmotion Prompt Alignment module과 Emotion Embedding-Induced TTS module을 활용하여 multiple modality의 emotional cue를 반영Emotion Prompt Alignment module은 contrastive learning을 통해 text, audi..

VoiceGuider: Enhancing Out-of-Domain Performance in Parameter-Efficient Speaker-Adaptive Text-to-Speech via AutoguidanceSpeaker adaptive text-to-speech model에 paramter-efficient fine-tuning을 적용하는 경우, out-of-domain speaker에 대한 adaptation performance의 한계가 있음VoiceGuiderAutoguidance로 reinforce 된 speaker adaptive text-to-speech modelAutoguidance strengthening strategy를 통해 out-of-domain data에 대한 robus..

LoRA-Whisper: Parameter-Efficient and Extensible Multilingual ASRMultilingual Automatic Speech Recognition을 위해서는 language interference와 성능 저하 없는 new language incorporation이 필요함LoRA-WhisperWhisper에 LoRA matrix를 incorporate 하여 language interference를 완화LoRA와 language 간의 similarity를 활용하여 new language에 대한 성능을 개선논문 (ICASSP 2024) : Paper Link1. IntroductionAutomatic Speech Recognition (ASR)은 speech를 wr..