
XLSR: Unsupervised Cross-Lingual Representation Learning for Speech Recognition
Cross-lingual speech representations can be obtained by pre-training a single model on multiple languages
XLSR
Builds on Wav2Vec 2.0 and jointly learns a quantization of the latents shared across languages
Additionally performs fine-tuning on labeled data
Paper (INTERSPEECH 2021) : Paper Link
1. Introduction
Cross-lingual learning leverages other languages to improve model perfor..
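As a rough illustration of the shared-quantization idea, here is a minimal PyTorch sketch (not the paper's implementation): a single Gumbel-softmax quantizer is applied to latents from batches of different languages, so the discrete units are shared across languages. The class name, dimensions, and group/entry counts are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedQuantizer(nn.Module):
    """Toy Gumbel-softmax product quantizer shared across all languages (sketch only)."""
    def __init__(self, dim=256, num_groups=2, num_entries=64):
        super().__init__()
        self.num_groups, self.num_entries = num_groups, num_entries
        # one shared codebook: (groups, entries, dim // groups)
        self.codebook = nn.Parameter(torch.randn(num_groups, num_entries, dim // num_groups))
        self.logits_proj = nn.Linear(dim, num_groups * num_entries)

    def forward(self, z, tau=2.0):
        # z: (batch, time, dim) latents from the shared encoder
        B, T, D = z.shape
        logits = self.logits_proj(z).view(B, T, self.num_groups, self.num_entries)
        # hard one-hot samples with straight-through gradients
        codes = F.gumbel_softmax(logits, tau=tau, hard=True)
        # look up and concatenate the chosen codeword of each group
        q = torch.einsum("btge,ged->btgd", codes, self.codebook)
        return q.reshape(B, T, D)

# Latents from different languages pass through the same quantizer,
# so the resulting discrete units are shared across languages.
quantizer = SharedQuantizer()
z_en = torch.randn(4, 100, 256)   # e.g. an English batch
z_ko = torch.randn(4, 100, 256)   # e.g. a Korean batch
q_en, q_ko = quantizer(z_en), quantizer(z_ko)
```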

UMETTS: A Unified Framework for Emotional Text-to-Speech Synthesis with Multimodal Prompts
Emotional Text-to-Speech (TTS) relies on oversimplified emotion labels or single-modality inputs, so it cannot effectively reflect human emotion
UMETTS
Reflects emotional cues from multiple modalities through an Emotion Prompt Alignment module and an Emotion Embedding-Induced TTS module
The Emotion Prompt Alignment module uses contrastive learning to align text, audi..
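Contrastive alignment of multimodal emotion prompts can be sketched as a symmetric InfoNCE loss over paired text/audio emotion embeddings. This is a generic CLIP-style formulation, not UMETTS's exact objective; the embedding dimension and temperature are assumptions.

```python
import torch
import torch.nn.functional as F

def emotion_contrastive_loss(text_emb, audio_emb, temperature=0.07):
    """Symmetric InfoNCE: paired text/audio emotion embeddings are pulled
    together, unpaired ones within the batch are pushed apart."""
    text_emb = F.normalize(text_emb, dim=-1)
    audio_emb = F.normalize(audio_emb, dim=-1)
    logits = text_emb @ audio_emb.t() / temperature           # (B, B) similarity matrix
    targets = torch.arange(text_emb.size(0), device=text_emb.device)
    loss_t2a = F.cross_entropy(logits, targets)               # text -> audio direction
    loss_a2t = F.cross_entropy(logits.t(), targets)           # audio -> text direction
    return 0.5 * (loss_t2a + loss_a2t)

# toy usage: 8 paired emotion prompts projected into a shared 128-d space
loss = emotion_contrastive_loss(torch.randn(8, 128), torch.randn(8, 128))
```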

VoiceGuider: Enhancing Out-of-Domain Performance in Parameter-Efficient Speaker-Adaptive Text-to-Speech via Autoguidance
When parameter-efficient fine-tuning is applied to a speaker-adaptive text-to-speech model, adaptation performance for out-of-domain speakers is limited
VoiceGuider
A speaker-adaptive text-to-speech model reinforced with autoguidance
Improves robustness on out-of-domain data through an autoguidance strengthening strategy..
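Autoguidance in general means guiding a model with a weaker version of itself and extrapolating away from the weak prediction. A minimal sketch of that idea follows; it is not the paper's exact formulation, and strong_model / weak_model are hypothetical noise predictors.

```python
import torch

def autoguided_score(strong_model, weak_model, x_t, t, w=2.0):
    """Autoguidance-style sketch: extrapolate the strong model's prediction
    away from a weaker/degraded model's prediction (w > 1 strengthens guidance)."""
    eps_strong = strong_model(x_t, t)   # e.g. the fully adapted speaker model
    eps_weak = weak_model(x_t, t)       # e.g. an under-trained guiding model
    return eps_weak + w * (eps_strong - eps_weak)

# toy callables standing in for noise predictors
strong = lambda x, t: x * 0.9
weak = lambda x, t: x * 0.5
x_t = torch.randn(2, 80, 100)           # e.g. a noisy mel-spectrogram batch
eps = autoguided_score(strong, weak, x_t, t=torch.tensor(0.3))
```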

LoRA-Whisper: Parameter-Efficient and Extensible Multilingual ASR
Multilingual Automatic Speech Recognition requires incorporating new languages without language interference or performance degradation
LoRA-Whisper
Mitigates language interference by incorporating LoRA matrices into Whisper
Leverages LoRA and the similarity between languages to improve performance on new languages
Paper (ICASSP 2024) : Paper Link
1. Introduction
Automatic Speech Recognition (ASR) converts speech into wr..
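A LoRA adapter replaces full fine-tuning with a frozen base weight plus a trainable low-rank update. Below is a minimal PyTorch sketch; the class name, rank, and the 384-dimensional toy projection are illustrative rather than taken from LoRA-Whisper.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base projection plus a low-rank trainable update:
    y = W x + (alpha / r) * B A x."""
    def __init__(self, base: nn.Linear, r=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False            # the Whisper weights stay frozen
        self.lora_A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + (x @ self.lora_A.t() @ self.lora_B.t()) * self.scale

# one adapter per language could be attached to e.g. attention projections;
# here a single toy projection stands in for a Whisper linear layer
proj = LoRALinear(nn.Linear(384, 384))
y = proj(torch.randn(2, 50, 384))
```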

CrisperWhisper: Accurate Timestamps on Verbatim Speech Transcriptions
Adjusting Whisper's tokenizer can improve word-level timestamp precision
CrisperWhisper
Applies dynamic time warping to the cross-attention scores of the Whisper decoder
Improves robustness through additional fine-tuning
Paper (INTERSPEECH 2024) : Paper Link
1. Introduction
Large-scale, weakly supervised learning shows strong performance in Automatic Speech Recognition (ASR)
In particular, S..
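Deriving word-level timestamps from cross-attention can be sketched as running dynamic time warping over a (token x frame) cost matrix built from attention scores. The NumPy example below is a generic DTW sketch with a made-up attention matrix and an assumed 20 ms frame hop, not CrisperWhisper's exact procedure.

```python
import numpy as np

def dtw_path(cost):
    """Plain dynamic time warping over a (tokens, frames) cost matrix,
    returning the monotonic alignment path."""
    n, m = cost.shape
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = cost[i - 1, j - 1] + min(acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1])
    # backtrack from the end of both sequences
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]

# toy cross-attention: 5 tokens attending over 40 audio frames (20 ms each)
attn = np.random.rand(5, 40)
path = dtw_path(-attn)                       # high attention = low cost
# first/last aligned frame per token gives rough word-level start/end times
for tok in range(5):
    frames = [f for t, f in path if t == tok]
    print(tok, frames[0] * 0.02, frames[-1] * 0.02)
```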

WaveFM: A High-Fidelity and Efficient Vocoder based on Flow Matching
Flow Matching provides robust training for diffusion models, but directly applying it to a neural vocoder degrades audio quality
WaveFM
Adopts a mel-conditioned prior distribution instead of a standard Gaussian prior to minimize transportation cost
Improves audio quality by incorporating a refined multi-resolution STFT loss
Additionally, to improve inference speed, a consistency distillation me..
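Conditional flow matching with a non-Gaussian prior can be sketched as sampling x0 from a mel-derived distribution, interpolating linearly toward the waveform, and regressing the velocity. The code below is a generic sketch under that assumption, with a placeholder model and prior mean; it does not reproduce WaveFM's actual prior construction or its STFT loss.

```python
import torch
import torch.nn.functional as F

def flow_matching_loss(model, audio, mel_prior_mean, sigma=0.1):
    """Conditional flow matching sketch with a mel-derived prior:
    x0 ~ N(mel_prior_mean, sigma^2) instead of a standard Gaussian,
    straight path x_t = (1 - t) x0 + t x1, target velocity x1 - x0."""
    x1 = audio                                          # (B, T) waveform target
    x0 = mel_prior_mean + sigma * torch.randn_like(x1)  # mel-conditioned prior sample
    t = torch.rand(x1.size(0), 1)                       # one time per example
    x_t = (1 - t) * x0 + t * x1
    v_target = x1 - x0
    v_pred = model(x_t, t)
    return F.mse_loss(v_pred, v_target)

# toy usage: the "model" and the mel-derived prior mean are placeholders
B, T = 4, 16000
model = lambda x, t: torch.zeros_like(x)
mel_prior_mean = torch.zeros(B, T)    # e.g. a waveform-rate signal decoded from the mel
loss = flow_matching_loss(model, torch.randn(B, T), mel_prior_mean)
```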