
Data2Vec 2.0: Efficient Self-Supervised Learning with Contextualized Target Representations for Vision, Speech and Language
Self-supervised learning requires substantial computational resources.
Data2Vec 2.0
- Builds on Data2Vec to obtain rich contextualized target representations
- Amortizes the effort required to build teacher representations through a fast convolutional decoder
Paper (ICML 2023): Paper Link
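A minimal sketch of the multi-mask distillation idea, assuming generic encoder/decoder modules and a simple multiplicative mask: the EMA teacher encodes the full input once, and its contextualized targets are reused across several masked views of the same sample, which is where the amortization comes from. Module and method names here are hypothetical, not the paper's code.

```python
# Sketch of Data2Vec 2.0-style multi-mask training (hypothetical names;
# the real model uses a Transformer encoder and a light conv decoder).
import copy
import torch
import torch.nn as nn

class MultiMaskDistiller(nn.Module):
    def __init__(self, encoder: nn.Module, decoder: nn.Module, ema_decay: float = 0.999):
        super().__init__()
        self.student = encoder
        self.decoder = decoder                    # fast convolutional decoder
        self.teacher = copy.deepcopy(encoder)     # EMA copy, no gradients
        for p in self.teacher.parameters():
            p.requires_grad_(False)
        self.ema_decay = ema_decay

    @torch.no_grad()
    def ema_update(self):
        for ps, pt in zip(self.student.parameters(), self.teacher.parameters()):
            pt.mul_(self.ema_decay).add_(ps, alpha=1 - self.ema_decay)

    def forward(self, x: torch.Tensor, masks: list) -> torch.Tensor:
        # The teacher sees the full input ONCE; its contextualized targets
        # are reused for every masked view, amortizing the teacher cost.
        with torch.no_grad():
            targets = self.teacher(x)
        loss = 0.0
        for m in masks:                           # M masked versions of x
            pred = self.decoder(self.student(x * m))
            loss = loss + nn.functional.mse_loss(pred, targets)
        return loss / len(masks)
```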

Data2Vec: A General Framework for Self-Supervised Learning in Speech, Vision and Language
Self-supervised learning has typically focused on a single modality.
Data2Vec
- A self-supervised framework that applies the same learning method to speech, NLP, and vision
- Uses a standard Transformer architecture and, in a self-distillation setup, predicts latent representations of the full input data based on a masked view of the input
- Predicts contextualized latent representations over the entire input instead of modality-specific targets
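The contextualized targets can be illustrated with a small sketch: assuming the EMA teacher exposes its per-block outputs, Data2Vec-style targets average the normalized outputs of the top-K Transformer blocks, so each target position carries information from the entire input. Shapes, K, and the normalization choice below are illustrative.

```python
# Sketch of Data2Vec-style target construction from teacher layer outputs.
import torch
import torch.nn.functional as F

def build_targets(layer_outputs: list, top_k: int = 8) -> torch.Tensor:
    """layer_outputs: list of (batch, seq, dim) tensors, one per block."""
    top = layer_outputs[-top_k:]
    # Normalize each layer over time before averaging (stabilizes targets);
    # instance_norm expects (batch, channels, length), hence the transposes.
    top = [F.instance_norm(h.transpose(1, 2)).transpose(1, 2) for h in top]
    return sum(top) / top_k

# The student then regresses these targets only at masked positions, e.g.
# loss = F.smooth_l1_loss(student_pred[mask], targets[mask])
```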

XLSR: Unsupervised Cross-Lingual Representation Learning for Speech Recognition
Cross-lingual speech representations can be obtained by pre-training a single model on multiple languages.
XLSR
- Builds on Wav2Vec 2.0 and jointly learns a quantization of the latents shared across languages
- Additionally fine-tunes on labeled data
Paper (INTERSPEECH 2021): Paper Link
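One concrete piece of the multilingual setup that a few lines can clarify is how training batches are drawn across languages: XLSR samples languages from a multinomial smoothed by an upsampling exponent, so low-resource languages are seen more often than their raw hours would suggest. The hour counts below are made-up example values.

```python
# Sketch of XLSR-style multilingual batch sampling: p_l ∝ (n_l / N) ** alpha.
import numpy as np

def language_sampling_probs(hours: dict, alpha: float = 0.5) -> dict:
    langs = list(hours)
    p = np.array([hours[l] for l in langs], dtype=np.float64)
    p = (p / p.sum()) ** alpha   # alpha < 1 upsamples low-resource languages
    p /= p.sum()
    return dict(zip(langs, p))

probs = language_sampling_probs({"en": 4000.0, "sw": 30.0, "ky": 17.0})
# "sw" and "ky" now receive proportionally more batches than raw hours imply.
```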

UMETTS: A Unified Framework for Emotional Text-to-Speech Synthesis with Multimodal Prompts
Emotional Text-to-Speech (TTS) relies on oversimplified emotional labels or single-modality inputs, so it fails to reflect human emotion effectively.
UMETTS
- Leverages an Emotion Prompt Alignment module and an Emotion Embedding-Induced TTS module to incorporate emotional cues from multiple modalities
- The Emotion Prompt Alignment module aligns text, audio, and visual emotional cues through contrastive learning
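A hedged sketch of the contrastive alignment idea behind the Emotion Prompt Alignment module, assuming precomputed text and audio emotion embeddings where matched pairs share a batch row; this is a generic symmetric InfoNCE loss, not the paper's exact objective.

```python
# Symmetric InfoNCE between text and audio emotion embeddings.
import torch
import torch.nn.functional as F

def info_nce(text_emb: torch.Tensor, audio_emb: torch.Tensor, tau: float = 0.07):
    """text_emb, audio_emb: (batch, dim); row i of each is a matched pair."""
    t = F.normalize(text_emb, dim=-1)
    a = F.normalize(audio_emb, dim=-1)
    logits = t @ a.t() / tau                      # (batch, batch) similarities
    labels = torch.arange(t.size(0), device=t.device)
    # Pull matched pairs together in both directions, push apart the rest.
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))
```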

VoiceGuider: Enhancing Out-of-Domain Performance in Parameter-Efficient Speaker-Adaptive Text-to-Speech via Autoguidance
Applying parameter-efficient fine-tuning to a speaker-adaptive text-to-speech model limits adaptation performance on out-of-domain speakers.
VoiceGuider
- A speaker-adaptive text-to-speech model reinforced with autoguidance
- Improves robustness on out-of-domain data through an autoguidance strengthening strategy
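Autoguidance itself is easy to state in code: at each sampling step, the strong model's prediction is extrapolated away from that of a deliberately weakened guiding model (for example an under-trained checkpoint). The function names and guidance weight below are hypothetical.

```python
# Sketch of autoguidance at sampling time; `strong` and `weak` both map
# (x_t, t, cond) -> predicted noise.
import torch

def autoguided_eps(strong, weak, x_t: torch.Tensor, t: torch.Tensor, cond, w: float = 2.0):
    eps_s = strong(x_t, t, cond)
    eps_w = weak(x_t, t, cond)
    # w > 1 pushes samples toward regions the strong model rates higher than
    # the weak one, sharpening generation for out-of-domain speakers.
    return eps_w + w * (eps_s - eps_w)
```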

LoRA-Whisper: Parameter-Efficient and Extensible Multilingual ASR
Multilingual Automatic Speech Recognition requires mitigating language interference and incorporating new languages without performance degradation.
LoRA-Whisper
- Incorporates LoRA matrices into Whisper to mitigate language interference
- Leverages the similarity between LoRAs and languages to improve performance on new languages
Paper (ICASSP 2024): Paper Link
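A minimal sketch of a LoRA-augmented linear layer of the kind that would be attached to Whisper's projection matrices; the base weights stay frozen and only the low-rank A/B pair is trained, so each language can carry its own adapter without interfering with the others. Rank and scaling values are illustrative.

```python
# Generic LoRA wrapper around a frozen linear layer.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():          # freeze Whisper weights
            p.requires_grad_(False)
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))  # zero init
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Frozen base path plus trainable low-rank update.
        return self.base(x) + (x @ self.A.t() @ self.B.t()) * self.scale
```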

CrisperWhisper: Accurate Timestamps on Verbatim Speech Transcriptions
Adjusting Whisper's tokenizer can improve word-level timestamp precision.
CrisperWhisper
- Applies dynamic time warping to the cross-attention scores of the Whisper decoder
- Improves robustness through additional fine-tuning
Paper (INTERSPEECH 2024): Paper Link
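The timestamp extraction reduces to dynamic time warping over a (tokens x frames) cost matrix derived from the decoder's cross-attention weights (e.g., cost = 1 - attention). Below is a generic DTW sketch, not the paper's exact implementation.

```python
# Standard DTW over a (tokens, frames) cost matrix, returning the
# monotonic token->frame alignment path.
import numpy as np

def dtw_path(cost: np.ndarray):
    n, m = cost.shape
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = cost[i - 1, j - 1] + min(
                acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1])
    # Backtrack from (n, m) to recover the alignment.
    i, j, path = n, m, []
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]
```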

WaveFM: A High-Fidelity and Efficient Vocoder based on Flow Matching
Flow Matching provides robust training for diffusion models, but applying it directly to neural vocoders degrades audio quality.
WaveFM
- Adopts a mel-conditioned prior distribution instead of a standard Gaussian prior to minimize transportation cost
- Incorporates a refined multi-resolution STFT loss to improve audio quality
- Additionally applies a consistency distillation method to improve inference speed
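A sketch of a flow-matching training step with a mel-conditioned prior, assuming hypothetical mel_to_prior and vector_field networks: sampling the source point near a mel-derived signal instead of pure noise shortens the transport path the vector field must learn.

```python
# Conditional flow matching with a mel-conditioned prior (linear path,
# constant target velocity x1 - x0). Shapes: audio (batch, samples).
import torch
import torch.nn.functional as F

def flow_matching_loss(vector_field, mel_to_prior, audio: torch.Tensor, mel: torch.Tensor):
    # Source sample drawn around a mel-derived signal, not N(0, I).
    x0 = mel_to_prior(mel) + torch.randn_like(audio)
    t = torch.rand(audio.size(0), 1, device=audio.device)
    x_t = (1 - t) * x0 + t * audio                 # point on the linear path
    target_v = audio - x0                          # target velocity along it
    return F.mse_loss(vector_field(x_t, t, mel), target_v)
```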

Continuous Autoregressive Modeling with Stochastic Monotonic Alignment for Speech Synthesis
Autoregressive modeling can be leveraged for speech synthesis.
CAM
- Uses a Variational AutoEncoder with a multi-modal latent space and an autoregressive model that adopts a Gaussian Mixture Model as the conditional probability distribution
- In particular, simplifies the training/inference pipeline through continuous speech representations in the Variational AutoEncoder's latent space
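The Gaussian Mixture Model head can be sketched in a few lines: at each autoregressive step the decoder state is projected to mixture weights, means, and scales over the continuous VAE latent, and training minimizes the mixture negative log-likelihood. Dimensions and naming are hypothetical.

```python
# GMM output head for continuous autoregressive prediction of VAE latents.
import torch
import torch.nn as nn

class GMMHead(nn.Module):
    def __init__(self, hidden: int, latent: int, n_mix: int = 8):
        super().__init__()
        self.n_mix, self.latent = n_mix, latent
        self.proj = nn.Linear(hidden, n_mix * (1 + 2 * latent))

    def nll(self, h: torch.Tensor, z: torch.Tensor) -> torch.Tensor:
        """h: (batch, hidden) decoder state; z: (batch, latent) next latent."""
        out = self.proj(h).view(-1, self.n_mix, 1 + 2 * self.latent)
        logit_w = out[..., 0]                             # mixture logits
        mu = out[..., 1:1 + self.latent]                  # component means
        log_sigma = out[..., 1 + self.latent:]            # component log-scales
        comp = torch.distributions.Normal(mu, log_sigma.exp())
        log_prob = comp.log_prob(z.unsqueeze(1)).sum(-1)  # per-component logp
        # Mixture NLL: -log sum_k w_k N(z; mu_k, sigma_k)
        return -torch.logsumexp(torch.log_softmax(logit_w, -1) + log_prob, -1).mean()
```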