
SpeechFlow: Generative Pre-Training for Speech with Flow Matching
A single pre-trained generative model can be reused across a variety of downstream tasks.
SpeechFlow
- Pre-trains on untranscribed speech using Flow Matching with masked conditions (see the sketch below)
- Fine-tunes the pre-trained generative model on task-specific data to address a variety of tasks
Paper (ICLR 2024): Paper Link
1. Introduction
Discriminative models for speech recognition, enhancement, separat..
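
As a rough illustration of the masked-condition Flow Matching objective summarized above, here is a minimal PyTorch sketch that trains a toy velocity estimator on unlabeled speech features. The network, feature size, mask ratio, and `sigma_min` value are illustrative assumptions, not the paper's configuration.

```python
# Minimal sketch of flow-matching pre-training with a masked condition.
# All module names and hyperparameters here are assumptions for illustration.
import torch
import torch.nn as nn

class VelocityNet(nn.Module):
    """Toy velocity-field estimator over per-frame speech features."""
    def __init__(self, dim=80, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * dim + 1, hidden), nn.GELU(),
            nn.Linear(hidden, dim),
        )

    def forward(self, x_t, cond, t):
        # Broadcast the scalar time step over all frames and concatenate.
        t = t.expand(*x_t.shape[:-1], 1)
        return self.net(torch.cat([x_t, cond, t], dim=-1))

def flow_matching_step(model, x1, mask_ratio=0.7, sigma_min=1e-4):
    # x1: clean speech features (batch, frames, dim); no transcripts needed.
    x0 = torch.randn_like(x1)                      # noise sample
    t = torch.rand(x1.size(0), 1, 1)               # random time in [0, 1]
    x_t = (1 - (1 - sigma_min) * t) * x0 + t * x1  # conditional OT path
    u_t = x1 - (1 - sigma_min) * x0                # target velocity
    # Masked condition: keep a random subset of frames, zero out the rest.
    keep = (torch.rand(x1.shape[:2]) > mask_ratio).float().unsqueeze(-1)
    pred = model(x_t, x1 * keep, t)
    # Average the squared error over the masked (hidden) frames only.
    masked = 1.0 - keep
    return ((pred - u_t) ** 2 * masked).sum() / (masked.sum() * x1.size(-1)).clamp(min=1.0)

model = VelocityNet()
loss = flow_matching_step(model, torch.randn(4, 100, 80))
loss.backward()
```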

VQ-Wav2Vec: Self-Supervised Learning of Discrete Speech Representations
Discrete representations of audio segments can be learned through Wav2Vec-style self-supervised context prediction.
VQ-Wav2Vec
- Quantizes dense representations using Gumbel-Softmax or online $k$-means clustering (see the sketch below)
- Discretization allows BERT pre-training to be applied directly
Paper (ICLR 2020): Paper Link
1. Introduction
Learning discrete speech representations..
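
A minimal sketch of the Gumbel-Softmax quantization path mentioned above (the online $k$-means variant and the multi-group codebook are omitted); the codebook size, feature dimension, and temperature are assumed values, not the paper's settings.

```python
# Minimal sketch of Gumbel-Softmax vector quantization for dense speech
# features; sizes and names are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GumbelQuantizer(nn.Module):
    def __init__(self, dim=512, num_codes=320):
        super().__init__()
        self.logits = nn.Linear(dim, num_codes)      # dense feature -> code logits
        self.codebook = nn.Embedding(num_codes, dim) # learned code vectors

    def forward(self, z, tau=2.0):
        # z: dense encoder output (batch, frames, dim)
        logits = self.logits(z)
        if self.training:
            # Differentiable hard selection via straight-through Gumbel-Softmax.
            onehot = F.gumbel_softmax(logits, tau=tau, hard=True)
        else:
            onehot = F.one_hot(logits.argmax(-1), logits.size(-1)).float()
        ids = onehot.argmax(-1)            # discrete token ids, usable for BERT pre-training
        q = onehot @ self.codebook.weight  # quantized continuous representation
        return q, ids

quantizer = GumbelQuantizer()
q, ids = quantizer(torch.randn(2, 100, 512))
```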

XLS-R: Self-Supervised Cross-Lingual Speech Representation Learning at Scale
Speech processing tasks should be supported for a wide range of languages beyond English.
XLS-R
- A cross-lingual speech representation learning method built on Wav2Vec 2.0
- Leverages large-scale speech audio covering 128 languages
Paper (INTERSPEECH 2022): Paper Link
1. Introduction
Self-supervised learning provides general representations across diverse domains and languages; in particular, high-resource..

WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing
Speech signals carry multi-faceted information such as speaker identity, paralinguistics, and spoken content.
WavLM
- Jointly learns masked speech prediction and denoising during pre-training (see the sketch below)
- Introduces a gated relative position bias into the Transformer structure to capture the sequence ordering of the input speech
Paper (JSTSP 2022): Paper Link
1. Intro..
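
A small sketch of how the joint masked-prediction-and-denoising setup above could construct its training inputs: a distractor signal is mixed into the utterance and spans are masked, while the targets remain pseudo-labels of the original clean speech. The mixing probability, span length, frame rate, and `make_noisy_masked_input` helper are assumptions for illustration, and the gated relative position bias is not shown.

```python
# Sketch of WavLM-style noisy/overlapped input construction with span masking.
# All probabilities, span sizes, and the helper name are illustrative assumptions.
import torch

def make_noisy_masked_input(wave, distractor, mix_prob=0.2, mask_prob=0.065,
                            span=10, frame_rate=320):
    # wave, distractor: (batch, samples) raw audio of equal length.
    noisy = wave.clone()
    if torch.rand(()) < mix_prob:
        # Overlap a randomly scaled distractor utterance (or noise);
        # the training targets still describe the original `wave`.
        noisy = noisy + 0.5 * torch.rand(()) * distractor
    # Span masking over downsampled frames (e.g. the CNN output rate).
    num_frames = wave.size(1) // frame_rate
    mask = torch.zeros(wave.size(0), num_frames, dtype=torch.bool)
    starts = torch.rand(wave.size(0), num_frames) < mask_prob
    for b, f in starts.nonzero().tolist():
        mask[b, f:f + span] = True
    return noisy, mask  # predict clean-speech pseudo-labels where mask is True

noisy, mask = make_noisy_masked_input(torch.randn(2, 48000), torch.randn(2, 48000))
```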

DistilHuBERT: Speech Representation Learning by Layer-Wise Distillation of Hidden-Unit BERT
Existing self-supervised speech representation learning methods require large memory and high pre-training cost.
DistilHuBERT
- A multi-task learning framework that directly distills hidden representations from HuBERT (see the sketch below)
- Reduces the HuBERT model size by 75%
Paper (ICASSP 2022): Paper Link
1. Introduction
For speech representations such as Wav2Vec, Self-Sup..
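
A minimal sketch of the layer-wise distillation idea above: a shallow student with several prediction heads regresses hidden states from selected teacher layers under a combined L1 and cosine-similarity loss. The student depth, the choice of three distilled layers, and the loss weighting are assumptions for illustration, not the paper's exact setup.

```python
# Sketch of layer-wise distillation with multiple prediction heads.
# Sizes, layer choices, and the loss weighting are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DistilStudent(nn.Module):
    def __init__(self, dim=768, num_heads=3):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)  # shallow student
        self.heads = nn.ModuleList([nn.Linear(dim, dim) for _ in range(num_heads)])

    def forward(self, x):
        h = self.encoder(x)
        return [head(h) for head in self.heads]  # one prediction per teacher layer

def distill_loss(student_outs, teacher_hiddens, lam=1.0):
    # Combine L1 regression with a cosine-similarity term per distilled layer.
    loss = 0.0
    for pred, target in zip(student_outs, teacher_hiddens):
        l1 = F.l1_loss(pred, target)
        cos = F.cosine_similarity(pred, target, dim=-1)
        loss = loss + l1 - lam * torch.log(torch.sigmoid(cos)).mean()
    return loss

student = DistilStudent()
feats = torch.randn(2, 100, 768)  # stand-in for CNN-extracted features
teacher_hiddens = [torch.randn(2, 100, 768) for _ in range(3)]  # e.g. teacher layers 4/8/12
loss = distill_loss(student(feats), teacher_hiddens)
loss.backward()
```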

HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units
Self-supervised speech representation learning has to deal with the following issues:
- Each input utterance contains multiple sound units
- No lexicon of the input sound units is available during the pre-training phase
- Sound units have variable length without explicit segmentation
HuBERT (see the sketch below)
- To provide aligned target labels for a BERT-like prediction loss, offline clus..
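
A minimal sketch of masked pseudo-label prediction in the spirit of the summary above: offline $k$-means over acoustic features supplies frame-level cluster ids, and an encoder (a GRU stand-in here rather than HuBERT's CNN + Transformer) is trained with cross-entropy only on masked frames. The feature type, cluster count, and masking scheme are illustrative assumptions.

```python
# Sketch of masked prediction against offline-clustered pseudo-labels.
# MFCC stand-ins, cluster count, and the GRU encoder are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F
from sklearn.cluster import KMeans

# Offline clustering step: derive discrete frame-level targets from features.
mfcc = torch.randn(2, 100, 39)  # stand-in for real MFCC features
kmeans = KMeans(n_clusters=100, n_init=10).fit(mfcc.reshape(-1, 39).numpy())
targets = torch.as_tensor(kmeans.labels_, dtype=torch.long).view(2, 100)

class MaskedPredictor(nn.Module):
    def __init__(self, dim=39, hidden=256, num_clusters=100):
        super().__init__()
        self.mask_emb = nn.Parameter(torch.zeros(dim))   # learned mask embedding
        self.encoder = nn.GRU(dim, hidden, batch_first=True)  # stand-in encoder
        self.head = nn.Linear(hidden, num_clusters)

    def forward(self, x, mask):
        # Replace masked frames with the mask embedding, then encode.
        x = torch.where(mask.unsqueeze(-1), self.mask_emb.expand_as(x), x)
        h, _ = self.encoder(x)
        return self.head(h)

model = MaskedPredictor()
mask = torch.rand(2, 100) < 0.5                  # random frame-level mask
logits = model(mfcc, mask)
# Cross-entropy only on masked frames, as in the masked-prediction objective.
loss = F.cross_entropy(logits[mask], targets[mask])
loss.backward()
```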