[Paper Review] Wav2Vec 2.0: A Framework for Self-Supervised Learning of Speech Representations

feVeRin 2025. 3. 23. 08:52

Wav2Vec 2.0: A Framework for Self-Supervised Learning of Speech Representations

Powerful representations can be learned from speech audio alone, and fine-tuning on transcribed speech then improves speech recognition performance
Wav2Vec 2.0
- Masks the speech input in the latent space
- Solves a contrastive task defined over a quantization of the latent representations, which is learned jointly
Paper (NeurIPS 2020) : Paper Link

1. Introduction

In speech recognition, labeled data is much harder to obtain than unlabeled data
- Earlier work such as VQ-Wav2Vec used a self-attention model to learn contextualized representations
- Meanwhile, recent self-supervised learning learns general data representations from unlabeled examples and then fine-tunes the model on labeled data

-> Wav2Vec 2.0 is therefore proposed to learn self-supervised representations directly from raw audio data

Wav2Vec 2.0
- Encodes speech audio with a multi-layer convolutional neural network and masks the resulting latent speech representations, similar to masked language modeling
  - The latent representations are then fed to a transformer network that builds contextualized representations, and the model is trained with a contrastive task that distinguishes the true latent from distractors
- In addition, discrete speech units are learned with a Gumbel softmax to represent the latent representations in the contrastive task
  - After pre-training on unlabeled speech, the model is fine-tuned on labeled data with a Connectionist Temporal Classification (CTC) loss

< Overview of Wav2Vec 2.0 >

A framework for learning representations from raw audio data through self-supervised learning
As a result, it achieves better performance than prior methods in the LibriSpeech and TIMIT experiments

2. Method

Wav2Vec 2.0 consists of a multi-layer convolutional feature encoder $f:\mathcal{X}\mapsto \mathcal{Z}$ that takes raw audio $\mathcal{X}$ as input and outputs latent speech representations $\mathbf{z}_{1},...,\mathbf{z}_{T}$ for $T$ time-steps
- A transformer $g: \mathcal{Z}\mapsto \mathcal{C}$ then produces representations $\mathbf{c}_{1},...,\mathbf{c}_{T}$ that capture information from the entire sequence
- The feature encoder output is also discretized to $\mathbf{q}_{t}$ by a quantization module $\mathcal{Z}\mapsto \mathcal{Q}$ to represent the targets of the self-supervised objective
- Compared to VQ-Wav2Vec, Wav2Vec 2.0 builds context representations over continuous speech representations, and self-attention captures dependencies over the entire sequence of latent representations end-to-end

- Feature Encoder

The encoder consists of several blocks, each containing a temporal convolution followed by layer normalization and a GELU activation
- The raw waveform input to the encoder is normalized to zero mean and unit variance
- The total stride of the encoder determines the number of time-steps $T$ that are input to the transformer
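
The relation between the encoder stride and $T$ can be checked with a few lines of arithmetic. The sketch below is only illustrative: the kernel widths, strides, and 16 kHz sampling rate are assumptions that this post does not state (they are the values commonly cited for the paper's encoder), and the helper names are hypothetical.

```python
# Assumed layer layout (not given in this post): 7 unpadded temporal convolutions.
KERNELS = (10, 3, 3, 3, 3, 2, 2)
STRIDES = (5, 2, 2, 2, 2, 2, 2)
SAMPLE_RATE = 16_000  # assumed sampling rate in Hz


def normalize(waveform):
    """Scale a list of samples to zero mean and unit variance."""
    mean = sum(waveform) / len(waveform)
    var = sum((x - mean) ** 2 for x in waveform) / len(waveform)
    std = (var + 1e-7) ** 0.5  # small epsilon guards against silent input
    return [(x - mean) / std for x in waveform]


def num_frames(num_samples):
    """Number of encoder time-steps T produced for num_samples input samples."""
    length = num_samples
    for kernel, stride in zip(KERNELS, STRIDES):
        length = (length - kernel) // stride + 1
    return length


def total_stride():
    """Product of all layer strides, in input samples per output time-step."""
    product = 1
    for stride in STRIDES:
        product *= stride
    return product


def receptive_field():
    """Number of input samples that influence a single output time-step."""
    field, jump = 1, 1
    for kernel, stride in zip(KERNELS, STRIDES):
        field += (kernel - 1) * jump
        jump *= stride
    return field


print(total_stride())           # 320 samples, i.e. 20 ms at 16 kHz
print(receptive_field())        # 400 samples, i.e. 25 ms at 16 kHz
print(num_frames(SAMPLE_RATE))  # 49 time-steps for one second of audio
```

Under these assumptions, one second of audio yields 49 latent vectors, which is the sequence length the transformer sees.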

- Contextualized Representations with Transformers

The feature encoder output is passed to a context network that follows the transformer architecture
- Instead of fixed positional embeddings that encode absolute positional information, the paper uses a convolutional layer that acts as a relative positional embedding
- The convolution output, followed by a GELU, is added to the inputs, and layer normalization is then applied

- Quantization Module

For self-supervised training, the paper discretizes the feature encoder output $\mathbf{z}$ into a finite set of speech representations via product quantization
- Product quantization chooses quantized representations from multiple codebooks and concatenates them
  1. Suppose there are $G$ codebooks (groups), each with $V$ entries $e\in\mathbb{R}^{V\times d/G}$
  2. One entry is chosen from each codebook, the resulting vectors $e_{1},...,e_{G}$ are concatenated, and a linear transformation $\mathbb{R}^{d}\mapsto\mathbb{R}^{f}$ is applied to obtain $\mathbf{q}\in\mathbb{R}^{f}$
- The Gumbel softmax makes it possible to choose discrete codebook entries in a fully differentiable way
  - Wav2Vec 2.0 therefore uses the straight-through estimator with $G$ hard Gumbel softmax operations
- The feature encoder output $\mathbf{z}$ is mapped to logits $\mathbf{l}\in\mathbb{R}^{G\times V}$, and the probability of choosing the $v$-th codebook entry for group $g$ is:
  (Eq. 1) $p_{g,v}=\frac{\exp((l_{g,v}+n_{v})/\tau)}{\sum_{k=1}^{V}\exp((l_{g,k}+n_{k})/\tau)}$
  - $\tau$ : non-negative temperature, $n=-\log(-\log(u))$ : Gumbel noise, $u$ : uniform sample from $\mathcal{U}(0,1)$
- In the forward pass, codeword $i$ is chosen by $i=\arg\max_{j}p_{g,j}$, and in the backward pass the true gradient of the Gumbel softmax output is used
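
The forward pass of this quantizer can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the function names, the toy codebooks, and the clamping of $u$ are hypothetical choices, and both the final linear transformation and the straight-through backward pass are left out because they need an autodiff framework.

```python
import math
import random


def gumbel_softmax(logits, tau, rng):
    """Eq. 1 for one group: softmax((l + n) / tau) with Gumbel noise n."""
    noisy = []
    for logit in logits:
        # Clamp u away from 0 and 1 so that both logarithms stay finite.
        u = min(max(rng.random(), 1e-10), 1.0 - 1e-10)
        noisy.append((logit - math.log(-math.log(u))) / tau)
    peak = max(noisy)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in noisy]
    total = sum(exps)
    return [e / total for e in exps]


def quantize(logits, codebooks, tau=1.0, rng=random):
    """Pick one entry per codebook with a hard argmax and concatenate them.

    logits:    G lists of V scores
    codebooks: G lists of V entries, each entry a list of d/G floats
    """
    indices, concatenated = [], []
    for group_logits, codebook in zip(logits, codebooks):
        probs = gumbel_softmax(group_logits, tau, rng)
        index = max(range(len(probs)), key=probs.__getitem__)  # forward: argmax
        indices.append(index)
        concatenated.extend(codebook[index])
    return indices, concatenated


# Toy setup (hypothetical values): G=2 groups, V=3 entries, d/G=2 dimensions.
codebooks = [
    [[0.0, 0.1], [1.0, 1.1], [2.0, 2.1]],
    [[5.0, 5.1], [6.0, 6.1], [7.0, 7.1]],
]
logits = [[0.2, 3.0, -1.0], [1.5, 0.0, 0.3]]
indices, q = quantize(logits, codebooks, tau=0.5, rng=random.Random(0))
print(indices, q)  # one index per group and a concatenated vector of length d=4
```

Because the noise is random, the chosen indices vary from call to call; a lower $\tau$ makes the choice follow the largest logit more closely.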

3. Training

To pre-train Wav2Vec 2.0, the paper masks a certain proportion of time-steps in the latent feature encoder space

- Masking

Before the context network, a proportion of the feature encoder outputs (time-steps) is masked and replaced with a trained feature vector shared between all masked time-steps
- The inputs to the quantization module are not masked
- Specifically, the latent speech representations output by the encoder are masked as follows:
  1. Randomly sample a certain proportion $p$ of all time-steps, without replacement, to serve as starting indices
  2. For each sampled index, mask the subsequent $M$ consecutive time-steps
    - Spans may overlap
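
The two masking steps above can be sketched as follows. The default values `p=0.065` and `span=10` are assumptions (this post does not give them), the function names are hypothetical, and truncating spans that run past the end of the sequence is a simplification chosen for this example.

```python
import random


def sample_mask(num_steps, p=0.065, span=10, rng=random):
    """Return a list of booleans, True where a time-step is masked."""
    num_starts = round(p * num_steps)
    # Step 1: sample starting indices without replacement.
    starts = rng.sample(range(num_steps), num_starts)
    mask = [False] * num_steps
    # Step 2: mask the following `span` time-steps for every start.
    for start in starts:
        for t in range(start, min(start + span, num_steps)):
            mask[t] = True  # overlapping spans simply stay masked
    return mask


def apply_mask(latents, mask, mask_vector):
    """Replace masked latent vectors with the shared mask vector."""
    return [mask_vector if masked else z for z, masked in zip(latents, mask)]


mask = sample_mask(num_steps=200, rng=random.Random(0))
print(sum(mask) / len(mask))  # fraction of masked time-steps for this draw
```

Since spans overlap, the masked fraction is smaller than `p * span`; only the context network input is replaced, while the quantizer still sees the unmasked latents.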

- Objective

During pre-training, speech audio representations are learned by solving a contrastive task $\mathcal{L}_{m}$
- The task requires identifying the true quantized latent speech representation for a masked time-step within a set of distractors
- This is augmented by a codebook diversity loss $\mathcal{L}_{d}$ that encourages the model to use the codebook entries equally often:
  (Eq. 2) $\mathcal{L}=\mathcal{L}_{m}+\alpha\mathcal{L}_{d}$
  - $\alpha$ : hyperparameter
Contrastive Loss
- Suppose the context network output $\mathbf{c}_{t}$ centered over masked time-step $t$ is given
- The model needs to identify the true quantized latent speech representation $\mathbf{q}_{t}$ in a set $\mathbf{Q}_{t}$ of $K+1$ quantized candidate representations $\tilde{\mathbf{q}}\in\mathbf{Q}_{t}$, which contains $\mathbf{q}_{t}$ and $K$ distractors
  - Distractors are uniformly sampled from other masked time-steps of the same utterance
- The contrastive loss is then:
  (Eq. 3) $\mathcal{L}_{m}=-\log \frac{\exp(\text{sim}(\mathbf{c}_{t},\mathbf{q}_{t})/\kappa)}{\sum_{\tilde{\mathbf{q}}\sim\mathbf{Q}_{t}}\exp(\text{sim}(\mathbf{c}_{t},\tilde{\mathbf{q}})/\kappa)}$
  - $\text{sim}(\mathbf{a},\mathbf{b})=\mathbf{a}^{\top}\mathbf{b}/(\|\mathbf{a}\|\,\|\mathbf{b}\|)$ : cosine similarity between the context representation and the quantized latent speech representation
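
Eq. 3 for a single masked time-step can be written out as a short sketch. The function names and the default `kappa=0.1` are assumptions for illustration (the temperature value is not stated in this post), and vectors are plain Python lists to keep the example dependency-free.

```python
import math


def cosine_similarity(a, b):
    """sim(a, b) = a.b / (|a| |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def contrastive_loss(context, target, distractors, kappa=0.1):
    """Eq. 3 for one masked time-step.

    context:     c_t, the context network output
    target:      q_t, the true quantized latent
    distractors: K quantized latents from other masked time-steps
    """
    candidates = [target] + list(distractors)
    scores = [cosine_similarity(context, q) / kappa for q in candidates]
    peak = max(scores)  # log-sum-exp with the max subtracted for stability
    log_denominator = peak + math.log(sum(math.exp(s - peak) for s in scores))
    return log_denominator - scores[0]  # equals -log softmax probability of q_t


# The loss is near zero when c_t points at q_t and away from the distractor.
print(contrastive_loss([1.0, 0.0], [1.0, 0.0], [[-1.0, 0.0]]))
```

If all $K+1$ candidates are equally similar to $\mathbf{c}_{t}$, the loss equals $\log(K+1)$, which is the value of a model that guesses uniformly.
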
Diversity Loss
- The contrastive task depends on the codebook to represent both positive and negative examples, so the diversity loss $\mathcal{L}_{d}$ is introduced to increase the use of the quantized codebook representations
- It encourages equal use of the $V$ entries in each of the $G$ codebooks by maximizing the entropy of the averaged softmax distribution $\bar{p}_{g}$ over the entries of each codebook, where the softmax of the logits $\mathbf{l}$ is averaged across a batch of utterances:
  (Eq. 4) $\mathcal{L}_{d}=\frac{1}{GV}\sum_{g=1}^{G}-H(\bar{p}_{g})=\frac{1}{GV}\sum_{g=1}^{G}\sum_{v=1}^{V}\bar{p}_{g,v}\log \bar{p}_{g,v}$
  - The softmax distribution contains neither Gumbel noise nor a temperature; the implementation instead maximizes the codebook perplexity, i.e. minimizes $\frac{GV-\sum_{g=1}^{G}\exp(-\sum_{v=1}^{V}\bar{p}_{g,v}\log \bar{p}_{g,v})}{GV}$, which the paper describes as equivalent
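
Both forms of the diversity term can be compared on toy distributions. The sketch below is an illustration only: the function names are hypothetical, and the averaged distributions are passed in as nested lists of shape $G\times V$.

```python
import math


def softmax(logits):
    """Plain softmax, without Gumbel noise or temperature."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def averaged_distribution(batch_logits):
    """Average the softmax over a batch; batch_logits is B lists of V logits."""
    probs = [softmax(frame) for frame in batch_logits]
    num_entries = len(probs[0])
    return [sum(p[v] for p in probs) / len(probs) for v in range(num_entries)]


def diversity_loss(avg_probs):
    """Eq. 4: negative entropy of each group, summed and divided by G*V."""
    num_groups, num_entries = len(avg_probs), len(avg_probs[0])
    neg_entropy = sum(
        p * math.log(p) for group in avg_probs for p in group if p > 0.0
    )
    return neg_entropy / (num_groups * num_entries)


def perplexity_penalty(avg_probs):
    """(GV - sum_g exp(H(p_g))) / GV, the perplexity-based form."""
    num_groups, num_entries = len(avg_probs), len(avg_probs[0])
    perplexity = sum(
        math.exp(-sum(p * math.log(p) for p in group if p > 0.0))
        for group in avg_probs
    )
    total = num_groups * num_entries
    return (total - perplexity) / total


uniform = [[0.25] * 4, [0.25] * 4]                      # every entry used equally
collapsed = [[1.0, 0.0, 0.0, 0.0], [1.0, 0.0, 0.0, 0.0]]  # one entry per group
print(diversity_loss(uniform), diversity_loss(collapsed))
print(perplexity_penalty(uniform), perplexity_penalty(collapsed))
```

Uniform usage gives the lowest value of both terms and a collapsed codebook gives the highest, so minimizing either one pushes the model toward using all $V$ entries of every codebook.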

- Fine-Tuning

The pre-trained model is fine-tuned for speech recognition by adding a randomly initialized linear projection on top of the context network, mapping into $C$ classes that represent the vocabulary of the task
- For LibriSpeech, for example, there are 29 tokens for the character targets plus a word boundary token
- The model is optimized by minimizing a CTC loss, and a modified SpecAugment that masks time-steps and channels is applied during training to delay overfitting
  - This improves the final error rate, especially on datasets with few labeled examples

4. Experiments

- Settings

Dataset : LibriSpeech, TIMIT
Comparisons : Discrete BERT, CTC Transformer, S2S Transformer, ContextNet, Conformer

- Results

Low-Resource Labeled Data Evaluation
- Even with only 10 minutes of labeled data, Wav2Vec 2.0 achieves the lowest WER among the compared methods

High-Resource Labeled Data Evaluation on LibriSpeech
- With the full 960 hours of labeled data, Wav2Vec 2.0 also performs best

Phone Recognition on TIMIT
- On the TIMIT dataset, it achieves a Phoneme Error Rate (PER) of 7.4/8.3 on dev/test

Ablations
- The best performance is obtained with continuous inputs and quantized targets
