[Paper Review] Wav2Vec-Aug: Improved Self-Supervised Training with Limited Data

feVeRin 2025. 6. 2. 17:33

Wav2Vec-Aug: Improved Self-Supervised Training with Limited Data

Self-Supervised Learning of speech representations remains limited because unlabeled data is scarce for many languages
Wav2Vec-Aug
- Applies data augmentation to Wav2Vec 2.0 pre-training
- Applies Self-Supervised Learning to domains with limited available data
Paper (INTERSPEECH 2022) : Paper Link

1. Introduction

Self-Supervised Learning (SSL) can learn representations from unlabeled speech
- This approach needs large amounts of unlabeled audio, but such data is hard to obtain for rare languages
- Data augmentation, on the other hand, is effective when labeled or unlabeled data is limited
  - A representative case is Contrastive Predictive Coding (CPC), which predicts the future from past context and has benefited from augmentation; the same idea can be used to improve Transformer-based bi-directional models such as Wav2Vec 2.0

-> The paper therefore proposes Wav2Vec-Aug, which applies data augmentation to Wav2Vec 2.0

Wav2Vec-Aug
- Applies a data augmentation strategy on top of Wav2Vec 2.0
- Replaces convolutional layers in the feature encoder with lightweight and dynamic convolutions, and replaces the Transformer layers with Conformer layers
- Additionally applies MLP projections to the context and latent vectors of Wav2Vec 2.0

< Overview of Wav2Vec-Aug >

An SSL speech model that applies architectural replacements and data augmentation on top of Wav2Vec 2.0
As a result, it outperforms the Wav2Vec 2.0 baseline

2. Background

Wav2Vec-Aug builds on Wav2Vec 2.0, which maps raw audio $\mathbf{x}\in\mathcal{X}$ to latent feature representations $\mathbf{c}_{1},...,\mathbf{c}_{T}$
- Wav2Vec 2.0 first applies a convolutional feature encoder $f:\mathcal{X}\mapsto \mathcal{Z}$ that maps the input audio to latent speech representations $\mathbf{z}_{1},...,\mathbf{z}_{T}$
  1. These latent representations are then passed to a Transformer model $g:\mathcal{Z}\mapsto \mathcal{C}$, which outputs context representations $\mathbf{c}_{1},...,\mathbf{c}_{T}$
  2. Each $\mathbf{z}_{t}$ represents 25ms of audio with a stride of 20ms, and the Transformer architecture follows BERT
- During pre-training, the latent representations are quantized to $\mathbf{q}_{1},...,\mathbf{q}_{T}$ by a quantization module $\mathcal{Z}\mapsto \mathcal{Q}$ so that they can serve as targets
  - The quantization module uses Gumbel Softmax to choose one of $V=320$ entries from each of $G=2$ codebooks, and the chosen entries are concatenated to obtain $\mathbf{q}$
- The model is trained so that, for each masked timestep, it uses $\mathbf{c}_{t}$ to identify the true quantized latent $\mathbf{q}_{t}$ within a candidate set $\mathbf{Q}_{t}$ containing $K=100$ distractors sampled from other masked timesteps
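
The codebook selection described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the Wav2Vec 2.0 implementation: it covers only the hard (argmax) forward pass of Gumbel Softmax and leaves out the straight-through gradient used in training. The names `gumbel_argmax` and `quantize`, the temperature `tau`, and the entry dimension `entry_dim = 4` are hypothetical choices made for the example.

```python
import math
import random


def gumbel_argmax(logits, rng, tau=1.0):
    """Pick an index by adding Gumbel noise to the logits and taking the argmax."""
    noisy = []
    for logit in logits:
        # Clamp u away from 0 and 1 so both logarithms stay finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        gumbel = -math.log(-math.log(u))
        noisy.append((logit + gumbel) / tau)
    return max(range(len(noisy)), key=noisy.__getitem__)


def quantize(logits_per_group, codebooks, rng):
    """Choose one entry per codebook and concatenate the chosen entries into q."""
    q, chosen = [], []
    for logits, codebook in zip(logits_per_group, codebooks):
        idx = gumbel_argmax(logits, rng)
        chosen.append(idx)
        q.extend(codebook[idx])
    return q, chosen


# G and V follow the text; entry_dim is a small hypothetical value.
G, V, entry_dim = 2, 320, 4
rng = random.Random(0)
codebooks = [[[rng.gauss(0.0, 1.0) for _ in range(entry_dim)] for _ in range(V)]
             for _ in range(G)]
# In the real model these logits would be predicted from z_t; here they are random.
logits = [[rng.gauss(0.0, 1.0) for _ in range(V)] for _ in range(G)]
q, chosen = quantize(logits, codebooks, rng)
print("chosen entries:", chosen, "dim of q:", len(q))
```

With $G$ codebooks of $V$ entries each, the concatenation can express $V^{G}$ distinct targets while storing only $G\cdot V$ entries.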

3. Method

- Data Augmentation

Data augmentation of the input audio is useful in small-data settings
- In particular, the paper considers three data augmentation strategies for self-supervised learning: Additive Augmentation, Pitch Shift, and Reverberation
  1. Additive Augmentation adds a noise signal $\mathbf{x}'$ to the input signal $\mathbf{x}$
    - The noise signal is chosen uniformly at random from a large collection of audio signals
    - The chosen noise signal is added at an SNR of $s\sim \text{Uniform}(s_{0},s_{1})$, where $s_{0},s_{1}$ are hyperparameters
  2. Pitch Shift raises or lowers the pitch of the input audio by a random factor $f$
    - Wav2Vec-Aug samples the pitch shift factor from a Gaussian distribution $f\sim \mathcal{N}(0,\sigma_{p})$, where $\sigma_{p}$ is a hyperparameter
  3. Reverberation simulates far-field speech by convolving the input audio signal with a randomly generated Room Impulse Response (RIR) signal
    - The room size parameter $r$ is chosen randomly by sampling $r'\sim\mathcal{N}(0,\sigma_{r})$ and then setting $r=\min(|r'|,100)$
- During pre-training, for each data sample, the paper chooses independently with probability $p$ whether to apply an augmentation method
  1. First, the input audio $\mathbf{x}$ is duplicated into a source audio $\mathbf{x}^{(s)}$ and a target audio $\mathbf{x}^{(t)}$
  2. Then $\mathbf{x}^{(s)}$ is used to produce the context vectors, and $\mathbf{x}^{(t)}$ is used to produce the target latent representations $\mathbf{q}_{t}$
    - In particular, applying different augmentations to the source and target audio is beneficial
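
The sampling rules above can be sketched as follows. This is a rough illustration, not the authors' code, and it rests on several assumptions: the SNR is taken to be in dB, $\sigma_{p}$ and $\sigma_{r}$ are treated as standard deviations, and each of the three methods is switched on independently with probability $p$. The pitch-shift and reverberation effects themselves are not implemented; only their parameters are drawn. The function names `mix_at_snr` and `sample_augmentation_params` are hypothetical.

```python
import math
import random


def mix_at_snr(signal, noise, snr_db):
    """Scale `noise` to the requested SNR (assumed dB) and add it to `signal`.

    Both inputs are assumed to be equal-length lists of samples.
    """
    p_signal = sum(v * v for v in signal) / len(signal)
    p_noise = sum(v * v for v in noise) / len(noise)
    if p_noise == 0.0:
        return list(signal)  # silent noise clip: nothing to add
    # Target noise power is p_signal / 10^(snr_db / 10).
    scale = math.sqrt(p_signal / (p_noise * 10.0 ** (snr_db / 10.0)))
    return [s + scale * n for s, n in zip(signal, noise)]


def sample_augmentation_params(rng, p=0.5, s0=10.0, s1=15.0,
                               sigma_p=50.0, sigma_r=60.0):
    """Decide per method, with probability p, whether it is applied, and draw its parameter."""
    params = {}
    if rng.random() < p:
        params["snr_db"] = rng.uniform(s0, s1)          # s ~ Uniform(s0, s1)
    if rng.random() < p:
        params["pitch_shift"] = rng.gauss(0.0, sigma_p)  # f ~ N(0, sigma_p)
    if rng.random() < p:
        r_prime = rng.gauss(0.0, sigma_r)                # r' ~ N(0, sigma_r)
        params["room_size"] = min(abs(r_prime), 100.0)   # r = min(|r'|, 100)
    return params


rng = random.Random(0)
# Source and target copies of the same utterance get independent draws,
# so they usually end up with different augmentations.
source_params = sample_augmentation_params(rng)
target_params = sample_augmentation_params(rng)
print("source:", source_params)
print("target:", target_params)
```

The default values in the signature mirror the hyperparameters reported in the Experiments section of this post.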

- Architectural Improvements

Lightweight and Dynamic Convolution
- Wav2Vec 2.0 uses a purely convolutional feature encoder to extract features from the input audio signal
- Using lightweight and dynamic convolutions here can give better performance than standard convolutions
  1. A lightweight convolution is a depth-wise separable convolution that shares weights across groups of $m$ channels
    - $m$ : hyperparameter
  2. A dynamic convolution builds on the lightweight convolution and dynamically computes the convolution kernel as a function of the input at the given timestep
  3. These lightweight and dynamic convolution layers are more computationally efficient than a Transformer, which makes them suitable for the feature encoder that operates over many timesteps
- The paper therefore replaces the last $k$ convolutional layers of the feature encoder with lightweight or dynamic convolutions
  - In practice, $k=2$ improves performance for both lightweight and dynamic convolutions
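
The difference between the two layer types can be sketched in plain Python. This is a simplified illustration under assumptions, not the paper's encoder: it uses stride 1 with zero padding, and it softmax-normalizes each kernel over its width, which is the usual formulation of lightweight convolution but is not spelled out in this post. The names `lightweight_conv`, `dynamic_conv`, and the `proj` weight layout are hypothetical.

```python
import math


def softmax(xs):
    top = max(xs)
    exps = [math.exp(v - top) for v in xs]
    total = sum(exps)
    return [e / total for e in exps]


def _conv_frame(x, t, kernels, m):
    """Output frame at time t; channel c uses kernels[c // m], so m channels share one kernel."""
    num_frames, width = len(x), len(kernels[0])
    pad = width // 2
    frame = []
    for c in range(len(x[0])):
        w = kernels[c // m]
        frame.append(sum(w[j] * x[t + j - pad][c]
                         for j in range(width) if 0 <= t + j - pad < num_frames))
    return frame


def lightweight_conv(x, weights, m):
    """Depth-wise convolution with one fixed kernel per group of m channels.

    x: list of T frames with d channels each; weights: d // m kernels.
    """
    kernels = [softmax(w) for w in weights]
    return [_conv_frame(x, t, kernels, m) for t in range(len(x))]


def dynamic_conv(x, proj, m):
    """Same as above, but the kernels are recomputed from the input frame at every timestep.

    proj[g][j] is a length-d weight vector: tap j of group g at time t is dot(proj[g][j], x[t]).
    """
    out = []
    for t in range(len(x)):
        kernels = [softmax([sum(a * b for a, b in zip(tap, x[t])) for tap in taps])
                   for taps in proj]
        out.append(_conv_frame(x, t, kernels, m))
    return out


# Toy input: T=5 frames, d=4 channels, m=2 channels per group, kernel width 3.
x = [[float(t), float(t), 1.0, 0.5] for t in range(5)]
print(lightweight_conv(x, [[0.1, 0.2, 0.3], [0.0, 0.0, 0.0]], m=2)[2])
proj = [[[0.1, 0.0, 0.0, 0.0] for _ in range(3)] for _ in range(2)]
print(dynamic_conv(x, proj, m=2)[2])
```

The lightweight layer stores only $d/m$ kernels in total, while the dynamic layer stores a projection and produces a different kernel at each timestep.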
Conformer
- The Transformer part of Wav2Vec 2.0 uses multi-head self-attention blocks
- However, Conformer blocks can achieve better performance than the Transformer architecture
  1. A Conformer block consists of multi-head self-attention, depth-wise convolution, and feed-forward layers
  2. This lets the Conformer make effective use of both global and local interactions
Context and Target MLPs
- Additionally, the paper adds Multi-Layer Perceptrons (MLPs) to modify how the loss value is computed
- To do so, it introduces a ContextMLP and a TargetMLP, applied to the context vector $\mathbf{c}$ and the latent vector $\mathbf{q}$ respectively:
  (Eq. 1) $\mathbf{c}'=\text{CMLP}(\mathbf{c}),\,\,\,\mathbf{q}'=\text{TMLP}(\mathbf{q})$
- Wav2Vec 2.0 is trained with a contrastive loss to identify the true $\mathbf{q}_{t}$ from $\mathbf{c}$ within a set of $K$ distractors $\tilde{\mathbf{q}}$
  - The contrastive loss takes as input the cosine similarity between the context vector $\mathbf{c}$ and $\mathbf{q}_{t},\tilde{\mathbf{q}}$
  - The paper therefore applies $\text{CMLP}$ to $\mathbf{c}$ and $\text{TMLP}$ to $\mathbf{q}_{t},\tilde{\mathbf{q}}$ before computing the cosine similarity
    - This lets the model learn a complex, non-linear similarity rather than a plain cosine similarity
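
The modified loss can be sketched for a single masked timestep. This is a minimal illustration, not the paper's implementation: the MLP shape (two bias-free layers with ReLU) is a hypothetical choice, and the temperature of 0.1 is the Wav2Vec 2.0 setting rather than a value stated in this post. The names `mlp`, `cosine`, and `contrastive_loss` are made up for the example.

```python
import math


def mlp(v, w1, w2):
    """Two-layer perceptron with ReLU; w1 and w2 are lists of weight rows (no bias)."""
    hidden = [max(0.0, sum(w * x for w, x in zip(row, v))) for row in w1]
    return [sum(w * h for w, h in zip(row, hidden)) for row in w2]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + 1e-8)


def contrastive_loss(c, q_true, distractors, cmlp, tmlp, temperature=0.1):
    """Negative log-probability of picking q_true among the candidates, given c."""
    c_proj = mlp(c, *cmlp)                        # c' = CMLP(c)
    candidates = [mlp(q, *tmlp)                   # q' = TMLP(q), for target and distractors
                  for q in [q_true] + distractors]
    logits = [cosine(c_proj, q) / temperature for q in candidates]
    top = max(logits)
    log_norm = top + math.log(sum(math.exp(v - top) for v in logits))
    return log_norm - logits[0]                   # the true target sits at index 0


# Toy example with 2-dimensional vectors and hypothetical weights.
cmlp = ([[1.0, 0.5], [-0.5, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
tmlp = ([[1.0, 0.0], [0.0, 1.0]], [[0.5, 0.5], [-0.5, 0.5]])
loss = contrastive_loss([1.0, 0.2], [0.9, 0.1], [[0.0, 1.0], [0.3, 0.8]], cmlp, tmlp)
print("loss:", loss)
```

Replacing both MLPs with the identity recovers the plain cosine-similarity loss of Wav2Vec 2.0, so the projections only add capacity on top of the baseline objective.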

4. Experiments

- Settings

Dataset : LibriSpeech
Comparisons : Wav2Vec 2.0

- Results

Overall, Wav2Vec-Aug outperforms Wav2Vec 2.0

Data Augmentation
- For the augmentation hyperparameters, the paper uses $s_{0}=10, s_{1}=15$ for additive augmentation, $\sigma_{p}=50$ for pitch shift augmentation, and $\sigma_{r}=60$ for reverberation

Under these hyperparameter settings, the lowest WER is achieved with augmentation probability $p=0.5$

In particular, when little pre-training data is available (50 hours), applying data augmentation yields an $11\%$ WER improvement

Lightweight and Dynamic Convolution
- Replacing the convolutional layers with lightweight and dynamic convolutions yields a $5\%$ WER improvement
