[Paper Review] Wav2Vec-Switch: Contrastive Learning from Original-Noisy Speech Pairs for Robust Speech Recognition

feVeRin 2025. 6. 4. 17:25

Wav2Vec-Switch: Contrastive Learning from Original-Noisy Speech Pairs for Robust Speech Recognition

Existing self-supervised learning (SSL) frameworks do not take noise robustness into account
Wav2Vec-Switch
- Feeds an original-noisy speech pair into the Wav2Vec 2.0 network simultaneously
- Uses the quantized representations of the original and noisy speech as additional prediction targets for each other
Paper (ICASSP 2022) : Paper Link

1. Introduction

Self-supervised learning (SSL) for speech tasks can learn contextualized representations from input speech by using unlabeled data in the pre-training stage
- Fine-tuning the pre-trained model afterwards in a supervised manner on a small amount of transcribed data yields better performance
  - In particular, these SSL methods achieve strong performance on Automatic Speech Recognition (ASR)
- BUT, speech recordings in real-world applications contain background noise, so speech models need to be noise-robust
  - An enhancement/denoising module could be added to address this, but it increases the complexity of the neural network

-> Therefore, the paper proposes Wav2Vec-Switch, which improves the noise robustness of the pre-trained SSL model

Wav2Vec-Switch
- Builds on Wav2Vec 2.0 and adds an auxiliary contrastive loss to achieve noise robustness
- In particular, it takes a pair of the original speech and its noisy version as input
  - This enforces a prediction consistency constraint in the contrastive loss without increasing the complexity of the network

< Overview of Wav2Vec-Switch >

A speech SSL model that learns noise-robust contextualized representations
As a result, it achieves better performance than the existing approach

2. Method

- Wav2Vec 2.0

Wav2Vec 2.0 combines masked prediction and contrastive learning into a unified model during pre-training
- Structurally, Wav2Vec 2.0 consists of the following:
  1. The feature encoder $f:\mathcal{X}\mapsto \mathcal{Z}$ takes the raw audio waveform $\mathbf{x}\in\mathbb{R}^{T}$ as input:
    - It applies time-domain down-sampling through convolutional blocks and outputs the latent representation $Z=[\mathbf{z}_{1},...,\mathbf{z}_{T'}]$
  2. The context network $g:\mathcal{Z}\mapsto\mathcal{C}$:
    - Takes the masked $Z$ as input and outputs a contextualized representation $\mathbf{c}_{t}$ at each masked position $t$ through Transformer blocks
  3. The quantization module $h:\mathcal{Z}\mapsto \mathcal{Q}$:
    - Discretizes the unmasked $Z$ into $Q$ over a finite set of codebooks using Gumbel Softmax and product quantization
- The contrastive loss is applied at each masked position $t$ to discriminate the true quantized representation $\mathbf{q}_{t}$ (positive sample) from $K$ distractors $Q_{t}^{-}=\{\mathbf{q}_{1}^{-},...,\mathbf{q}_{K}^{-} \}$ drawn from other masked positions within the same training example:
  (Eq. 1) $\mathcal{L}^{C}(C,Q)=\sum_{t=1}^{N}\mathcal{L}_{t}^{C}(C,Q)/N$
  (Eq. 2) $\mathcal{L}_{t}^{C}(C,Q)=-\log\frac{\exp(\text{sim}(\mathbf{c}_{t},\mathbf{q}_{t}))}{\sum_{\mathbf{q}\in \{\mathbf{q}_{t}\}\cup Q_{t}^{-}}\exp(\text{sim}(\mathbf{c}_{t},\mathbf{q}))}$
  - $N$ : number of masked positions, $\text{sim}(\cdot,\cdot)$ : cosine similarity; the denominator runs over the positive sample and the $K$ distractors
- In addition, to encourage codebook utilization, Wav2Vec 2.0 introduces a diversity loss $\mathcal{L}^{D}$, computed as the negative perplexity of the Gumbel Softmax output, and the total loss is:
  (Eq. 3) $\mathcal{L}=\mathcal{L}^{C}+\alpha\mathcal{L}^{D}$
  - $\alpha$ : coefficient
- During fine-tuning, the quantization module is discarded and the feature encoder parameters are frozen
  - All other network parameters are updated with the CTC loss
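
As a rough illustration of (Eq. 1) and (Eq. 2), the sketch below computes the contrastive loss in plain Python. The function names `cosine` and `contrastive_loss`, the toy vectors, and the choice to pass the distractors explicitly are assumptions made for this example, not the paper's code. The denominator covers the positive sample and the $K$ distractors, and the temperature scaling used in Wav2Vec 2.0 is left out for brevity.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def contrastive_loss(C, Q, distractors):
    """Average over masked positions of the per-position loss in (Eq. 2).

    C[t]           : context vector c_t at masked position t
    Q[t]           : quantized target q_t (positive sample)
    distractors[t] : list of K quantized vectors from other masked positions
    """
    total = 0.0
    for c_t, q_t, negatives in zip(C, Q, distractors):
        positive = math.exp(cosine(c_t, q_t))
        # Denominator: positive sample plus the K distractors.
        denominator = positive + sum(math.exp(cosine(c_t, q)) for q in negatives)
        total += -math.log(positive / denominator)
    return total / len(C)


# Toy example (hypothetical values): 2 masked positions, K = 1 distractor each.
C = [[1.0, 0.0], [0.0, 1.0]]
Q = [[1.0, 0.0], [0.0, 1.0]]
distractors = [[Q[1]], [Q[0]]]  # each position uses the other one as distractor
loss = contrastive_loss(C, Q, distractors)
```

When each context vector is closest to its own target, as in the toy example, the loss is small; swapping the targets so that each context vector points at a distractor makes it larger.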

- Wav2Vec-Switch

In Wav2Vec 2.0 pre-training, distractors are sampled only from masked positions within the same training example
- Sampling within the same training example prevents the model from learning features irrelevant to ASR, such as speaker and environmental characteristics
- BUT, no mechanism is designed to achieve noise robustness during pre-training
  1. That is, given a noisy utterance, both the positive and negative samples used by the contrastive loss contain noise,
    - So an explicit way to differentiate noise from speech, or to learn contextualized representations that are invariant to noise, is needed
  2. Intuitively, if the contextualized representation is noise-robust, the representation of the original/noisy speech should be able to predict the target of the noisy/original speech
    - The paper builds Wav2Vec-Switch on this intuition
- Given a batch of original waveforms $X\in\mathbb{R}^{B\times T}$ with batch size $B$, $X$ is duplicated and independently sampled noise is applied to each row (example) to obtain $\tilde{X}$, the noisy version of $X$
  1. $X$ and $\tilde{X}$ are then forwarded through the Wav2Vec 2.0 network, passing through the feature encoder $f$, the context network $g$, and the quantization module $h$
  2. This yields the following four quantities:
    (Eq. 4) $C=g(f(X)),\,\,Q=h(f(X)),\,\,\tilde{C}=g(f(\tilde{X})),\,\,\tilde{Q}=h(f(\tilde{X}))$
  3. In addition to the standard contrastive loss of (Eq. 1), which takes $(C,Q)$ and $(\tilde{C},\tilde{Q})$ as input arguments, the quantized targets $Q,\tilde{Q}$ are switched to form two more tuples, $(C,\tilde{Q})$ and $(\tilde{C},Q)$
  4. As a result, the paper uses four contrastive loss terms, $\mathcal{L}^{C}(C,Q), \mathcal{L}^{C}(\tilde{C},\tilde{Q}), \mathcal{L}^{C}(C,\tilde{Q}),\mathcal{L}^{C}(\tilde{C},Q)$, and the loss is:
    (Eq. 5) $\mathcal{L}_{switch}^{C}(C,Q,\tilde{C},\tilde{Q})= \mathcal{L}^{C}(C,Q)+\mathcal{L}^{C}(\tilde{C},\tilde{Q})+\lambda\left(\mathcal{L}^{C}(C,\tilde{Q})+ \mathcal{L}^{C}(\tilde{C},Q)\right)$
    - $\lambda$ : coefficient that controls the weight of the terms computed from the switched targets (in particular, $\lambda=0$ reduces to Wav2Vec 2.0 with data augmentation)
    - $\mathcal{L}^{D}$ is added to the total loss as in (Eq. 3)
- Meanwhile, the internal state of the network must be identical for each input pair $(X,\tilde{X})$
  1. That is, the masked positions in the context network and the dropout masks of all dropout layers must be identical for $X$ and $\tilde{X}$
    - Otherwise, the representations of the original speech and its noisy version do not correspond to each other, which makes it hard to learn representations with a meaningful interpretation
  2. In practice, the paper batches $X$ and $\tilde{X}$ together and passes the resulting larger mini-batch to the network
    - Specifically, the current random state is saved before a random function is invoked
    - The saved random state is then restored immediately after the random function has been invoked on the first half of the mini-batch, and the same function is executed on the second half
  3. This guarantees that the network internal state is always identical for $X$ and $\tilde{X}$
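
The two mechanisms of this section, the switched loss of (Eq. 5) and the identical random state for both halves of the mini-batch, can be sketched together in plain Python. This is a minimal sketch under assumptions rather than the paper's implementation: `sample_mask`, `paired_masks`, `switch_loss`, and `toy_loss` are hypothetical names, Python's `random` module stands in for the training framework's random number generator, and the mask is drawn independently per frame instead of using the span masking of Wav2Vec 2.0.

```python
import random


def sample_mask(num_frames, mask_prob, rng):
    """Draw a boolean mask; True marks a masked frame (simplified sampling)."""
    return [rng.random() < mask_prob for _ in range(num_frames)]


def paired_masks(num_frames, mask_prob, rng):
    """Invoke the same random function twice, starting from the same state."""
    state = rng.getstate()  # save the state before the random function runs
    mask_original = sample_mask(num_frames, mask_prob, rng)  # first half (X)
    rng.setstate(state)     # restore it right after the first invocation
    mask_noisy = sample_mask(num_frames, mask_prob, rng)     # second half (noisy X)
    return mask_original, mask_noisy


def switch_loss(loss_fn, C, Q, C_noisy, Q_noisy, lam):
    """Eq. 5: two standard terms plus lam times the two switched-target terms."""
    standard = loss_fn(C, Q) + loss_fn(C_noisy, Q_noisy)
    switched = loss_fn(C, Q_noisy) + loss_fn(C_noisy, Q)
    return standard + lam * switched


rng = random.Random(0)
mask_original, mask_noisy = paired_masks(num_frames=20, mask_prob=0.5, rng=rng)


def toy_loss(C, Q):
    """Stand-in for the contrastive loss, used only to exercise switch_loss."""
    return sum(abs(c - q) for c, q in zip(C, Q)) / len(C)


# Toy scalars (hypothetical values) in place of real representation vectors.
C, Q = [1.0, 2.0], [1.0, 2.0]
C_noisy, Q_noisy = [1.5, 2.5], [1.0, 3.0]
loss_no_switch = switch_loss(toy_loss, C, Q, C_noisy, Q_noisy, lam=0.0)
loss_switch = switch_loss(toy_loss, C, Q, C_noisy, Q_noisy, lam=1.0)
```

Passing the loss as a callable keeps the sketch independent of how the contrastive loss itself is implemented. With `lam=0.0` only the two standard terms remain, which corresponds to the data-augmentation case noted for $\lambda=0$.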

3. Experiments

- Settings

Dataset : LibriSpeech+MUSAN, CHiME-4
Comparisons : Wav2Vec 2.0

- Results

Overall, Wav2Vec-Switch achieves the best performance

Wav2Vec-Switch also performs well under mismatched conditions

Results on Real Noisy Data
- Wav2Vec-Switch also achieves strong performance on the CHiME-4 dataset

Ablation Study
- Removing any component degrades performance
