[Paper 리뷰] Robust Data2Vec: Noise-Robust Speech Representation Learning for ASR by Combining Regression and Improved Contrastive Learning

티스토리 뷰

Paper/Representation

[Paper 리뷰] Robust Data2Vec: Noise-Robust Speech Representation Learning for ASR by Combining Regression and Improved Contrastive Learning

feVeRin 2025. 4. 11. 18:31

Robust Data2Vec: Noise-Robust Speech Representation Learning for ASR by Combining Regression and Improved Contrastive Learning

Contrastive learning과 regression task에 기반한 self-supervised pre-training method를 통해 Automatic Speech Recognition 성능을 향상할 수 있음
Robust Data2Vec
- Pre-training stage에서 contrastive learning과 regression task를 jointly optimizing
- 추가적으로 patch-based non-semantic negative sample과 positive/negative sample의 distribution을 analyze 하여 discriminative capacity를 향상
논문 (ICASSP 2023) : Paper Link

1. Introduction

Automatic Speech Recognition (ASR)을 위해 unlabled speech data에 대한 self-supervised pre-training, teacher-student scheme을 활용할 수 있음
- 대표적으로 Wav2Vec 2.0은 contrastive loss function을 사용하여 predicted sample과 positive/negative sample에 대한 self-supervised pre-training을 수행함
- 한편으로 Data2Vec은 teacher-student method를 통한 regression을 기반으로 continuous contextual representation을 얻을 수 있음
- 이때 regression과 contrastive learning task를 combining 하면 speech representation의 robustness를 개선할 수 있음

-> 그래서 Data2Vec에 regression과 contrastive learning을 모두 적용한 Robust Data2Vec을 제안

Robust Data2Vec
- Data2Vec을 기반으로 high-level noise-robust representation을 학습할 수 있는 patch-based non-semantic negative sample을 도입
- 추가적으로 positive/negative sample 간의 cosine-similarity를 활용하여 easier negative sample에 대한 incremental removal을 수행

< Overall of Robust Data2Vec >

Regression, contrastive learning task를 jointly optimize 한 noise-robust self-supervised representation
결과적으로 noisy scenario에서 기존보다 더 우수한 성능을 달성

2. Method

- Joint Training Regression and Contrastive Learning Tasks

Robust Data2Vec은 Data2Vec을 기반으로 구성됨
- 먼저 Data2Vec은 feature encoder와 multi-layer Transformer encoder로 구성되고, feature encoder는 7-convolutional layer를 가짐
  1. Teacher model은 student model에 대한 Moving Average (MA)를 통해 얻어지고 model training을 위한 target을 제공함
  2. Student model의 training parameter를 $\theta$라고 하면 teacher model parameter $\Delta$는 $\Delta\leftarrow \tau\Delta +(1-\tau)\theta$와 같이 얻어짐
    - 여기서 $\tau$는 $\tau_{0}$에서 $\tau_{e}$까지 $\tau_{n}$ 만큼 linear increase 하고 이후에는 constant 하게 유지됨
  3. 해당 strategy를 활용하면 early stage에서는 빠르게, later stage에서는 느리게 update 할 수 있음
- Student model input을 noisy speech waveform $x_{\text{noisy}}$, teacher model input을 original speech waveform $x_{\text{origin}}$이라고 하자
  1. 이때 두 input을 feature encoder에 전달하여 각각 noisy feature $z_{\text{noisy}}$와 original feature $z_{\text{origin}}$을 얻을 수 있음
  2. 이후 masked noisy/original feature를 Transformer encoder로 전달하여 각각 predicted output $c_{\text{pre}}$와 target $c_{\text{origin}}$을 output 함
    - Time step $t$에서 $l$-th layer의 Transformer encoder output은 $c_{t}^{l}$과 같음
  3. 그러면 predicted output $c_{\text{pre}}$와 averaged target $c_{\text{tar}}$ 간의 regression, contrastive loss를 얻을 수 있음
    - $c_{\text{tar}_{t}}=\frac{1}{M}\sum_{l=L-M+1}^{L}c_{\text{origin}_{t}}^{l}$은 top-$M$ layer Transformer encoder의 averaged output representation이 됨
- 결과적으로 논문의 regression loss $\mathcal{L}_{reg}$는:
  (Eq. 1) $\mathcal{L}_{reg}=\left\{\begin{matrix}
  \frac{1}{2}(c_{\text{pre}_{t}}-c_{\text{tar}_{t}})^{2}/\beta, & |c_{\text{pre}_{t}}-c_{\text{tar}_{t}}| \leq \beta \\
  (|c_{\text{pre}_{t}}-c_{\text{tar}_{t}}|-\frac{1}{2}\beta), & \text{otherwise} \\
  \end{matrix}\right.$
  - $\beta$ : squared loss, $L1$ loss 간의 transition을 control 하는 역할
- Contrastive loss $\mathcal{L}_{c}$는:
  (Eq. 2) $\mathcal{L}_{c}=-\log \frac{\exp(\text{sim}(c_{\text{pre}_{t}},c_{\text{tar}_{t}})/\kappa)}{\sum_{\tilde{c}\sim\{c_{\text{tar},c_{n},c_{ns}}\}}\exp(\text{sim}(c_{\text{pre}_{t}},\tilde{c})/\kappa)}$
- 최종적으로 total loss $\mathcal{L}_{total}$은:
  (Eq. 3) $\mathcal{L}_{total}=\mathcal{L}_{reg}+\lambda\mathcal{L}_{c}$
  - $\lambda$ : hyperparameter

- Contrastive Learning with Non-Semantic Negative Samples

Pre-trained model은 noise-perturbed input feature에 대해 robust high-level representation을 학습해야 함
- 따라서 pre-training stage에서 noisy, perturbed sample을 negative sample (non-semantic negative sample)로 구성하여 high-level robust representation을 학습할 수 있도록 함
- 이를 위해 patch-based approach를 활용하여 non-semantic negative sample을 구축함
  1. 이때 output feature는 time/dimensional axis를 따라 서로 다른 size의 patch로 slice 되고 randomly shuffle 됨
  2. 특히 frame level에서 speech negative sample을 구성하는 경우, $N$ frame의 standard negative sample과 $N$ frame의 non-semantic negative sample을 randomly select 할 수 있음
    - Standard negative sample은 original feature에서, non-semantic negative sample은 distributed feature에서 select 됨
- $c_{t}, c_{p}, c_{n}, c_{ns}$를 각각 query sample, positive sample, standard negative sample, non-semantic negative sample이라고 했을 때, non-semantic contrastive loss는:
  (Eq. 4) $ \mathcal{L}_{c}=-\log \frac{\exp(\text{sim}(c_{t},c_{p})/\kappa)}{\sum_{\tilde{c}\sim \{ c_{p},c_{n},c_{ns}\}}\exp (\text{sim}(c_{t},\tilde{c})/\kappa)} $
  - $\kappa$ : temperature, $\text{sim}$ : cosine-similarity
- Non-semantic negative sample이 exclude 되는 경우, $\mathcal{L}_{c}$는 standard contrastive loss function으로 reduce 됨

(상) Original Fbank (하) Patch-based Shuffled Feature

- Removal of the More Distinguishable Negative Samples

Distinguishable negative sample을 remove 하기 위해서는 contrastive learning에서 positive/negative sample의 distribution을 analyze 해야 함
- 이를 위해 pre-trained Data2Vec model을 활용하여 서로 다른 epoch에서 predicted-positive sample pair와 predicted-negative sample pair의 cosine similarity를 calculate 해보면:
  1. 아래 그림과 같이 pre-training epoch이 많을수록 positive/negative sample에 대해 더 discriminative 해짐
  2. 결과적으로 distinguishable negative sample의 impact는 epoch이 증가할수록 weaken 됨
- 따라서 논문은 smaller predicted negative cosine similarity score를 가진 negative sample을 remove 함
  - 즉, pre-training difficulty를 높이기 위해 $N$ negative sample 중에서 highest cosine similarity를 가지는 $k$ sample을 select 하여 robust representation을 얻을 수 있음

3. Experiments

- Settings

Dataset : CHiME-4
Comparisons : Wav2Vec 2.0, Data2Vec, HuBERT

- Results

전체적으로 Robust Data2Vec의 성능이 가장 우수함

Ablation Study
- 각 loss component를 추가할수록 WER이 향상됨

Impact of Negative Samples
- Additional non-semantic negatives를 pre-training phase에 활용하는 경우 성능을 향상할 수 있음

Visualization
- Training process에서 negative sample이 증가할수록 unit cosine similarity를 가지는 sample 수가 증가하여 discriminative capacity가 향상됨
- 이를 통해 noise-robust representation을 얻을 수 있음

Cosine Similarity (좌) predicted-negative samples (우) predicted-positive samples

'Paper > Representation' 카테고리의 다른 글

[Paper 리뷰] HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units (0)	2025.04.13
[Paper 리뷰] Data2Vec-AQC: Search for the Right Teaching Assistant in the Teacher-Student Training Setup (0)	2025.04.10
[Paper 리뷰] Data2Vec 2.0: Efficient Self-Supervised Learning with Contextualized Target Representations for Vision, Speech and Language (0)	2025.04.06
[Paper 리뷰] Data2Vec: A General Framework for Self-Supervised Learning in Speech, Vision and Language (0)	2025.04.05
[Paper 리뷰] XLSR: Unsupervised Cross-Lingual Representation Learning for Speech Recognition (0)	2025.04.04

최근에 올라온 글

최근에 달린 댓글

« 2025/04 »
일	월	화	수	목	금	토
		1	2	3	4	5
6	7	8	9	10	11	12
13	14	15	16	17	18	19
20	21	22	23	24	25	26
27	28	29	30

Total

Today

Yesterday

Let IT Begin

티스토리 뷰

[Paper 리뷰] Robust Data2Vec: Noise-Robust Speech Representation Learning for ASR by Combining Regression and Improved Contrastive Learning

Robust Data2Vec: Noise-Robust Speech Representation Learning for ASR by Combining Regression and Improved Contrastive Learning

1. Introduction

2. Method

- Joint Training Regression and Contrastive Learning Tasks

- Contrastive Learning with Non-Semantic Negative Samples

- Removal of the More Distinguishable Negative Samples

3. Experiments

- Settings

- Results

'Paper > Representation' 카테고리의 다른 글

티스토리툴바