[Paper Review] Data2Vec: A General Framework for Self-Supervised Learning in Speech, Vision and Language

feVeRin 2025. 4. 5. 11:11

Data2Vec: A General Framework for Self-Supervised Learning in Speech, Vision and Language

Self-supervised learning has mostly focused on a single modality
Data2Vec
- A self-supervised framework that applies the same learning method to speech, NLP, and vision
- Uses a standard Transformer architecture and, in a self-distillation setup, predicts latent representations of the full input data from a masked view of the input
  - Instead of modality-specific targets, it predicts contextualized latent representations that contain information from the entire input
Paper (ICML 2022) : Paper Link

1. Introduction

Self-supervised learning builds representations without human-annotated labels and is used in tasks such as Natural Language Processing (NLP), speech processing, and computer vision
- BUT, most self-supervised algorithms focus on an individual modality, which leads to modality-specific designs and learning biases
- For example, speech models such as Wav2Vec 2.0 have no vocabulary of speech units, analogous to words in NLP, over which a self-supervised learning task can be defined
- Learning biases, in turn, can hinder generalization to other modalities

-> Data2Vec is therefore proposed as a general self-supervised learning framework that works across modalities such as images, speech, and text

Data2Vec
- Combines masked prediction with the learning of latent target representations, and generalizes it by using multiple network layers as targets
- Specifically, an off-the-shelf Transformer network is trained in teacher and student modes
  1. First, in teacher mode, it builds representations of the full input data, which serve as the targets of the learning task
  2. Then, in student mode, it encodes a masked version of the input sample, from which the full-data representations are predicted
- In addition, modality-specific feature encoders and masking strategies are applied for the different modalities

< Overview of Data2Vec >

A general self-supervised representation learner that supports multiple modalities on top of a standard Transformer
As a result, it matches or exceeds existing baselines by using contextualized, continuous representations

2. Method

Data2Vec is trained to predict the model representations of the full input data given a partial view of the input
- First, the masked version of the training sample is encoded by the model in student mode
- Next, the unmasked version of the input is encoded by the same model in teacher mode, whose weights are parameterized as an exponential moving average of the model weights
- The target representations encode all of the information in the training sample, and the learning task is for the student to predict these representations from a partial view of the input

- Model Architecture

The paper adopts a standard Transformer architecture with modality-specific encoding of the input data
- For computer vision, it follows the ViT strategy of encoding an image as a sequence of patches, each of which is passed through a linear transformation
- Speech data is encoded with the multi-layer 1D convolutional feature encoder of Wav2Vec 2.0, which maps the 16kHz waveform to 50Hz representations
- Text is pre-processed into sub-word units, which are embedded in distributional space through learned embedding vectors
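
The 16kHz to 50Hz mapping for speech can be checked with a little arithmetic. The sketch below assumes the kernel widths and strides of the convolutional feature encoder as given in the Wav2Vec 2.0 paper; these values are not stated in this review, and the helper names are hypothetical.

```python
# (kernel width, stride) per convolutional layer.
# Assumption: values taken from the Wav2Vec 2.0 paper, not from this review.
CONV_LAYERS = [(10, 5), (3, 2), (3, 2), (3, 2), (3, 2), (2, 2), (2, 2)]


def total_stride(layers):
    """Number of waveform samples that one output frame advances by."""
    stride = 1
    for _, s in layers:
        stride *= s
    return stride


def num_frames(num_samples, layers):
    """Output length of the stacked 1D convolutions without padding."""
    length = num_samples
    for kernel, stride in layers:
        length = (length - kernel) // stride + 1
    return length


sample_rate = 16_000
print(total_stride(CONV_LAYERS))                # 320 samples per frame
print(sample_rate / total_stride(CONV_LAYERS))  # 50.0 frames per second
print(num_frames(sample_rate, CONV_LAYERS))     # 49 frames for 1 s of audio
```

One second of audio gives 49 rather than 50 frames because the unpadded convolutions drop a few samples at the edges; the frame rate itself is 16000 / 320 = 50Hz.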

- Masking

After embedding the input sample as a token sequence, Data2Vec masks part of these units by replacing them with a learned $\text{MASK}$ embedding token and passes the sequence to the Transformer network
- For speech, spans of latent speech representations are masked, and for language, tokens are masked
- For computer vision, a block-wise masking strategy is adopted
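
Span masking for speech could look roughly like the sketch below. The start probability of 0.065 and the span length of 10 are assumed from the paper's speech setup and do not appear in this review; the function names and the toy data are hypothetical, and a fixed vector stands in for the learned MASK embedding.

```python
import random


def sample_span_mask(seq_len, start_prob, span_len, rng):
    """Return a list of booleans; True marks a masked time-step.

    Every position is chosen as a span start with probability `start_prob`,
    and the `span_len` steps from each start are masked (spans may overlap).
    """
    mask = [False] * seq_len
    for start in range(seq_len):
        if rng.random() < start_prob:
            for t in range(start, min(start + span_len, seq_len)):
                mask[t] = True
    return mask


def apply_mask(embeddings, mask, mask_embedding):
    """Replace the embedding at each masked position with the MASK vector."""
    return [list(mask_embedding) if m else list(e)
            for e, m in zip(embeddings, mask)]


rng = random.Random(0)
embeddings = [[float(t), float(t) + 0.5] for t in range(20)]  # toy (T=20, D=2)
mask_embedding = [-1.0, -1.0]  # stand-in for the learned MASK embedding
mask = sample_span_mask(len(embeddings), start_prob=0.065, span_len=10, rng=rng)
student_input = apply_mask(embeddings, mask, mask_embedding)
```

The teacher would receive `embeddings` unchanged, while the student receives `student_input`; `mask` is kept so that the loss can later be restricted to masked time-steps.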

- Training Targets

Data2Vec is trained to predict the model representations of the original unmasked training sample from an encoding of the masked sample
- Model representations are predicted only for the masked time-steps
- The predicted representations are contextualized: they encode the particular time-step, but also other information from the sample through self-attention in the Transformer network
  - This differs from existing models such as Wav2Vec 2.0 and BERT, which predict targets that lack contextual information
Teacher Parameterization
- The encoding of the unmasked training sample is parameterized by an Exponential Moving Average (EMA) of the model parameters $\theta$
  1. The model weights in target mode, $\Delta$, are then:
    (Eq. 1) $\Delta\leftarrow \tau\Delta+(1-\tau)\theta$
  2. A schedule is used for $\tau$: it is linearly increased from an initial value $\tau_{0}$ to the target value $\tau_{e}$ over the first $\tau_{n}$ updates and kept constant afterwards
    - This strategy lets the teacher be updated more frequently at the beginning of training and less frequently later in training, once good parameters have been learned
- Additionally, sharing the parameters of the feature encoder and positional encoder between the teacher and student networks gives better performance
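
A minimal sketch of (Eq. 1) together with the linear $\tau$ schedule is given below. Weights are flattened into plain lists of floats for brevity, the function names are hypothetical, and the schedule values in the usage lines are chosen for illustration rather than taken from this review.

```python
def tau_schedule(step, tau_0, tau_e, tau_n):
    """Linearly increase tau from tau_0 to tau_e over the first tau_n updates."""
    if step >= tau_n:
        return tau_e
    return tau_0 + (tau_e - tau_0) * step / tau_n


def ema_update(teacher, student, tau):
    """Eq. 1: Delta <- tau * Delta + (1 - tau) * theta, applied element-wise."""
    return [tau * d + (1.0 - tau) * th for d, th in zip(teacher, student)]


# Illustrative values (assumption): tau_0=0.999, tau_e=0.9999, tau_n=30000.
teacher = [0.0, 1.0]
student = [1.0, 1.0]
for step in range(3):
    tau = tau_schedule(step, tau_0=0.999, tau_e=0.9999, tau_n=30_000)
    teacher = ema_update(teacher, student, tau)
```

A smaller $\tau$ early on moves the teacher toward the student faster, and a $\tau$ close to 1 later in training keeps the teacher nearly fixed between updates.
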
Targets
- Training targets are built from the outputs of the top-$K$ blocks of the teacher network, for the time-steps that are masked in student mode
- First, let $a_{t}^{l}$ denote the output of block $l$ at time-step $t$
  1. To obtain the training target $y_{t}$ for time-step $t$, each block output is normalized to give $\hat{a}_{t}^{l}$, and the top-$K$ blocks of a network with $L$ blocks in total are averaged: $y_{t}=\frac{1}{K}\sum_{l=L-K+1}^{L}\hat{a}_{t}^{l}$
  2. This gives the training targets that the model regresses in student mode
    - In particular, averaging is more efficient than predicting each block separately with a dedicated projection
- Normalizing the targets prevents the model from collapsing to a constant representation for all time-steps, and prevents layers with a high norm from dominating the target features
  1. For speech representations, neighboring representations are highly correlated, so instance normalization without learned parameters is applied over the current input sample
  2. For NLP and vision, parameter-less layer normalization is adopted
    - Variance-Invariance-Covariance regularization could also be considered, but layer normalization works more effectively without additional hyper-parameters
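
The target construction can be sketched in plain Python as follows. Each block output is a list of $T$ vectors of dimension $D$; the function names and the `eps` constant are assumptions made for this illustration, and real implementations would operate on batched tensors.

```python
import math


def layer_norm(vec, eps=1e-5):
    """Parameter-less layer norm: normalize one vector over its features."""
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]


def layer_norm_seq(seq, eps=1e-5):
    """Apply layer norm independently at every time-step (NLP, vision)."""
    return [layer_norm(vec, eps) for vec in seq]


def instance_norm_seq(seq, eps=1e-5):
    """Parameter-less instance norm: normalize each channel over time (speech)."""
    channels = [layer_norm(list(channel), eps) for channel in zip(*seq)]
    return [list(step) for step in zip(*channels)]


def build_targets(block_outputs, top_k, normalize):
    """Average the normalized outputs of the top-K blocks at every time-step.

    block_outputs: list of L block outputs, ordered from bottom to top.
    """
    if not 1 <= top_k <= len(block_outputs):
        raise ValueError("top_k must be between 1 and the number of blocks")
    top_blocks = [normalize(block) for block in block_outputs[-top_k:]]
    num_steps, dim = len(top_blocks[0]), len(top_blocks[0][0])
    return [
        [sum(block[t][d] for block in top_blocks) / top_k for d in range(dim)]
        for t in range(num_steps)
    ]


# Toy teacher outputs: L=3 blocks, T=2 time-steps, D=2 features.
blocks = [
    [[2.0, 0.0], [4.0, 0.0]],
    [[0.0, 2.0], [0.0, 4.0]],
    [[1.0, 3.0], [5.0, 9.0]],
]
targets = build_targets(blocks, top_k=2, normalize=layer_norm_seq)
```

Because every block is normalized before averaging, a block whose activations happen to have a large norm contributes no more to $y_{t}$ than the others.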

- Objective

Given the contextualized training targets $y_{t}$,
- Data2Vec regresses the targets with a smooth $L_{1}$ loss:
  (Eq. 2) $\mathcal{L}(y_{t},f_{t}(x))=\left\{\begin{matrix}
  \frac{1}{2}(y_{t}-f_{t}(x))^{2}/\beta, & |y_{t}-f_{t}(x)|\leq \beta \\
  (|y_{t}-f_{t}(x)|-\frac{1}{2}\beta), & \text{otherwise} \\
  \end{matrix}\right.$
  - $\beta$ : controls the transition from the squared loss to the $L_{1}$ loss, depending on the size of the gap between the target $y_{t}$ and the model prediction $f_{t}(x)$ at time-step $t$
- The loss in (Eq. 2) has the advantage of being less sensitive to outliers
  - In exchange, $\beta$ needs to be tuned
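
(Eq. 2) can be written directly as a small function. This is a scalar sketch with hypothetical function names; the averaging over masked time-steps in `masked_loss` is an assumption about how the per-element losses are reduced.

```python
def smooth_l1(target, prediction, beta):
    """Eq. 2 for a single element: quadratic near zero, linear for large gaps."""
    diff = abs(target - prediction)
    if diff <= beta:
        return 0.5 * diff * diff / beta
    return diff - 0.5 * beta


def masked_loss(targets, predictions, mask, beta):
    """Mean smooth L1 loss over masked time-steps only (assumed reduction)."""
    losses = [
        smooth_l1(y, f, beta)
        for y_vec, f_vec, m in zip(targets, predictions, mask) if m
        for y, f in zip(y_vec, f_vec)
    ]
    return sum(losses) / len(losses) if losses else 0.0


print(smooth_l1(0.0, 0.5, beta=1.0))  # 0.125 (squared branch)
print(smooth_l1(0.0, 2.0, beta=1.0))  # 1.5   (L1 branch)
```

Both branches give $\frac{1}{2}\beta$ when the gap equals $\beta$, so the loss is continuous at the transition point.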

3. Experiments

- Settings

Dataset
- Vision : ImageNet
- NLP : Books, English Wikipedia
- Speech : LibriSpeech
Comparisons
- Vision : BEiT, PeCo, MoCo, DINO, MAE, SimMIM, iBOT, MaskFeat
- NLP : BERT
- Speech : Wav2Vec 2.0, HuBERT, WavLM

- Results

Computer Vision
- Data2Vec outperforms prior single-model methods
- It is also on par with PeCo among approaches that use multiple models

Speech and Audio Processing
- In speech processing, it also outperforms prior work

It also achieves the highest mAP in the pre-training setup

Audio event classification performance with pre-training

Natural Language Processing
- Data2Vec also outperforms BERT in a direct comparison

Layer-Averaged Targets
- Targets based on multiple layers improve over using only the top layer ($K=1$) for all modalities
- That is, using features from multiple layers enriches the self-supervised task and improves accuracy

Target Contextualization
- Using a larger context size for the targets yields better downstream performance

Target Feature Type
- Using the Feed-Forward Network (FFN) block output as the target feature gives the best performance
- In contrast, features from the self-attention block are heavily biased toward other time-steps
