[Paper Review] Emotion2Vec: Self-Supervised Pre-Training for Speech Emotion Representation

feVeRin 2025. 5. 24. 07:41

Emotion2Vec: Self-Supervised Pre-Training for Speech Emotion Representation

A universal speech emotion representation is needed
Emotion2Vec
- Pre-trained on unlabeled emotion data through self-supervised Online Distillation
- Combines an utterance-level loss and a frame-level loss during pre-training
Paper (ACL 2024) : Paper Link

1. Introduction

Emotion is usually extracted from speech using Filter Bank (FBank) or MFCC features
- BUT, these features lack rich semantic information, which limits their usefulness for emotional tasks
- Meanwhile, recent Self-Supervised Learning (SSL) models perform well as feature extractors
  - BUT, these speech-based SSL models are not well suited to emotion tasks

-> Emotion2Vec, a universal speech-based emotion representation model, is therefore proposed

Emotion2Vec
- Performs self-supervised pre-training on 262 hours of open-source emotion data using an Online Distillation paradigm
- Combines an utterance-level loss and a frame-level loss to capture both global information and local detail

< Overview of Emotion2Vec >

A universal emotion representation built with Online Distillation and utterance- and frame-level losses
As a result, it outperforms existing methods on a variety of emotion tasks

2. Method

Emotion2Vec combines an utterance-level loss and a frame-level loss within Online Distillation
- Combining the two losses reflects that global and local information each convey the emotion in speech
- In addition, initializing the teacher-student networks warms up the Online Distillation process and provides a better representation for self-supervised bootstrap learning

- Model Pipeline

Emotion2Vec uses a teacher network $\mathcal{T}$ and a student network $\mathcal{S}$ in the pre-training phase
- Structurally, each consists of a feature extractor $\mathcal{F}$ built from a multi-layer convolutional neural network and a backbone network $\mathcal{B}$ built from a multi-layer Transformer
- Given a raw audio utterance $X=[x_{1},...,x_{N_{x}}]$,
  1. Teacher $\mathcal{T}$ and student $\mathcal{S}$ use their feature extractors $\mathcal{F}^{\mathcal{T}}, \mathcal{F}^{\mathcal{S}}$ to obtain downsampled features $Z_{0}=[z_{1},...,z_{N_{z}}]$:
    (Eq. 1) $Z_{0}^{\mathcal{T}}=\mathcal{F}^{\mathcal{T}}(X)$
    (Eq. 2) $Z_{0}^{\mathcal{S}}=\mathcal{F}^{\mathcal{S}}(X)$
  2. For the teacher network $\mathcal{T}$, the downsampled features $Z_{0}^{\mathcal{T}}$ are fed directly into the backbone network $\mathcal{B}^{\mathcal{T}}$
    - For the student network $\mathcal{S}$, the downsampled features $Z_{0}^{\mathcal{S}}$ are masked: each frame is selected with probability $p$, and $l$ consecutive frames are masked from it
  3. The learnable utterance embedding $U=[u_{1},...,u_{N_{u}}]$ is then fed into the student backbone network $\mathcal{B}^{\mathcal{S}}$ together with the masked features:
    (Eq. 3) $Z_{i}^{\mathcal{T}}=\mathcal{B}^{\mathcal{T}}_{i}\left(Z_{i-1}^{\mathcal{T}}\right)$
    (Eq. 4) $Y^{\mathcal{T}}=\frac{1}{k}\sum_{i=n-k+1}^{n}Z_{i}^{\mathcal{T}}$
    (Eq. 5) $U^{\mathcal{S}};Y^{\mathcal{S}}=\mathcal{B}^{\mathcal{S}}\left(U;\text{Mask}(Z_{0}^{\mathcal{S}})\right)$
    - $Y^{\mathcal{T}}$ : average of the output embeddings of the top-$k$ blocks among the $n$ Transformer blocks $\mathcal{B}^{\mathcal{T}}_{i}$, $\text{Mask}$ : masking operation
    - The utterance-level output embedding $U^{\mathcal{S}}$ and the frame-level output embedding $Y^{\mathcal{S}}$ are the outputs of the student backbone network $\mathcal{B}^{\mathcal{S}}$
- $Y^{\mathcal{T}}, Y^{\mathcal{S}},U^{\mathcal{S}}$ share the same hidden layer dimension
  - $Y^{\mathcal{T}}, Y^{\mathcal{S}}$ have the same temporal dimension $N_{z}$, while $U^{\mathcal{S}}$ has temporal dimension $N_{u}$
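
The student-side masking and the teacher target of Eq. 4 can be sketched in plain Python. This is only an illustration, not the paper's implementation: the function names are hypothetical, nested lists stand in for tensors, and letting sampled spans overlap is an assumption.

```python
import random

def sample_span_mask(num_frames, p, span_len, rng):
    """Pick each frame as a span start with probability p and mask the
    span_len consecutive frames from it (assumption: spans may overlap)."""
    mask = [False] * num_frames
    for start in range(num_frames):
        if rng.random() < p:
            for i in range(start, min(start + span_len, num_frames)):
                mask[i] = True
    return mask

def average_top_k_layers(layer_outputs, k):
    """Eq. 4: average the outputs of the last k of the n Transformer blocks.
    layer_outputs[i][t] is the feature vector of frame t after block i + 1."""
    assert 1 <= k <= len(layer_outputs)
    top = layer_outputs[-k:]
    num_frames, dim = len(top[0]), len(top[0][0])
    return [[sum(layer[t][d] for layer in top) / k for d in range(dim)]
            for t in range(num_frames)]
```

For example, `sample_span_mask(100, p, l, random.Random(0))` returns one boolean per frame, and the `True` positions form the index set over which the frame-level loss is later computed.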

- Utterance-level Loss

The utterance-level loss defines an utterance-level pretext task for learning global emotion
- The paper uses Mean Squared Error (MSE) to compute the loss:
  (Eq. 6) $\mathcal{L}_{Utt}=\left(\bar{Y}^{\mathcal{T}}-\bar{U}^{\mathcal{S}}\right)^{2}$
- where:
  (Eq. 7) $\bar{Y}^{\mathcal{T}}=\frac{1}{N_{z}}\sum_{i=1}^{N_{z}}Y_{i}^{\mathcal{T}}$
  (Eq. 8) $\bar{U}^{\mathcal{S}}=\frac{1}{N_{u}}\sum_{i=1}^{N_{u}}U_{i}^{\mathcal{S}}$
  - That is, the utterance-level loss $\mathcal{L}_{Utt}$ is computed from the temporal pooling results of $Y^{\mathcal{T}},U^{\mathcal{S}}$
- The utterance-level loss can be computed in the following 3 ways:
  1. Token Embedding
    - Token embedding uses a single token to represent the global emotion information encoded by the student network $\mathcal{S}$
    - That is, $N_{u}=1$ in the learnable utterance embedding $U=[u_{1},...,u_{N_{u}}]$
  2. Chunk Embedding
    - Chunk embedding uses multiple tokens to represent the global emotion information
    - In this case, more global information can be aggregated within the chunk
  3. Global Embedding
    - Global embedding does not add any additional utterance token
    - Instead of $U^{\mathcal{S}}$, the temporal pooling of the frame-level output embedding $Y^{\mathcal{S}}$ is used to compute the loss
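
A minimal sketch of Eq. 6 to Eq. 8 covering the three variants, written in plain Python with nested lists standing in for tensors. The function names are hypothetical, and averaging the squared error over the hidden dimension is an assumption, since Eq. 6 does not say how the vector difference is reduced.

```python
def temporal_mean(seq):
    """Average a sequence of vectors over time (Eq. 7 and Eq. 8)."""
    n, dim = len(seq), len(seq[0])
    return [sum(v[d] for v in seq) / n for d in range(dim)]

def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def utterance_loss(y_teacher, u_student=None, y_student=None):
    """Eq. 6. Token or chunk embedding: pass u_student (N_u = 1 or N_u > 1).
    Global embedding: pass y_student, which is pooled in place of U^S."""
    student_seq = u_student if u_student is not None else y_student
    if student_seq is None:
        raise ValueError("pass either u_student or y_student")
    return mse(temporal_mean(y_teacher), temporal_mean(student_seq))
```

Token and chunk embedding differ only in how many rows `u_student` has, whereas global embedding reuses the student's frame-level outputs and so needs no extra tokens.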

- Frame-level Loss

The frame-level loss defines a frame-wise pretext task for learning context emotion
- Following Masked Language Modeling (MLM), the paper computes the loss only on the masked part
- The resulting frame-level loss $\mathcal{L}_{Frm}$ is:
  (Eq. 9) $\mathcal{L}_{Frm}=\frac{1}{M}\sum_{i\in\mathbb{M}}\left(Y_{i}^{\mathcal{T}}-Y_{i}^{\mathcal{S}}\right)^{2}$
  - $\mathbb{M}$ : index sequence of the masked frames of the frame-level output embedding $Y^{\mathcal{S}}$
  - $M$ : total number of masked tokens
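
Eq. 9 can be sketched as below, in plain Python with nested lists standing in for tensors. The function name is hypothetical, and two details are assumptions rather than something the text states: the per-frame squared error is averaged over the hidden dimension, and an empty mask yields a loss of 0.

```python
def frame_loss(y_teacher, y_student, mask):
    """Eq. 9: mean squared error between teacher and student frame
    embeddings, taken only over frames where mask[i] is True."""
    masked = [i for i, is_masked in enumerate(mask) if is_masked]
    if not masked:
        return 0.0  # assumption: no masked frames means no frame-level loss
    total = 0.0
    for i in masked:
        dim = len(y_teacher[i])
        total += sum((t - s) ** 2
                     for t, s in zip(y_teacher[i], y_student[i])) / dim
    return total / len(masked)
```

Unmasked frames never enter the sum, so the student is trained only on positions it had to reconstruct from context.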

- Online Distillation

Online distillation is a self-supervised learning strategy for teacher-student learning
- The student network is updated by backpropagation, and the teacher network is updated by an Exponential Moving Average (EMA)
  1. For the student network $\mathcal{S}$, the total loss $\mathcal{L}$ used for backpropagation combines the frame-level loss $\mathcal{L}_{Frm}$ and the utterance-level loss $\mathcal{L}_{Utt}$:
    (Eq. 10) $\mathcal{L}=\mathcal{L}_{Frm}+\alpha\mathcal{L}_{Utt}$
    - $\alpha$ : tunable weight
  2. For the teacher network $\mathcal{T}$, the parameters $\theta_{0}^{\mathcal{T}}$ are initialized identically to the student parameters $\theta_{0}^{\mathcal{S}}$ and then updated by EMA in each mini-batch:
    (Eq. 11) $\theta_{t+1}^{\mathcal{T}}=\tau\theta_{t}^{\mathcal{T}}+(1-\tau)\theta_{t+1}^{\mathcal{S}}$
    - $\tau$ : parameter that increases linearly during pre-training
- That is, in each mini-batch the parameters of the teacher feature extractor $\mathcal{F}^{\mathcal{T}}$ are copied directly from $\mathcal{F}^{\mathcal{S}}$, and the parameters of the teacher backbone network $\mathcal{B}^{\mathcal{T}}$ are updated as an EMA of $\mathcal{B}^{\mathcal{S}}, \mathcal{B}^{\mathcal{T}}$
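
Eq. 10 and Eq. 11 can be sketched in plain Python as below. The names are hypothetical and parameters are stored as dictionaries of floats purely for illustration. The `tau_start` and `tau_end` arguments and holding $\tau$ constant after `total_steps` are assumptions, since the text only says that $\tau$ increases linearly.

```python
def total_loss(frame_loss_value, utt_loss_value, alpha):
    """Eq. 10: combined loss backpropagated through the student."""
    return frame_loss_value + alpha * utt_loss_value

def linear_tau(step, total_steps, tau_start, tau_end):
    """tau grows linearly from tau_start to tau_end over total_steps,
    then stays at tau_end (assumed schedule shape)."""
    frac = min(step / total_steps, 1.0)
    return tau_start + (tau_end - tau_start) * frac

def update_teacher(teacher, student, tau):
    """One teacher update after a student step.
    The feature extractor is copied from the student; the backbone
    follows the EMA rule of Eq. 11."""
    teacher["extractor"] = dict(student["extractor"])
    teacher["backbone"] = {
        name: tau * weight + (1.0 - tau) * student["backbone"][name]
        for name, weight in teacher["backbone"].items()
    }
    return teacher
```

As $\tau$ approaches 1 the teacher backbone changes more slowly, so its targets become more stable later in pre-training.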

3. Experiments

- Settings

Dataset : see the table below
Comparisons : Wav2Vec, Wav2Vec 2.0, VQ-Wav2Vec, HuBERT, WavLM, Data2Vec, Data2Vec 2.0

- Results

Overall, Emotion2Vec achieves the best performance

Language Generalization
- Emotion2Vec performs well on the English dataset

Emotion2Vec also achieves the best performance on languages other than English

Song Emotion Recognition
- Emotion2Vec achieves the best performance even without additional fine-tuning

Emotion Prediction in Conversation
- Emotion2Vec is also effective on the Emotion Prediction in Conversation task

Sentiment Analysis
- For Sentiment Analysis, Emotion2Vec also achieves the best results on CMU-MOSI and CMU-MOSEI

Visualization
- With UMAP visualization, (a) WavLM and (b) Data2Vec in the figure below show heavy overlap between high- and low-arousal emotion classes
- In contrast, (c) Emotion2Vec shows the high- and low-arousal representations forming clusters
