[Paper Review] ContentVec: An Improved Self-Supervised Speech Representation by Disentangling Speakers

feVeRin 2025. 5. 18. 08:38

ContentVec: An Improved Self-Supervised Speech Representation by Disentangling Speakers

A speech representation should be able to disentangle unwanted variation
ContentVec
- Performs speaker disentanglement without loss of content
- Builds on HuBERT and introduces disentangling methods that regularize both the teacher and the student
Paper (ICML 2022) : Paper Link

1. Introduction

Speech Self-Supervised Learning (SSL) methods such as HuBERT train a representation network on large-scale unannotated corpora to capture meaningful speech structure and information
- The resulting speech representation is then used to train downstream tasks
  - i.e., a well-structured speech representation reduces the dependency of downstream tasks on large-scale datasets
- A desirable speech representation should effectively disentangle content information from interfering variations
  - BUT, existing SSL methods are limited in disentangling speaker variation

-> The paper therefore proposes ContentVec, which disentangles speaker variation without losing content

ContentVec
- Incorporates 3 disentangling mechanisms on top of HuBERT
  1. Disentanglement in Teacher
    - Aims to remove speaker information from the teacher labels
  2. Disentanglement in Student
    - Introduces a regularization loss that enforces speaker invariance on the speech representation
  3. Speaker Conditioning
    - Feeds speaker information into the masked prediction task, which relieves the speech representation of the need to encode speaker information
- Speaker disentanglement gives downstream tasks more targeted information and supports powerful content processing

< Overview of ContentVec >

An SSL method built on HuBERT that effectively disentangles speaker variation
As a result, it achieves better performance than existing methods

2. Method

Let upper-case letters $X,\mathbf{X}$ denote a random scalar and a random vector respectively, and lower-case letters $x,\mathbf{x}$ denote a deterministic scalar and a deterministic vector

- Problem Formulation

For a total of $T$ frames, let $\mathbf{X}_{t}$ be the speech feature vector at frame $t$ and $\mathbf{X}=[\mathbf{X}_{1},...,\mathbf{X}_{T}]$ the speech feature sequence
- ContentVec aims to learn a speech representation network $\mathbf{R}=f(\mathbf{X})$, where $\mathbf{R}_{t}$ is the representation of frame $t$ and $\mathbf{R}=[\mathbf{R}_{1},...,\mathbf{R}_{T}]$ is the representation sequence
- $\mathbf{R}$ should satisfy the following 2 properties:
  1. $\mathbf{R}$ should preserve as much content information as possible, where content information corresponds to the phonetic/text transcription of the utterance
  2. $\mathbf{R}$ should be invariant to speaker variation

- The General Framework

ContentVec builds on the masked prediction framework of HuBERT
- HuBERT consists of 3 components: a speech representation network $f(\cdot)$, a predictor $p(\cdot)$, and a teacher label generator $g(\cdot)$
- During training, the speech representation network takes a partially masked speech utterance $\tilde{\mathbf{X}}$ as input and produces the representation of the masked speech sequence $\tilde{\mathbf{R}}=f(\tilde{\mathbf{X}})$
  1. The teacher label generator produces a label sequence $\mathbf{L}=g(\mathbf{X})$ from the unmasked speech, and the predictor aims to predict the teacher labels $\mathbf{L}$ from the masked speech representation $\tilde{\mathbf{R}}$
    - The teacher label generator $g(\cdot)$ is pre-defined and fixed during training
  2. As a result, $f(\cdot), p(\cdot)$ are jointly trained to minimize the following prediction loss:
    (Eq. 1) $ \mathcal{L}_{pred}=\mathbb{E}\left[\ell_{m}( p\circ f(\tilde{\mathbf{X}}),g(\mathbf{X}))\right]$
    - $\ell_{m}$ : cross-entropy loss on the masked frames, $f(\tilde{\mathbf{X}})$ : student, $g(\mathbf{X})$ : teacher
- Even when the HuBERT teacher is poor, the student can preserve content better than the teacher thanks to the masked prediction mechanism
  1. ContentVec therefore combines speaker disentanglement techniques, which may lose content, with the masked prediction mechanism to improve content preservation
  2. Specifically, it introduces 3 components: Disentanglement in Teacher, Disentanglement in Student, and Speaker Conditioning
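
The masked prediction loss of (Eq. 1) can be illustrated with a minimal sketch. It assumes that $\ell_{m}$ averages the cross-entropy over the masked frames (the review does not say whether it is a mean or a sum), and the function name and list-based inputs are hypothetical, chosen only for illustration.

```python
import math


def masked_cross_entropy(logits, labels, mask):
    """Mean cross-entropy over masked frames only (averaging is an assumption).

    logits: list of T rows, each a list of C predictor scores
    labels: list of T integer teacher labels
    mask:   list of T booleans, True where the input frame was masked
    """
    losses = []
    for row, label, is_masked in zip(logits, labels, mask):
        if not is_masked:
            continue  # unmasked frames do not contribute to the loss
        # Numerically stable log-sum-exp over the class scores.
        top = max(row)
        log_norm = top + math.log(sum(math.exp(v - top) for v in row))
        # Negative log-likelihood of the teacher label for this frame.
        losses.append(log_norm - row[label])
    if not losses:
        raise ValueError("at least one frame must be masked")
    return sum(losses) / len(losses)
```

Here `logits` plays the role of $p\circ f(\tilde{\mathbf{X}})$ and `labels` the role of $g(\mathbf{X})$. Changing the scores of an unmasked frame leaves the returned value unchanged.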

- Disentangling in Teachers

Disentanglement in Teacher aims to remove speaker information from the teacher labels by introducing a voice conversion model that converts all utterances to the same speaker before the teacher labels are generated
- The teacher labels $\mathbf{L}=g(\mathbf{X})$ are obtained in 3 steps:
  1. First, all utterances $\mathbf{X}$ in the training set are converted to a single speaker using unsupervised voice conversion
  2. The converted utterances are passed through HuBERT, a pre-trained unsupervised speech representation network, to produce speech representations that contain little speaker information
  3. Finally, the speech representations are quantized into discrete teacher labels by $k$-means clustering
- The teacher speech representation achieves speaker disentanglement but is limited in content preservation
  - This is because the voice conversion system introduces non-negligible content loss
- The paper therefore does not apply this output directly to downstream tasks, but uses it as the teacher for training the student
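
The quantization in step 3 can be sketched as a nearest-centroid lookup. This is only an illustration: it assumes the $k$-means centroids have already been fitted, and the function name and list-based feature vectors are hypothetical stand-ins for the HuBERT outputs of the converted utterances.

```python
def assign_teacher_labels(features, centroids):
    """Map each frame feature to the index of its nearest centroid.

    features:  list of T frame vectors (stand-ins for HuBERT outputs)
    centroids: list of K cluster centers, assumed already fitted by k-means
    """
    labels = []
    for frame in features:
        # Squared Euclidean distance from this frame to every centroid.
        distances = [
            sum((a - b) ** 2 for a, b in zip(frame, center))
            for center in centroids
        ]
        # The discrete teacher label is the index of the closest centroid.
        labels.append(distances.index(min(distances)))
    return labels
```

The returned list corresponds to the label sequence $\mathbf{L}$, one integer per frame.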

- Disentangling in Students

Disentanglement in Student enforces a speaker-invariant student representation through a contrastive learning-based algorithm
- Each speech utterance $\mathbf{X}$ is passed through two random transformations that alter only the speaker information, and is then masked
  1. Let $\tilde{\mathbf{X}}^{(1)}, \tilde{\mathbf{X}}^{(2)}$ denote the two masked, transformed copies of $\mathbf{X}$
  2. This utterance pair is passed through the speech representation network $f(\cdot)$ to produce the representations $\mathbf{R}^{(1)}, \mathbf{R}^{(2)}$
  3. The following contrastive loss is then applied to penalize dissimilarity between $\mathbf{R}^{(1)},\mathbf{R}^{(2)}$:
    (Eq. 2) $ \mathcal{L}_{contr}=\sum_{t=1}^{T}\frac{\exp\left(\text{cossim}(\mathbf{R}_{t}^{(1)},\mathbf{R}_{t}^{(2)})/k \right)}{\sum_{\tau\in \{t\}\cup\mathcal{I}_{t}}\exp\left( \text{cossim}(\mathbf{R}^{(1)}_{t},\mathbf{R}_{\tau}^{(1)}) /k\right) }+ \sum_{t=1}^{T}\frac{\exp\left(\text{cossim}(\mathbf{R}_{t}^{(2)},\mathbf{R}_{t}^{(1)}) /k\right)}{\sum_{\tau\in\{t\}\cup\mathcal{I}_{t}} \exp\left(\text{cossim}(\mathbf{R}_{t}^{(2)},\mathbf{R}_{\tau}^{(2)}) /k\right)}$
    - $\text{cossim}(\cdot, \cdot)$ : cosine similarity, $\mathcal{I}_{t}$ : set of random time indices whose representations are chosen as negative examples for time $t$
- The contrastive loss consists of 2 terms that are symmetric in $\mathbf{R}^{(1)},\mathbf{R}^{(2)}$
  1. Following (Eq. 2), the negative examples for the pair $\left(\mathbf{R}_{t}^{(1)}, \mathbf{R}_{t}^{(2)}\right)$ are drawn uniformly at random from the remaining frames of the same utterance
  2. As an extension of (Eq. 2), the contrastive loss can also be applied to an intermediate layer of $f(\cdot)$ rather than only the final layer
- When applying the contrastive loss, it is important to construct random transformations that alter only the speaker identity of the utterance, so the paper adopts the random transformation algorithm of NANSY
  1. First, all formant frequencies within the utterance are scaled by a factor of $\rho_{1}$
  2. Next, the $F0$ of every frame is scaled by a factor of $\rho_{2}$
    - $\rho_{1},\rho_{2}$ are drawn randomly from the uniform distribution $\mathcal{U}\left([1,1.4]\right)$ and, with probability $0.5$, flipped to their reciprocals
  3. Finally, a random equalizer is applied to accommodate any channel effects
    - Most voice information lies in the ranges of the formant frequencies and $F0$, whereas content information lies in the relative formant frequency ratios
    - Uniformly scaling all formants and $F0$ can therefore change the speaker information while retaining the content
- To further strengthen invariance, the same random transformations are also applied to the student representations in the masked prediction task, which modifies (Eq. 1) as follows:
  (Eq. 3) $\mathcal{L}_{pred}=\mathbb{E}\left[\ell_{m}\left( p\circ f(\tilde{\mathbf{X}}^{(1)}),g(\mathbf{X})\right) +\ell_{m}\left( p\circ f(\tilde{\mathbf{X}}^{(2)}),g(\mathbf{X})\right)\right]$
  - Likewise, the masked prediction loss is applied symmetrically to both $f(\tilde{\mathbf{X}}^{(1)})$ and $f(\tilde{\mathbf{X}}^{(2)})$
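
Two pieces of this section are concrete enough for a small sketch: drawing the scaling factors $\rho_{1},\rho_{2}$, and evaluating the expression in (Eq. 2). The code below is an illustration under stated assumptions, not the authors' implementation. All function names are hypothetical, the default `k=0.1` is an arbitrary placeholder, and representations are plain lists of non-zero frame vectors. (Eq. 2) is evaluated literally as written above, so a larger value means the two views agree more. How that value is turned into a quantity to minimize (for example by negating it or taking a negative log) is not spelled out in this review and is left out.

```python
import math
import random


def sample_scale_factor(rng, low=1.0, high=1.4, flip_prob=0.5):
    """Draw rho from U([low, high]) and flip it to 1/rho with flip_prob."""
    rho = rng.uniform(low, high)
    return 1.0 / rho if rng.random() < flip_prob else rho


def cossim(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def sample_negatives(num_frames, num_neg, rng):
    """For each frame t, draw I_t uniformly from the remaining frames."""
    return [
        rng.sample([tau for tau in range(num_frames) if tau != t], num_neg)
        for t in range(num_frames)
    ]


def contrastive_term(anchor, other, negatives, k):
    """One of the two symmetric sums in (Eq. 2), evaluated as written.

    anchor, other: lists of T frame vectors from the two views
    negatives:     negatives[t] holds the sampled indices I_t for frame t
    k:             the scale that divides each cosine similarity
    """
    total = 0.0
    for t, neg_idx in enumerate(negatives):
        # Numerator: similarity between the two views at the same frame.
        numerator = math.exp(cossim(anchor[t], other[t]) / k)
        # Denominator: frame t itself plus its negatives, within the anchor view.
        denominator = sum(
            math.exp(cossim(anchor[t], anchor[tau]) / k)
            for tau in [t] + list(neg_idx)
        )
        total += numerator / denominator
    return total


def contrastive_objective(r1, r2, negatives, k=0.1):
    """Sum of both symmetric terms of (Eq. 2); k=0.1 is a placeholder value."""
    return contrastive_term(r1, r2, negatives, k) + contrastive_term(r2, r1, negatives, k)
```

A typical call would draw `negatives = sample_negatives(T, num_neg, rng)` once per utterance and pass the same index sets to both terms, matching the shared $\mathcal{I}_{t}$ in (Eq. 2). `num_neg` must not exceed $T-1$.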

- Speaker Conditioning

Disentanglement in Teacher removes most of the speaker information from the teacher labels, but some speaker information may remain
- As a result, the student representation is undesirably forced to carry as much speaker information as the teacher in order to predict the teacher labels reasonably well
- To break this entailment between teacher and student, the paper feeds a speaker embedding to the predictor
  1. The speaker embedding is obtained from a pre-trained GE2E model
  2. By conditioning the predictor on the speaker embedding, the paper supplies the speaker information needed for the masked prediction task, so the student does not have to carry it
    - Speaker labels are used only when pre-training the speaker embedding network; ContentVec training uses speaker embeddings, not speaker labels
- Concretely, the masked prediction loss becomes:
  (Eq. 4) $\mathcal{L}_{pred}=\mathbb{E}\left[\ell_{m} \left(p(f(\tilde{\mathbf{X}}^{(1)}),s(\mathbf{X})),g(\mathbf{X}) \right)+ \ell_{m}\left(p(f(\tilde{\mathbf{X}}^{(2)}),s(\mathbf{X})), g(\mathbf{X})\right)\right]$
  - $s(\mathbf{X})$ : speaker embedding
- The final loss combines the prediction and contrastive losses:
  (Eq. 5) $\mathcal{L}=\mathcal{L}_{pred}+\lambda\mathcal{L}_{contr}$
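
A minimal sketch of how a predictor might consume $s(\mathbf{X})$, and of the weighting in (Eq. 5). The review does not describe how the speaker embedding enters the predictor, so the frame-wise concatenation and the single linear layer below are assumptions made purely for illustration. All function and parameter names are hypothetical.

```python
def condition_on_speaker(reps, spk_emb):
    """Append the utterance-level speaker embedding to every frame vector.

    Concatenation is an assumed conditioning scheme, not taken from the paper.
    """
    return [list(frame) + list(spk_emb) for frame in reps]


def predictor_logits(reps, spk_emb, weight, bias):
    """Linear predictor over [frame ; speaker] features (an assumed form of p).

    reps:    list of T frame vectors of size D, i.e. f(X~)
    spk_emb: speaker embedding of size D_s, i.e. s(X)
    weight:  list of C rows, each of length D + D_s
    bias:    list of C offsets
    """
    conditioned = condition_on_speaker(reps, spk_emb)
    return [
        [sum(w * x for w, x in zip(row, frame)) + b for row, b in zip(weight, bias)]
        for frame in conditioned
    ]


def total_loss(l_pred, l_contr, lam):
    """(Eq. 5): prediction loss plus lambda-weighted contrastive loss."""
    return l_pred + lam * l_contr
```

Because the speaker embedding is available to the predictor directly, the logits can depend on the speaker even when `reps` carries no speaker information, which is the point of the conditioning.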

- An Information Flow Perspective

Under the 3 modules above, the speaker information changes across the layers of the speech representation network $f(\cdot)$ and the predictor $p(\cdot)$, as shown in the figure below
- At the input, the speaker information equals the full speaker information of the input utterance, while at the output it should be roughly at the level of the speaker information in the teacher labels
  - In particular, except at the predictor layer where speaker information is re-injected, the speaker information decreases monotonically across layers due to the data processing inequality
- The figure below shows 2 points where the speaker information changes sharply
  1. First, the speaker information drops sharply where the contrastive loss of (Eq. 2) is applied
  2. Next, the speaker information increases where speaker information is re-injected
    - As a result, the speaker information should reach its minimum at the intersection between the speech representation network and the predictor
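
One way to read the monotonic decrease, as a hedged sketch: let $S$ denote the speaker identity and $\mathbf{R}^{(l)}$ the output of layer $l$ of $f(\cdot)$ (both symbols are introduced here for illustration and do not appear in the review). Each layer only sees the output of the previous one, so $S \rightarrow \mathbf{R}^{(l)} \rightarrow \mathbf{R}^{(l+1)}$ forms a Markov chain and the data processing inequality gives:
$I\left(S;\mathbf{R}^{(l+1)}\right) \le I\left(S;\mathbf{R}^{(l)}\right)$
The bound no longer applies at the predictor, where $s(\mathbf{X})$ enters as an additional input, which is consistent with the increase described in item 2.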

3. Experiments

- Settings

Dataset : LibriSpeech
Comparisons : Wav2Vec 2.0, HuBERT

- Results

Overall, ContentVec achieves the best performance

SUPERB Experiments
- ContentVec also achieves the best performance on the SUPERB framework

Speaker & Accent Classification
- On speaker identification (SID), ContentVec shows the lowest identification performance
- i.e., ContentVec effectively disentangles speaker and accent

Voice Conversion
- Using ContentVec for the voice conversion task achieves a strong cosine similarity

Ablation Study
- Removing any of the components degrades performance

Layer-wise, SID drops at the point where the contrastive loss is applied
