[Paper Review] DistilHuBERT: Speech Representation Learning by Layer-Wise Distillation of Hidden-Unit BERT

feVeRin 2025. 4. 17. 20:15

DistilHuBERT: Speech Representation Learning by Layer-Wise Distillation of Hidden-Unit BERT

Existing self-supervised speech representation learning methods require large memory and high pre-training cost
DistilHuBERT
- A multi-task learning framework that directly distills hidden representations from HuBERT
- This reduces the size of HuBERT by 75%
Paper (ICASSP 2022) : Paper Link

1. Introduction

Self-Supervised Learning (SSL) methods for speech representation, such as Wav2Vec, are trained on unlabeled speech data
- SSL representations can be grouped into generative and discriminative methods:
  1. Generative methods reconstruct masked acoustic features or generate future acoustic features
  2. Discriminative methods are trained through contrastive learning or pseudo-label classification
- BUT, existing speech SSL methods such as Wav2Vec 2.0 and HuBERT require substantial memory and training cost
  - Knowledge Distillation can be used to compress such models effectively

-> So the paper proposes DistilHuBERT, which compresses HuBERT through Knowledge Distillation

DistilHuBERT
- Distills HuBERT using 3 prediction heads that predict the 4th, 8th, and 12th HuBERT hidden layer outputs, respectively
- The multi-task learning paradigm lets the model learn a representation that contains rich information

< Overview of DistilHuBERT >

A HuBERT-based SSL speech representation obtained with Knowledge Distillation
As a result, it is 75% smaller than HuBERT and 73% faster at inference

2. Method

- HuBERT

The paper treats HuBERT as the teacher for building the speech representation
- HuBERT consists of a CNN and a transformer encoder, and classifies randomly masked frames into pseudo labels
  - The labels are obtained by clustering MFCCs or the hidden units of another model
- BUT, HuBERT has the following drawbacks:
  1. HuBERT uses 95M ~ 1B parameters, so it is memory-consuming and slow at inference
  2. It has a high training cost of more than 2K GPU hours
- The paper therefore aims to address these limitations
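The pseudo labels mentioned above can be illustrated with a minimal sketch. It assumes the cluster centroids have already been fit (for example with k-means on MFCC frames) and only shows the assignment step that turns each frame into a discrete label. The function names `nearest_centroid` and `pseudo_labels` are hypothetical and are not taken from the HuBERT code.

```python
def nearest_centroid(frame, centroids):
    """Return the index of the centroid closest to one feature frame."""
    def squared_distance(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))

    return min(range(len(centroids)),
               key=lambda k: squared_distance(frame, centroids[k]))


def pseudo_labels(frames, centroids):
    """Assign one discrete cluster label to every frame of an utterance."""
    return [nearest_centroid(frame, centroids) for frame in frames]


# Toy 2-dimensional "features" with two assumed centroids.
centroids = [[0.0, 0.0], [10.0, 10.0]]
frames = [[1.0, 0.0], [9.0, 11.0], [0.0, -1.0]]
labels = pseudo_labels(frames, centroids)
```

During masked prediction, labels like these would serve as the classification targets for the masked frames.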

- DistilHuBERT

Different layers of a self-supervised speech model contain different information, such as speaker identity and semantics
- BUT, the output layer of an SSL model does not always provide the richest information
  1. For example, Wav2Vec 2.0 stores phonetic information in its middle layers, so learning only from the teacher's last layer is not effective
  2. Alternatively, the student's hidden layers can be trained to learn from different layers of the teacher
- The paper therefore introduces a teacher-student framework based on multi-task knowledge distillation
  - Distilling HuBERT this way yields DistilHuBERT, which consists of a CNN feature extractor and a small transformer encoder
- The distillation objective is to predict multiple hidden representations of the teacher from a shared representation
  1. The paper uses separate prediction heads to predict the teacher's hidden representations
  2. The objective then becomes a multi-task learning paradigm, and the transformer encoder produces a compact representation that serves multiple prediction heads
  3. After pre-training, the heads are removed and the model parameters are frozen for use in various downstream tasks
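The shared-representation-plus-heads idea above can be sketched as follows. This is only an illustration under stated assumptions: the encoder is an identity placeholder standing in for the CNN and small transformer, each head is assumed to be a single linear projection, and the class and method names (`StudentWithHeads`, `predict_teacher_layers`, `drop_heads`) are hypothetical.

```python
import random


def linear(x, weight, bias):
    """Apply y = W x + b to a single frame x (a list of floats)."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weight, bias)]


def make_head(dim_in, dim_out, rng):
    """Create one randomly initialized linear prediction head."""
    weight = [[rng.uniform(-0.1, 0.1) for _ in range(dim_in)]
              for _ in range(dim_out)]
    bias = [0.0] * dim_out
    return weight, bias


class StudentWithHeads:
    def __init__(self, dim, target_layers=(4, 8, 12), seed=0):
        rng = random.Random(seed)
        # One separate head per teacher layer that the student predicts.
        self.heads = {layer: make_head(dim, dim, rng)
                      for layer in target_layers}

    def encode(self, frames):
        # Placeholder for the CNN + small transformer encoder (assumption).
        return frames

    def predict_teacher_layers(self, frames):
        """Map the shared representation to one prediction per teacher layer."""
        shared = self.encode(frames)
        return {layer: [linear(x, w, b) for x in shared]
                for layer, (w, b) in self.heads.items()}

    def drop_heads(self):
        # After pre-training only the shared encoder is kept.
        self.heads = {}


frames = [[1.0, 2.0, 3.0], [0.0, 1.0, 0.0]]
model = StudentWithHeads(dim=3)
predictions = model.predict_teacher_layers(frames)
```

After `model.drop_heads()`, downstream tasks would consume only `model.encode(frames)`, which mirrors the removal of the heads after pre-training.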
Objective Function
- Let $\hat{\mathbf{h}}_{t}^{(l)}$ and $\mathbf{h}_{t}^{(l)}$ denote the $D$-dimensional feature vectors at time $t$ produced by the teacher's $l$-th layer and by the corresponding student prediction head, respectively
- The loss function is then:
  (Eq. 1) $ \mathcal{L}^{(l)}=\mathcal{L}_{\ell 1}^{(l)}+\lambda\mathcal{L}_{\cos}^{(l)} = \sum_{t=1}^{T}\left[\frac{1}{D}\left\| \mathbf{h}_{t}^{(l)}-\hat{\mathbf{h}}_{t}^{(l)}\right\|_{1}-\lambda \log \sigma\left(\cos\left(\mathbf{h}_{t}^{(l)},\hat{\mathbf{h}}_{t}^{(l)} \right)\right) \right]$
  - $T$ : number of time steps, $\sigma$ : sigmoid activation, $\cos (\cdot, \cdot)$ : cosine similarity
- Minimizing $\mathcal{L}^{(l)}$ is equivalent to minimizing the $\ell 1$ distance while maximizing the cosine similarity between the hidden representations
  - In particular, using both $\mathcal{L}_{\ell 1}$ and $\mathcal{L}_{\cos}$ together achieves better performance
  - $\lambda >0$ : controls the weight of the cosine similarity loss
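As a minimal sketch of Eq. 1, the per-layer loss can be written in plain Python. The representations are assumed to be lists of $T$ frames with $D$ floats each, the default `lam=1.0` is an illustrative assumption, and the function names are hypothetical. The sketch also assumes no frame is an all-zero vector, since the cosine similarity would then be undefined.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def layer_distill_loss(student, teacher, lam=1.0):
    """Eq. 1 for one layer: sum over frames of the L1 term plus lam * cosine term."""
    total = 0.0
    for h, h_hat in zip(student, teacher):
        dim = len(h)
        l1_term = sum(abs(a - b) for a, b in zip(h, h_hat)) / dim
        cos_term = -log_sigmoid(cosine_similarity(h, h_hat))
        total += l1_term + lam * cos_term
    return total


teacher = [[1.0, 0.0], [0.0, 2.0]]
loss_same = layer_distill_loss(teacher, teacher)
loss_opposite = layer_distill_loss([[-1.0, 0.0], [0.0, -2.0]], teacher)
```

Even a perfect prediction leaves a non-zero loss, because $-\log \sigma(1) > 0$; the cosine term only shrinks toward that floor as the similarity approaches 1.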
Parameter Initialization
- DistilHuBERT is initialized with HuBERT's CNN extractor and its first 2 transformer layers
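The initialization above amounts to copying a subset of the teacher's weights. A hedged sketch follows, assuming the teacher's parameters live in a flat name-to-value dictionary. The key prefixes `feature_extractor.` and `encoder.layers.` are hypothetical naming choices for illustration, not the actual HuBERT checkpoint layout.

```python
def init_student_from_teacher(teacher_state, num_layers=2,
                              cnn_prefix="feature_extractor.",
                              layer_prefix="encoder.layers."):
    """Keep the CNN extractor and the first `num_layers` transformer layers."""
    student_state = {}
    for name, value in teacher_state.items():
        if name.startswith(cnn_prefix):
            student_state[name] = value
        elif name.startswith(layer_prefix):
            # Parse the layer index that follows the prefix, e.g. "11" in
            # "encoder.layers.11.ffn.weight".
            layer_index = int(name[len(layer_prefix):].split(".")[0])
            if layer_index < num_layers:
                student_state[name] = value
    return student_state


# Toy teacher "checkpoint" with placeholder values instead of tensors.
teacher_state = {
    "feature_extractor.conv0.weight": 1,
    "encoder.layers.0.attn.weight": 2,
    "encoder.layers.1.ffn.weight": 3,
    "encoder.layers.2.ffn.weight": 4,
    "encoder.layers.11.ffn.weight": 5,
    "final_proj.weight": 6,
}
student_state = init_student_from_teacher(teacher_state)
```

Parsing the layer index as an integer, rather than matching on a string prefix, avoids accidentally keeping layer 11 when asking for layers 0 and 1.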
Reducing Computation for Distillation
- Let $L$ be the number of hidden layers of the self-supervised speech model to be distilled; the number of prediction heads can then be anywhere from $1$ to $L$
- Since neighboring layer representations may contain similar information, the student model predicts only specific layers of the teacher model
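Under the assumption that the per-layer losses of Eq. 1 are simply summed without extra weights, the overall objective for the selected layers can be written as below. The set symbol $\mathcal{S}$ is notation introduced here for illustration, with $\mathcal{S} = \{4, 8, 12\}$ matching the three heads described in the Introduction:
  $ \mathcal{L} = \sum_{l \in \mathcal{S}} \mathcal{L}^{(l)}, \quad \mathcal{S} \subseteq \{1, \dots, L\} $
With $|\mathcal{S}| = 3$ instead of $L = 12$, only three prediction heads and three loss terms need to be computed per step.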

3. Experiments

- Settings

Dataset : LibriSpeech, WSJ, AISHELL-1
Comparisons : DeCoAR 2.0, Wav2Vec, HuBERT

- Results

DistilHuBERT outperforms all compared methods except HuBERT

DistilHuBERT offers a better trade-off between model size and performance

Model Size and Inference Speed
- DistilHuBERT is 75% smaller than HuBERT, and its inference is 73% faster
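To make these percentages concrete, here is a small worked calculation. It assumes the 95M-parameter HuBERT quoted in the Method section as the teacher, and it reads "73% faster" as 1.73x throughput; both readings are assumptions for illustration, not figures reported in this review.

```python
def compressed_size(teacher_params, reduction):
    """Parameter count left after removing a fraction `reduction` of the teacher."""
    return teacher_params * (1.0 - reduction)


def relative_inference_time(speedup_fraction):
    """Time per utterance relative to the teacher, given 'X faster' as throughput gain."""
    return 1.0 / (1.0 + speedup_fraction)


student_params = compressed_size(95e6, 0.75)   # about 23.75M parameters
relative_time = relative_inference_time(0.73)  # about 0.58 of the teacher's time
```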

Ablation Study
- Removing any component degrades performance

Layer Selection
- Among the head outputs, the 8th and 12th layer heads carry content and semantic information

Additionally, the 4th layer preserves speaker identity

Knowledge Distillation with Different Datasets
- DistilHuBERT shows stable performance on other datasets as well
