[Paper Review] FitHuBERT: Going Thinner and Deeper for Knowledge Distillation of Speech Self-Supervised Learning

feVeRin 2025. 5. 8. 17:45

FitHuBERT: Going Thinner and Deeper for Knowledge Distillation of Speech Self-Supervised Learning

Self-supervised learning (SSL) is limited by its computational cost
FitHuBERT
- Improves inference time with a time-reduction layer
- Prevents performance degradation through hint-based distillation
Paper (INTERSPEECH 2022) : Paper Link

1. Introduction

Large-scale speech Self-Supervised Learning (SSL) can use speech-only data for pre-training, and the model can then be fine-tuned effectively with only a small amount of paired data
- Notably, HuBERT and Wav2Vec 2.0 achieve strong performance on a range of tasks such as Automatic Speech Recognition (ASR), Keyword Spotting (KS), and Automatic Speaker Verification (ASV)
  - BUT, these SSL models require substantial resources and incur computational overhead at inference
- Meanwhile, model compression can be performed with Knowledge Distillation, as in DistilHuBERT
  - BUT, this approach degrades performance on linguistic pattern recognition

-> FitHuBERT, a thinner and deeper speech SSL distillation method, is therefore proposed

FitHuBERT
- The CNN feature extractor is built in a channel-increasing manner using pointwise convolutions
- Hint-based distillation with layer-wise prediction heads guides the distillation of the Transformer
- A trainable time-reduction layer is additionally introduced for faster inference

< Overview of FitHuBERT >

A thinner & deeper speech SSL model that uses hint-based distillation and a time-reduction layer
As a result, it achieves faster inference and better performance than existing methods

2. Method

- Model Design

Maintaining Transformer Layers
- Existing distillation-based speech SSL models suffer performance degradation on linguistic pattern recognition tasks such as ASR
  1. For example, DistilHuBERT shows a 200% performance degradation relative to its teacher model
    - Compression is usually done by reducing the number of Transformer layers, which leaves the student too shallow to recognize diverse speech patterns
  2. FitHuBERT therefore aims for a thinner & deeper Transformer rather than a wider & shallower one
- First, to make the Transformer thin, the dimensions of both the self-attention and the FFN inner layer are reduced
  - That is, the paper aims to eliminate the FFN bottleneck structure, which accounts for a large share of the Transformer's computational cost
- In particular, with the 50Hz downsampling and 768-dimensional self-attention of HuBERT and Wav2Vec 2.0, 15s of speech can be treated as a short input
  - At 50Hz, 15s of speech corresponds to 15 x 50 = 750 frames, which is below the 768-dimensional width
  - That is, an SSL model does not need to maintain a parameter-heavy bottleneck structure to process such short speech
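
The two points above (a 15s utterance is a short input, and the FFN bottleneck is parameter-heavy) can be checked with simple arithmetic. The sketch below is only illustrative: the 3072-wide FFN inner layer is assumed from the commonly used Base configuration, and the 480/480 thin layer is a hypothetical student size, not a value taken from this review.

```python
def num_frames(seconds: float, frame_rate_hz: int = 50) -> int:
    """Number of frames the Transformer sees for `seconds` of speech."""
    return int(seconds * frame_rate_hz)


def ffn_params(d_model: int, d_inner: int) -> int:
    """Weights and biases of a two-layer FFN: d_model -> d_inner -> d_model."""
    return (d_model * d_inner + d_inner) + (d_inner * d_model + d_model)


frames = num_frames(15)                # 15 s at 50 Hz
teacher_ffn = ffn_params(768, 3072)    # assumed Base-size FFN with a wide inner layer
thin_ffn = ffn_params(480, 480)        # hypothetical thin FFN without the bottleneck

print(frames, frames < 768)
print(teacher_ffn, thin_ffn, round(teacher_ffn / thin_ffn, 1))
```

Because the FFN parameter count scales with the product of the two widths, shrinking the inner layer is where most of the savings come from in this toy comparison.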
Channel-Increasing CNNs
- Existing speech SSL models do not take into account the ability of CNNs to aggregate speech features along the time axis
- In ASR in particular, channel-increasing CNNs are used to downsample mel-spectrograms, which suggests that the channels in the lower CNN layers are unnecessary
  - That is, the lower-layer channels can be reduced for model compression
- Accordingly, the first CNN layer of FitHuBERT is reduced by a factor of 4 relative to the teacher
  1. The channel count doubles each time the convolution filter kernel size decreases, so the final channel count matches the teacher's
  2. In addition, 2 pointwise convolutions are added before the channels are doubled
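
The channel-doubling rule above can be sketched as a small schedule function. This is a minimal illustration, assuming the teacher uses the common 7-layer feature extractor with kernel sizes (10, 3, 3, 3, 3, 2, 2) and 512 channels per layer; neither value is stated in this review, and the pointwise convolutions are left out.

```python
# Assumed teacher configuration (not given in the review itself).
TEACHER_KERNELS = [10, 3, 3, 3, 3, 2, 2]
TEACHER_CHANNELS = 512


def student_channels(kernels, teacher_channels, reduction=4):
    """Start at teacher_channels / reduction and double the width
    every time the kernel size drops from one layer to the next."""
    channels = []
    current = teacher_channels // reduction
    previous_kernel = kernels[0]
    for kernel in kernels:
        if kernel < previous_kernel:
            current *= 2
        channels.append(current)
        previous_kernel = kernel
    return channels


schedule = student_channels(TEACHER_KERNELS, TEACHER_CHANNELS)
print(schedule)
```

Under these assumptions the kernel size decreases twice (10 to 3, then 3 to 2), so a 4x reduction in the first layer is recovered exactly by the last layer.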

- Hint-based Knowledge Distillation

FitHuBERT is a small model, so it is prone to unstable training and overfitting
- To transfer knowledge from the teacher stably, distillation should be applied at every layer of the teacher
  - In particular, distilling the intermediate layers acts as a hint that guides the student toward the teacher's final representation and provides knowledge about the intermediate process
- The student is therefore built with the same number of layers as the teacher, and knowledge distillation is performed layer-to-layer
  1. First, a layer-wise prediction head is attached to the output of each Transformer layer
    - Each head consists of a temporal deconvolution layer and a fully-connected layer
  2. During training, each of the 12 heads matches the time length and self-attention dimension between the teacher and the student
  3. After distillation, only the prediction head of the last layer remains and is used in the fine-tuning stage
- The paper uses a simple MSE loss to match every layer representation of the teacher and student, together with a hint-based knowledge distillation loss:
  (Eq. 1) $\mathcal{L}_{feat}=\text{MSE}\left(h_{T}^{(N)},f_{N}(h_{S}^{(N)})\right)$
  (Eq. 2) $\mathcal{L}_{hint}=\sum_{l=1}^{N-1}\text{MSE}\left(h_{T}^{(l)},f_{l}(h_{S}^{(l)})\right)$
  (Eq. 3) $\mathcal{L}_{KD}=\mathcal{L}_{feat}+\lambda\mathcal{L}_{hint}$
  - $h_{T},h_{S}$ : Transformer layer representations of the teacher and student, respectively
  - $N$ : number of Transformer layers, $f$ : mapping function of each layer's prediction head, $\lambda<1$ : constant
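
Eq. 1 to Eq. 3 can be sketched in plain Python as follows. This is a toy illustration rather than the authors' implementation: representations are nested lists (time x dimension), the `heads` are identity placeholders standing in for the deconvolution + fully-connected prediction heads, and `lam=0.5` is an arbitrary value chosen only to satisfy $\lambda<1$.

```python
def mse(a, b):
    """Mean squared error between two equally shaped 2-D lists (time x dim)."""
    flat_a = [x for row in a for x in row]
    flat_b = [x for row in b for x in row]
    return sum((x - y) ** 2 for x, y in zip(flat_a, flat_b)) / len(flat_a)


def kd_loss(teacher_states, student_states, heads, lam=0.5):
    """L_KD = L_feat + lam * L_hint over N layer representations."""
    n = len(teacher_states)
    # f_l(h_S^(l)): map each student layer output into the teacher's space.
    projected = [f(h) for f, h in zip(heads, student_states)]
    l_feat = mse(teacher_states[-1], projected[-1])            # Eq. 1, layer N
    l_hint = sum(mse(teacher_states[l], projected[l])          # Eq. 2, layers 1..N-1
                 for l in range(n - 1))
    return l_feat + lam * l_hint                               # Eq. 3


# Toy example with N = 3 layers, one frame, two dimensions.
teacher = [[[1.0, 1.0]] for _ in range(3)]
student = [[[0.0, 0.0]] for _ in range(3)]
identity_heads = [lambda h: h for _ in range(3)]   # placeholder prediction heads

loss = kd_loss(teacher, student, identity_heads, lam=0.5)
print(loss)
```

In the toy example every layer contributes an MSE of 1, so the last layer counts fully while the two hint layers are down-weighted by `lam`.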

- Time-Reduction Layer

Maintaining the number of Transformer layers offers no benefit for inference time
- The paper therefore adds a trainable time-reduction layer, a simple temporal convolution, before the Transformer layers
  1. The self-attention module of each Transformer layer has $\mathcal{O}(N^{2})$ complexity
    - $N$ : input time length (here, not the number of layers)
  2. The input sequence length can be shrunk along the time axis to $N/k$, where the time-reduction ratio $k$ corresponds to the stride of the time-reduction layer
- As a result, self-attention complexity drops to $1/k^{2}$ of the original, since $(N/k)^{2}=N^{2}/k^{2}$, which speeds up Transformer inference
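
The effect of the time-reduction ratio can be sketched with a toy strided temporal convolution. The code below is a simplified stand-in, not the layer from the paper: it assumes the kernel size equals the stride $k$, uses one scalar weight per time offset instead of learned weight matrices, and defaults to averaging weights.

```python
def time_reduce(frames, k, weights=None):
    """Strided temporal convolution over a list of frame vectors.

    Assumes kernel size == stride == k and one scalar weight per time
    offset (a simplification; a trained layer would learn its weights).
    """
    if weights is None:
        weights = [1.0 / k] * k          # stand-in for trained weights
    dim = len(frames[0])
    reduced = []
    for start in range(0, len(frames) - k + 1, k):
        window = frames[start:start + k]
        reduced.append([
            sum(w * frame[d] for w, frame in zip(weights, window))
            for d in range(dim)
        ])
    return reduced


def attention_pairs(length):
    """Number of query-key pairs scored by one self-attention layer."""
    return length * length


frames = [[float(t), 1.0] for t in range(750)]   # 15 s at 50 Hz, toy 2-dim features
reduced = time_reduce(frames, k=2)

ratio = attention_pairs(len(reduced)) / attention_pairs(len(frames))
print(len(frames), len(reduced), ratio)
```

With $k=2$ the sequence is halved, so the number of attention score pairs falls to a quarter, matching the $1/k^{2}$ factor.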

3. Experiments

- Settings

Dataset : LibriSpeech
Comparisons : HuBERT, Wav2Vec 2.0, DistilHuBERT

- Results

Overall, FitHuBERT shows the best performance

CNN Architecture Design
- Using pointwise convolutions further improves performance

Number of Layers for Hints
- Using all the hints provides the student with more knowledge

Trade-Off of Time-Reduction Layer
- The best performance is achieved at time-reduction ratio $k=1$, i.e., without time reduction

