[Paper Review] LightHuBERT: Lightweight and Configurable Speech Representation Learning with Once-for-All Hidden-Unit BERT

feVeRin 2025. 5. 14. 17:42

LightHuBERT: Lightweight and Configurable Speech Representation Learning with Once-for-All Hidden-Unit BERT

Self-supervised speech representation learning is hard to deploy in low-resource settings because it relies on storage-intensive Transformers
LightHuBERT
- Prunes structured parameters with a Once-for-All Transformer compression framework
- Transfers HuBERT's contextualized latent representations through two-stage distillation
Paper (INTERSPEECH 2022) : Paper Link

1. Introduction

Self-supervised speech representations such as Speech2Vec and Wav2Vec 2.0 are learned from unannotated data and perform well across a wide range of speech processing tasks
- BUT, memory constraints make these pre-trained models hard to use on real-world devices
  - So the pre-trained model has to be compressed to meet the resource constraints
- Model compression has to account for the following:
  1. Lightweight, sparse networks suffer a large performance drop
  2. Existing methods such as DistilHuBERT cannot easily adapt to different resource constraints

-> The paper therefore proposes LightHuBERT, which supports compression to various sizes within a tolerable training time

LightHuBERT
- Builds a weight-sharing Transformer supernet based on Once-for-All (OFA) to support various architecture configurations
- Introduces a pre-training distillation loss with masked self-supervised learning
  - With this loss, the student predicts the contextualized representations of the pre-trained HuBERT
- In addition, a two-stage training strategy mitigates the performance degradation

< Overview of LightHuBERT >

A lightweight, compressed speech representation framework based on OFA
As a result, it outperforms existing compression methods

2. Method

LightHuBERT aims to reduce the model size of the Transformer encoder used in speech pre-training with a task-agnostic compression framework
- To do so, it adopts a Once-for-All (OFA) Transformer, which prunes structured groups of weights and supports automatic architecture search
- In addition, it introduces a two-stage training strategy to transfer the contextualized latent representations to the sub-Transformers

- Once-for-All Transformer

The Once-for-All Transformer is a Transformer architecture that contains many sub-Transformers, where the different architectures share weights in a scaling manner
- The paper builds the Once-for-All Transformer on 5 variable dimensions: embedding dimension, attention dimension, head number, FFN ratio, and network depth
  - The attention dimension (key, query, value matrices) is constrained to $64\times$ the number of heads
- Because interference between small and large networks hurts the performance of the large network, the paper considers 2 supernets of different sizes, as listed in the paper's table
  1. Both supernets retain the original Transformer blocks
  2. The reason is that Transformer-based speech pre-training relies on deep networks, such as the 12-layer and 24-layer HuBERT, to learn from large-scale unlabeled data
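
The search space and the weight sharing described above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the values in `SEARCH_SPACE` are hypothetical placeholders rather than the paper's table, each dimension is assumed to take one value shared across layers, and `slice_weight` assumes that a sub-Transformer reuses the leading block of each supernet weight matrix. Only the `64 x heads` constraint comes from the text.

```python
import random

# Hypothetical search space: illustrative values only, not the paper's table.
SEARCH_SPACE = {
    "embed_dim": [512, 640, 768],
    "num_heads": [8, 10, 12],
    "ffn_ratio": [3.5, 4.0],
    "depth": [12],
}
HEAD_DIM = 64  # attention dimension is constrained to 64 x number of heads


def largest_config(space):
    """Take the maximum of every dimension (the full supernet)."""
    cfg = {name: max(choices) for name, choices in space.items()}
    cfg["attn_dim"] = HEAD_DIM * cfg["num_heads"]
    return cfg


def sample_config(space, rng):
    """Sample one sub-Transformer configuration, uniformly per dimension."""
    cfg = {name: rng.choice(choices) for name, choices in space.items()}
    cfg["attn_dim"] = HEAD_DIM * cfg["num_heads"]
    return cfg


def slice_weight(weight, out_dim, in_dim):
    """Assumed weight sharing: reuse the leading out_dim x in_dim block."""
    return [row[:in_dim] for row in weight[:out_dim]]


rng = random.Random(0)
full = largest_config(SEARCH_SPACE)
sub = sample_config(SEARCH_SPACE, rng)

# Supernet query projection (attn_dim x embed_dim), zero-filled for the demo.
w_q = [[0.0] * full["embed_dim"] for _ in range(full["attn_dim"])]
w_q_sub = slice_weight(w_q, sub["attn_dim"], sub["embed_dim"])
print(sub, len(w_q_sub), len(w_q_sub[0]))
```

Because every subnet reads its weights from the same supernet matrices, training one subnet also updates parameters that the other subnets use, which is where the small/large interference mentioned above comes from.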

- Pre-Training Distillation

The paper uses masking-based pre-training distillation to transfer the knowledge of the pre-trained model
- Spans of the latent speech representation are masked in the student model, and the student is trained to predict the teacher model's output at the masked positions
- In particular, following Data2Vec, the training target is a contextualized representation: the average of the top-$k$ normalized latent representations
  - $k=8$
- As a result, with the pre-trained speech model as the teacher, the student minimizes the $L_1$ distance over the masked time steps $\mathcal{M}$ given a downsampled audio sequence $x$:
  (Eq. 1) $ \mathcal{L}\left(f^{t}(x),f^{s}(\hat{x})\right)=\frac{1}{|\mathcal{M}|}\sum_{i\in\mathcal{M}}\left| \bar{f}_{i}^{t}(x)-f_{i}^{s}(\hat{x})\right|$
  - $f^{t}(\cdot)$ : teacher, $f^{s}(\cdot)$ : student, $\hat{x}$ : $x$ masked with masking probability $p=0.65$, $\bar{f}_{i}^{t}(\cdot)$ : training target at the $i$-th time step
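
As a rough illustration of the target and of Eq. 1, the sketch below builds the top-$k$ averaged target and evaluates the masked $L_1$ loss in plain Python. Several details are assumptions made for the example: the per-frame normalization in `layer_norm`, reading $|\cdot|$ as a sum of absolute differences over feature dimensions, and the toy tensors (2 teacher layers, 3 frames, 2 features, `k=1` instead of the $k=8$ used in the post). The function names are hypothetical.

```python
import math


def layer_norm(vec, eps=1e-5):
    """Normalize one feature vector to zero mean and unit variance (assumed scheme)."""
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]


def topk_average_target(teacher_layers, k):
    """Average the normalized outputs of the top-k teacher layers.

    teacher_layers: list of layers, each a list of T frames of D floats.
    """
    top = teacher_layers[-k:]
    num_frames, dim = len(top[0]), len(top[0][0])
    target = [[0.0] * dim for _ in range(num_frames)]
    for layer in top:
        for t, frame in enumerate(layer):
            for d, value in enumerate(layer_norm(frame)):
                target[t][d] += value / len(top)
    return target


def masked_l1_loss(target, student_out, masked_steps):
    """Eq. 1: mean over masked time steps of the L1 distance to the target."""
    total = 0.0
    for i in masked_steps:
        total += sum(abs(a - b) for a, b in zip(target[i], student_out[i]))
    return total / len(masked_steps)


# Toy teacher outputs: 2 layers, 3 frames, 2 features.
layers = [
    [[1.0, 3.0], [2.0, 4.0], [0.0, 2.0]],
    [[3.0, 1.0], [5.0, 7.0], [2.0, 0.0]],
]
target = topk_average_target(layers, k=1)
student = [[0.0, 0.0] for _ in range(3)]  # untrained student output
print(masked_l1_loss(target, student, masked_steps=[0, 2]))
```

Only the masked time steps contribute to the loss, so the student has to infer the teacher's representation at those positions from the surrounding unmasked context.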

- Two-Stage Training

To improve the performance of the weight-sharing architectures in the Once-for-All Transformer
- The paper introduces the following two-stage training:
  1. Stage 1 - Distillation
    - The largest architecture $a_{\text{Largest}}$ of the Once-for-All Transformer is trained from scratch with the pre-training distillation loss
  2. Stage 2 - Once-for-All Training
    - Once-for-All training is run on the supernet initialized with the distilled weights
    - In particular, a subnet is randomly sampled at every forward pass during supernet training
- The trained weights derived in Stage 1 serve as the initialization for Stage 2
  - Here, pre-training distillation supports subnet training through contextualized representations that aggregate features over a wide range of receptive fields and at various resolutions
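
The control flow of the two stages can be sketched as a schedule of which architecture is trained at each step. This is only a skeleton under assumptions: `two_stage_schedule` is a hypothetical helper, the search space values are illustrative, and the actual forward/backward pass with the distillation loss is left out.

```python
import random


def two_stage_schedule(search_space, stage1_steps, stage2_steps, seed=0):
    """Yield (stage, config) pairs: the architecture trained at each step."""
    rng = random.Random(seed)
    largest = {name: max(choices) for name, choices in search_space.items()}
    # Stage 1: distill the teacher into the largest architecture only.
    for _ in range(stage1_steps):
        yield 1, dict(largest)
    # Stage 2: the supernet keeps the Stage 1 weights, and a random subnet
    # is sampled for every forward pass.
    for _ in range(stage2_steps):
        yield 2, {name: rng.choice(choices) for name, choices in search_space.items()}


# Hypothetical search space, for illustration only.
space = {"embed_dim": [384, 512, 640], "num_heads": [6, 8, 10], "depth": [10, 11, 12]}
for stage, cfg in two_stage_schedule(space, stage1_steps=2, stage2_steps=3):
    # A real training loop would compute the distillation loss on `cfg` here.
    print(stage, cfg)
```

Starting Stage 2 from the distilled largest architecture means the sampled subnets begin from weights that already match the teacher, instead of learning the target from scratch while also competing for shared parameters.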

3. Experiments

- Settings

Dataset : LibriSpeech, LibriLight
Comparisons : HuBERT, DistilHuBERT

- Results

Automatic Speech Recognition
- On the ASR task, LightHuBERT achieves the best performance

The Once-for-All Transformer yields well-trained sub-architectures

Universal Representation Evaluation
- On the SUPERB benchmark, LightHuBERT cuts 29% of the parameters without a large performance drop

Ablation Study
- Skipping either training stage degrades performance
