[Paper Review] HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units

feVeRin 2025. 4. 13. 09:20

HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units

Self-supervised speech representation learning has to cope with the following problems:
- Each input utterance contains multiple sound units
- No lexicon of input sound units is available during the pre-training phase
- Sound units have variable lengths with no explicit segmentation
HuBERT
- Uses an offline clustering step to provide aligned target labels for a BERT-like prediction loss
- Applies the prediction loss only over masked regions, which pushes the model to learn a combined acoustic and language model over continuous inputs
Paper (TASLP 2021) : Paper Link

1. Introduction

A high-fidelity speech representation includes disentangled aspects of the spoken content as well as non-lexical information such as speaker identity, emotion, and hesitation
- Complete situational understanding additionally requires modeling the structured noise that interleaves and overlaps with the speech signal
- Self-supervised learning such as Wav2Vec can be adopted to obtain such high-fidelity representations
  - Self-supervised learning does not rely on linguistic resources such as labels or annotations during training, so it can learn universal representations
- In particular, self-supervised representations have the following advantages over Pseudo-Labeling (PL):
  1. PL forces the student model to merely mimic the teacher model, so performance is bounded by the teacher's supervised data size and annotation quality
    - In contrast, self-supervised methods compress information into learned latents that represent the entire input signal
  2. In PL, the teacher's supervised data focuses the model on a single downstream task
    - Self-supervised features can be applied to a wide range of downstream applications
- BUT, applying self-supervised learning to speech signals faces the following limitations:
  1. Each input utterance contains multiple sounds, so the instance classification assumption used in common pre-training approaches does not fit
  2. There is no prior lexicon of discrete sound units during pre-training, which makes a predictive loss hard to use
  3. The boundaries between sound units are unknown, which complicates masked prediction pre-training

-> Therefore, the paper proposes HuBERT, a self-supervised learning method for the speech domain that addresses the limitations above

HuBERT
- Uses an offline clustering step to generate noisy labels for BERT-like pre-training
  1. The BERT model consumes masked continuous speech features and predicts pre-determined cluster assignments
  2. The predictive loss is applied only over masked regions, forcing the model to learn good high-level representations of unmasked inputs in order to correctly infer the targets of masked ones
- On this basis, it learns both an acoustic model and a language model from continuous inputs
  1. It performs acoustic modeling by mapping unmasked inputs to meaningful continuous latent representations
  2. It captures long-range temporal relations between learned representations to reduce the prediction error

< Overview of HuBERT >

Self-supervised speech representation that combines an offline clustering step with BERT-like training
As a result, it outperforms the compared prior methods

2. Method

- Learning the Hidden Units for HuBERT

In semi-supervised learning, an acoustic model trained on text-speech pairs provides pseudo-phonetic labels for each frame via forced alignment
- In contrast, self-supervised representation learning has access to speech-only data
  - Even so, discrete latent variable models such as $k$-means and Gaussian Mixture Models (GMMs) infer hidden units that exhibit non-trivial correlation with the underlying acoustic units
- Therefore, HuBERT uses an acoustic unit discovery model to provide frame-level targets
  1. Let $X$ be a speech utterance of $T$ frames, $X = [x_{1}, \ldots, x_{T}]$
  2. Then the discovered hidden units are $h(X) = Z = [z_{1}, \ldots, z_{T}]$
    - $z_{t} \in [C]$ : $C$-class categorical variable, $h$ : clustering model such as $k$-means
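
The mapping $h(X) = Z$ can be sketched with a plain $k$-means over frame-level features. This is a minimal illustration and not the paper's pipeline: the helper names `kmeans_fit` and `kmeans_assign`, the random 13-dimensional features standing in for MFCCs, and $C = 8$ are all assumptions made for the example.

```python
import numpy as np

def kmeans_assign(features, centroids):
    """h(X): map each frame x_t to the index z_t of its nearest centroid."""
    # Squared Euclidean distance between every frame and every centroid.
    dists = ((features[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)

def kmeans_fit(features, num_clusters, num_iters=20, seed=0):
    """Plain Lloyd's k-means; returns centroids of shape (C, D)."""
    rng = np.random.default_rng(seed)
    # Initialize centroids from randomly chosen frames.
    init_idx = rng.choice(len(features), size=num_clusters, replace=False)
    centroids = features[init_idx].copy()
    for _ in range(num_iters):
        labels = kmeans_assign(features, centroids)
        for c in range(num_clusters):
            members = features[labels == c]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                centroids[c] = members.mean(axis=0)
    return centroids

# Toy utterance: T = 200 frames of 13-dim features (a stand-in for MFCCs).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 13))
C = 8
centroids = kmeans_fit(X, num_clusters=C)
Z = kmeans_assign(X, centroids)  # Z = [z_1, ..., z_T], each z_t in {0, ..., C-1}
```

Each entry of `Z` is one frame-level target, so `Z` has the same length as the utterance and is aligned with it frame by frame.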

- Representation Learning via Masked Prediction

Let $M \subset [T]$ be the set of indices to be masked for a length-$T$ sequence $X$, and let $\tilde{X} = r(X, M)$ be the corrupted version of $X$ in which $x_{t}$ is replaced with a mask embedding $\tilde{x}$ whenever $t \in M$
- The masked prediction model $f$ takes $\tilde{X}$ as input and predicts a distribution over target indices at each timestep, $p_{f}(\cdot | \tilde{X}, t)$
  - Masked prediction requires two decisions: the masking strategy and where to apply the prediction loss
- For the masking strategy, the paper adopts the approach of Wav2Vec 2.0
  - That is, $p\%$ of the timesteps are randomly selected as start indices, and a span of $l$ steps is masked from each
- For the prediction loss,
  1. Let $L_{m}$ and $L_{u}$ denote the cross-entropy computed over masked and unmasked timesteps, respectively
  2. Then $L_{m}$ is:
    (Eq. 1) $L_{m}(f; X, M, Z) = \sum_{t \in M} \log p_{f}(z_{t} | \tilde{X}, t)$
    - $L_{u}$ has the same form as (Eq. 1) except that it sums over $t \notin M$
- The final loss is a weighted sum of the two terms, $L = \alpha L_{m} + (1 - \alpha) L_{u}$
  1. With $\alpha = 0$, the loss is computed only over unmasked timesteps, which in the paper's setup limits the learning process to mimicking the clustering model
  2. With $\alpha = 1$, the loss is computed only over masked timesteps, where the model has to predict the targets of unseen frames from context, analogous to language modeling
    - This forces the model to learn both the acoustic representation of unmasked segments and the long-range temporal structure of speech data
    - The paper hypothesizes that the $\alpha = 1$ setup is more resilient to the quality of cluster targets
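
The span masking and the weighted loss can be sketched as follows. This is a hedged illustration: `sample_span_mask`, `hubert_style_loss`, the span length of 10, and the random logits standing in for the model output are assumptions for the example. The code also minimizes the negative of the log-likelihood sum in (Eq. 1), which is the usual cross-entropy convention.

```python
import numpy as np

def sample_span_mask(num_frames, p=0.08, span=10, seed=0):
    """Pick roughly a fraction p of timesteps as span starts, mask `span` steps each."""
    rng = np.random.default_rng(seed)
    mask = np.zeros(num_frames, dtype=bool)
    starts = np.flatnonzero(rng.random(num_frames) < p)
    for s in starts:
        mask[s:s + span] = True  # spans may overlap and are clipped at the end
    return mask

def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def hubert_style_loss(logits, targets, mask, alpha=1.0):
    """alpha * L_m + (1 - alpha) * L_u, written as a negative log-likelihood."""
    log_probs = log_softmax(logits)                         # shape (T, C)
    frame_ll = log_probs[np.arange(len(targets)), targets]  # log p_f(z_t | X~, t)
    loss_masked = -frame_ll[mask].sum()                     # sum over t in M
    loss_unmasked = -frame_ll[~mask].sum()                  # sum over t not in M
    return alpha * loss_masked + (1.0 - alpha) * loss_unmasked

T, C = 100, 8
rng = np.random.default_rng(1)
logits = rng.normal(size=(T, C))      # stand-in for the model's per-frame logits
targets = rng.integers(0, C, size=T)  # stand-in for cluster assignments Z
mask = sample_span_mask(T)
loss_masked_only = hubert_style_loss(logits, targets, mask, alpha=1.0)
loss_unmasked_only = hubert_style_loss(logits, targets, mask, alpha=0.0)
loss_mixed = hubert_style_loss(logits, targets, mask, alpha=0.5)
```

Setting `alpha=1.0` reproduces the masked-only setting described above, while `alpha=0.0` scores only the frames the model can see directly.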

- Learning with Cluster Ensembles

Multiple clustering models are introduced to improve target quality
- Compared with an individual clustering model, a cluster ensemble can provide complementary information that facilitates representation learning
- For example, an ensemble of $k$-means models with different codebook sizes creates targets of different granularity, from manner classes (vowel/consonant) to sub-phone states (senones)
  1. Let $Z^{(k)}$ be the target sequence generated by the $k$-th clustering model
  2. Then $L_{m}$ is rewritten as:
    (Eq. 2) $L_{m}(f; X, \{Z^{(k)}\}_{k}, M) = \sum_{t \in M} \sum_{k} \log p_{f}^{(k)}(z_{t}^{(k)} | \tilde{X}, t)$
    - The unmasked loss $L_{u}$ is rewritten in the same way
- In addition, ensembling can be combined with Product Quantization (PQ)
  1. The feature space is partitioned into multiple subspaces, and each subspace is quantized separately
  2. PQ makes Euclidean distance-based quantization ($k$-means) effective for high-dimensional features and for heterogeneous features whose scale differs across subspaces
    - The theoretical size of the target space is the product of all codebook sizes
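
A small sketch of product quantization, assuming three equally sized subspaces with codebook sizes 4, 8, and 16. The sizes, the helper names, and the random codebooks are hypothetical; in practice each codebook would be fitted with $k$-means on its own subspace.

```python
import numpy as np

def nearest_code(features, codebook):
    """Index of the nearest codeword (squared Euclidean distance) per frame."""
    dists = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)

def product_quantize(features, codebooks):
    """Quantize each subspace separately; returns codes of shape (T, num_subspaces)."""
    sub_dims = [cb.shape[1] for cb in codebooks]
    split_points = np.cumsum(sub_dims)[:-1]
    sub_features = np.split(features, split_points, axis=1)
    codes = [nearest_code(f, cb) for f, cb in zip(sub_features, codebooks)]
    return np.stack(codes, axis=1)

rng = np.random.default_rng(0)
T = 50
features = rng.normal(size=(T, 12))
# Hypothetical codebooks: 3 subspaces of 4 dims each (random, not fitted).
codebooks = [rng.normal(size=(size, 4)) for size in (4, 8, 16)]
codes = product_quantize(features, codebooks)

# Theoretical target space size = product of codebook sizes = 4 * 8 * 16.
target_space_size = int(np.prod([len(cb) for cb in codebooks]))
```

Each column of `codes` can serve as one target sequence $Z^{(k)}$ in (Eq. 2), so the model predicts one small codebook per subspace instead of one label out of the full product space.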

- Iterative Refinement of Cluster Assignments

Cluster assignments can be refined throughout the learning process to improve the representation
- A pre-trained model provides better representations than raw acoustic features such as MFCCs
- Therefore, new clusters can be generated by training a discrete latent model over the learned latent representations
  - Training then proceeds with the newly discovered units

- Implementation

The pre-trained model follows the Wav2Vec 2.0 architecture and consists of a convolutional waveform encoder, a BERT encoder, a projection layer, and a code embedding layer
- Three HuBERT configurations are considered: BASE, LARGE, and X-LARGE
  1. BASE and LARGE follow the Wav2Vec 2.0 BASE/LARGE architectures, and X-LARGE expands the model to about 1B parameters
  2. The waveform encoder consists of seven 512-channel layers with kernel widths $[10, 3, 3, 3, 3, 2, 2]$ and strides $[5, 2, 2, 2, 2, 2, 2]$
  3. The BERT encoder consists of transformer blocks that follow the configurations in the table below
- The convolutional waveform encoder generates a feature sequence at a 20ms frame rate for audio sampled at 16kHz
  1. That is, the CNN encoder down-samples by a factor of $320\times$
  2. The encoded audio features are then randomly masked, and the BERT encoder takes the masked sequence as input and outputs a feature sequence $[o_{1}, \ldots, o_{T}]$
  3. The distribution over codewords is parameterized as:
    (Eq. 3) $p_{f}^{(k)}(c | \tilde{X}, t) = \frac{\exp(sim(A^{(k)} o_{t}, e_{c}) / \tau)}{\sum_{c' = 1}^{C} \exp(sim(A^{(k)} o_{t}, e_{c'}) / \tau)}$
    - $A$ : projection matrix, $e_{c}$ : embedding of codeword $c$
    - $sim(\cdot, \cdot)$ : cosine similarity, $\tau = 0.1$ : scaling factor
  4. When a cluster ensemble is used, one projection matrix $A^{(k)}$ is applied for each clustering model $k$
- After HuBERT pre-training, the Connectionist Temporal Classification (CTC) loss is used for ASR fine-tuning of all model weights except the convolutional audio encoder, which stays frozen
  - The projection layer is removed and replaced with a randomly initialized softmax layer
  - The CTC target vocabulary includes the 26 English characters, a space token, an apostrophe, and a special CTC blank symbol
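
The frame-rate arithmetic and (Eq. 3) can be checked with a short sketch. The convolution is assumed to use no padding, and the dimensions (768-dim $o_{t}$, 256-dim projection, 100 codewords) together with the random weights are illustrative assumptions, not values taken from this review.

```python
import numpy as np

kernels = [10, 3, 3, 3, 3, 2, 2]
strides = [5, 2, 2, 2, 2, 2, 2]
sample_rate = 16_000

# Overall down-sampling factor is the product of the strides.
downsampling = int(np.prod(strides))                # 5 * 2**6 = 320
frame_shift_ms = 1000 * downsampling / sample_rate  # 320 / 16000 s = 20 ms

def num_output_frames(num_samples):
    """Frames produced by the conv stack, assuming no padding."""
    length = num_samples
    for k, s in zip(kernels, strides):
        length = (length - k) // s + 1
    return length

def codeword_distribution(o_t, A, E, tau=0.1):
    """Eq. 3: softmax over cosine similarities between A o_t and each e_c."""
    proj = A @ o_t
    sims = (E @ proj) / (np.linalg.norm(E, axis=1) * np.linalg.norm(proj))
    logits = sims / tau
    logits = logits - logits.max()  # subtract the max for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

rng = np.random.default_rng(0)
o_t = rng.normal(size=768)       # one transformer output frame
A = rng.normal(size=(256, 768))  # projection matrix A^(k)
E = rng.normal(size=(100, 256))  # codeword embeddings e_1, ..., e_C
probs = codeword_distribution(o_t, A, E)
frames_per_second = num_output_frames(sample_rate)
```

Because cosine similarity lies in $[-1, 1]$, dividing by $\tau = 0.1$ stretches the logits to $[-10, 10]$, which lets the softmax become reasonably peaked.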

3. Experiments

- Settings

Dataset : LibriSpeech, Libri-light
Comparisons : Wav2Vec 2.0, Discrete BERT, DeCoAR 2.0, SlimIPL

- Results

HuBERT achieves strong performance even in low-resource settings

HuBERT also achieves the best performance in high-resource settings

Analysis: $K$-means Stability
- Increasing the amount of data used to fit the $K$-means model improves PNMI
- PNMI improves further when 500 clusters are used

Analysis: Clustering Quality Across Layers and Iterations
- MFCC achieves (cluster purity, phone purity, PNMI) of $(0.099, 0.335, 0.225)$ for $K = 100$ and $(0.031, 0.356, 0.287)$ for $K = 500$
- Both BASE-it1 and BASE-it2 yield better clustering than MFCC with the same number of clusters
- BASE-it2 features improve along the layers, whereas BASE-it1 has its best features in the middle layers, around the 6th layer
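
The three metrics can be computed from frame-level phone labels and cluster labels. The sketch below assumes the paper's definitions, which this review does not restate: phone purity is $\sum_{z} \max_{y} p(y, z)$, cluster purity is $\sum_{y} \max_{z} p(y, z)$, and PNMI is $I(y; z) / H(y)$. The function name and toy labels are hypothetical.

```python
import numpy as np

def clustering_metrics(phones, clusters):
    """Return (cluster purity, phone purity, PNMI) from frame-level labels."""
    num_phones = phones.max() + 1
    num_clusters = clusters.max() + 1
    # Empirical joint distribution p(y, z) over phone y and cluster z.
    joint = np.zeros((num_phones, num_clusters))
    np.add.at(joint, (phones, clusters), 1.0)
    joint /= joint.sum()
    p_y = joint.sum(axis=1)
    p_z = joint.sum(axis=0)
    phone_purity = joint.max(axis=0).sum()    # sum_z p(z) * max_y p(y | z)
    cluster_purity = joint.max(axis=1).sum()  # sum_y p(y) * max_z p(z | y)
    nonzero = joint > 0
    ratio = joint[nonzero] / np.outer(p_y, p_z)[nonzero]
    mutual_info = (joint[nonzero] * np.log(ratio)).sum()
    entropy_y = -(p_y[p_y > 0] * np.log(p_y[p_y > 0])).sum()
    return cluster_purity, phone_purity, mutual_info / entropy_y

# Toy labels: clusters that match phones exactly vs. a single cluster for all frames.
phones = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])
perfect = clustering_metrics(phones, phones.copy())
collapsed = clustering_metrics(phones, np.zeros_like(phones))
```

A single collapsed cluster still has cluster purity 1 but PNMI 0, which illustrates why the three numbers are reported together.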

Ablation: The Importance of Predicting Masked Features
- When learning from bad cluster assignments, computing the loss only over masked regions gives the best performance
- BUT, as clustering quality improves, the model suffers less from also computing the loss on unmasked frames

Ablation: The Effect of Cluster Ensembles
- Using a cluster ensemble gives better performance than a single $k$-means clustering

Ablation: Impact of Hyperparameters
- The portion of frames selected as mask starts is optimal at $p = 8\%$
- Increasing the batch size improves performance substantially

(Left) Performance by masking probability $p$ (Right) Performance by batch size

For $k$-means models with $C = \{50, 100\}$, training for more steps consistently helps
