[Paper Review] Multilingual DistilWhisper: Efficient Distillation of Multi-Task Speech Models via Language-Specific Experts

feVeRin 2025. 4. 14. 17:42

Whisper still performs poorly on under-represented languages
Multilingual DistilWhisper
- Applies knowledge distillation from Whisper-Large-V2
- Lightweight, modular ASR fine-tuning via language-specific experts
Paper (ICASSP 2024) : Paper Link

1. Introduction

Whisper shows strong performance on the Automatic Speech Recognition (ASR) task
- Compared with Wav2Vec 2.0, Whisper generalizes well to unseen domains, but there is a substantial performance gap between the large and small models on each language set
  - In particular, the large model is 2-3x slower than the small version
- Meanwhile, for efficient inference, Knowledge Distillation (KD) can be performed from a large multilingual teacher model to a smaller model
  1. BUT, applying this KD to Whisper-Large-V2 requires access to information that is unavailable
  2. Here, Language-Specific (LS) modules can be introduced, which allow low-computation extension while keeping performance consistent each time a language is added

-> Therefore, the paper proposes Multilingual DistilWhisper, which extends Whisper-Small using LS modules and KD

Multilingual DistilWhisper
- Introduces a Conditional Language-Specific Routing (CLSR) module that routes input representations to either the original feed-forward layer or a newly learned LS layer
- Performs Knowledge Distillation with Whisper-Large-V2 as the teacher model

< Overview of Multilingual DistilWhisper >

A Whisper model that adopts Knowledge Distillation and CLSR modules to handle diverse languages
As a result, it achieves generalization across languages and strong out-of-domain performance

2. Method

The paper aims to improve ASR performance across diverse languages under limited capacity
- To this end, Conditional Language-Specific Routing (CLSR) modules are plugged into Whisper-Small,
- and these modules are jointly optimized through ASR fine-tuning and Knowledge Distillation from Whisper-Large-V2

- CLSR Module

The paper extends the CLSR module to the speech domain
- The CLSR module uses the hidden embedding $z^{l}$ to learn a hard binary gate $g(\cdot)$ for each input token
  1. The layer then selectively guides information through the LS path $h^{lang}$ and the shared path $h^{shared}$, as in (Eq. 1):
    (Eq. 1) $\text{CLSR}(z^{l})=g(z^{l})\cdot h^{lang}(z^{l})+(1-g(z^{l}))\cdot h^{shared}(z^{l})$
  2. Unlike the original CLSR, the paper uses LS gates
    - This lets DistilWhisper train the LS components individually and then load only the relevant modules at inference time
- In addition, the paper limits CLSR to the feed-forward layers, which greatly reduces the number of parameters
- Structurally, each gate $g(\cdot)$ is a 2-layer bottleneck network, and during training it is summed with an increasing zero-mean Gaussian noise to discretize it
  - At inference time, hard gating is adopted
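
The routing in (Eq. 1) can be sketched for a single token as follows. This is a minimal illustration in plain Python, not the authors' implementation: the helper names (`gate_logit`, `clsr`), the toy weights, the use of a sigmoid on the noisy logit during training, and the zero threshold for hard gating are all assumptions, and the schedule that increases `noise_std` is left out.

```python
import math
import random

def linear(x, weight, bias):
    """Apply y = W x + b to one token vector (plain lists of floats)."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]

def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0.0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def gate_logit(z, gate_params):
    """Two-layer bottleneck network that maps a token to one scalar logit."""
    w1, b1, w2, b2 = gate_params
    hidden = [max(0.0, v) for v in linear(z, w1, b1)]  # ReLU is an assumption
    return linear(hidden, w2, b2)[0]

def clsr(z, h_lang, h_shared, gate_params, training=False, noise_std=0.0, rng=random):
    """Eq. 1 for one token: g(z) * h_lang(z) + (1 - g(z)) * h_shared(z)."""
    logit = gate_logit(z, gate_params)
    if training:
        # Assumption: zero-mean Gaussian noise is added to the logit, and
        # noise_std is raised over training by a schedule not shown here.
        g = sigmoid(logit + rng.gauss(0.0, noise_std))
    else:
        # Hard gating at inference: the gate is exactly 0 or 1.
        g = 1.0 if logit > 0.0 else 0.0
    lang_out, shared_out = h_lang(z), h_shared(z)
    return [g * a + (1.0 - g) * b for a, b in zip(lang_out, shared_out)], g

# Toy setup: 2-dim tokens, bottleneck width 1, hypothetical weights.
gate_params = ([[1.0, -1.0]], [0.0], [[1.0]], [-0.5])
h_shared = lambda z: [2.0 * v for v in z]   # stands in for the frozen shared FFN
h_lang = lambda z: [v + 10.0 for v in z]    # stands in for the LS FFN

out_lang, g_lang = clsr([2.0, 0.0], h_lang, h_shared, gate_params)      # gate opens
out_shared, g_shared = clsr([0.0, 2.0], h_lang, h_shared, gate_params)  # gate stays closed
```

With these toy weights the first token is routed entirely through `h_lang` and the second entirely through `h_shared`, which is the token-level behaviour the hard gate is meant to provide at inference.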

- DistilWhisper Approach

The student model is enriched with CLSR modules at the feed-forward layers for each language
- Each CLSR layer is initialized from the frozen weights of the corresponding feed-forward layer
  1. During training, the model updates only the LS layers and gates of each language
  2. At inference, the model loads the shared (multilingual) layers together with the LS modules and gates of the language of interest
    - That is, because CLSR modules can route at the token level, they offer more flexibility than adapters
- As a result, the paper leverages pre-existing knowledge (the shared frozen modules) through LS gating activation
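
The bookkeeping described above (copy the frozen feed-forward weights into each LS layer, update only one language's LS layer and gate, and load only that language at inference) can be sketched with a toy container. The class name `CLSRStudent`, its methods, the weight dictionaries, and the language codes `"ca"` and `"pl"` are hypothetical and chosen only for illustration.

```python
import copy

class CLSRStudent:
    """Toy container: one frozen shared feed-forward layer plus per-language experts."""

    def __init__(self, shared_ffn):
        self.shared_ffn = shared_ffn  # frozen multilingual weights
        self.ls_ffn = {}              # language -> LS feed-forward weights
        self.gates = {}               # language -> gate parameters

    def add_language(self, lang, gate_init):
        # The LS layer starts as a copy of the frozen shared feed-forward weights.
        self.ls_ffn[lang] = copy.deepcopy(self.shared_ffn)
        self.gates[lang] = copy.deepcopy(gate_init)

    def trainable_parameters(self, lang):
        # Only the LS layer and the gate of this language are updated in training.
        return {"ls_ffn": self.ls_ffn[lang], "gate": self.gates[lang]}

    def load_for_inference(self, lang):
        # Shared layers plus the modules of the language of interest only.
        return {
            "shared_ffn": self.shared_ffn,
            "ls_ffn": self.ls_ffn[lang],
            "gate": self.gates[lang],
        }

student = CLSRStudent({"w": [[1.0, 0.0], [0.0, 1.0]], "b": [0.0, 0.0]})
for lang in ("ca", "pl"):
    student.add_language(lang, {"w": [0.0, 0.0], "b": 0.0})

# Simulate one update step on the "ca" expert; nothing else should change.
student.trainable_parameters("ca")["ls_ffn"]["w"][0][0] = 0.5
loaded = student.load_for_inference("ca")
```

Because each language owns a separate copy, an update to one expert leaves both the shared layer and the other experts untouched, so languages can be trained individually and added later.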

- DistilWhisper Optimization

To train the CLSR module parameters,
- In addition to the cross-entropy loss $\mathcal{L}_{\text{CE}}$, a gate budget loss $\mathcal{L}_{\text{g}}$ is introduced to balance the LS and language-shared modules:
  (Eq. 2) $\mathcal{L}_{\text{g}}=\left|\frac{\sum_{(X,Y)\in\mathcal{B}} \mathcal{G}_{(X,Y)}}{\sum_{(X,Y)\in\mathcal{B}}\left(|X||\mathcal{M}_{\text{enc}}|+|Y||\mathcal{M}_{\text{dec}}|\right) }-b\right|$
  - $\mathcal{G}_{(X,Y)}=\sum_{x\in X}\sum_{m\in\mathcal{M}_{\text{enc}}}g_{m}(x)+\sum_{y\in Y}\sum_{m\in\mathcal{M}_{\text{dec}}}g_{m}(y)$ : sum of the gate $g(\cdot)$ activation values for an $(\text{audio, text})$ pair $(X,Y)$ in batch $\mathcal{B}$
  - $\mathcal{M}_{\text{enc}},\mathcal{M}_{\text{dec}}$ : the sets of encoder and decoder layers, respectively
  - $g_{m}(\cdot)=1$ : the LS layer is selected, $g_{m}(\cdot)=0$ : otherwise
  - $b$ : the budget, which constrains gate usage
- For KD, the JS divergence is adopted, formulated as:
  (Eq. 3) $\mathcal{L}_{\text{KD}}=\frac{1}{2}\mathbb{E}_{\mathbf{Y}\sim p}\left[\log \frac{p(\mathbf{Y})}{m(\mathbf{Y})}\right]+\frac{1}{2}\mathbb{E}_{\mathbf{Y}'\sim q_{\theta}}\left[\log \frac{q_{\theta}(\mathbf{Y}')}{m(\mathbf{Y}')}\right]$
  - $p$ : teacher distribution, $q_{\theta}$ : student distribution
  - $\mathbf{Y},\mathbf{Y}'$ : samples from the teacher and student distributions, respectively
  - $m(\cdot) =\frac{1}{2}p(\cdot)+\frac{1}{2}q_{\theta}(\cdot)$ : the average of the teacher and student distributions
- As a result, the CLSR modules are optimized by minimizing the final loss $\mathcal{L}=\mathcal{L}_{\text{CE}}+\mathcal{L}_{\text{g}}+\alpha\mathcal{L}_{\text{KD}}$
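
As a rough numerical check of (Eq. 2), (Eq. 3), and the final loss, the sketch below evaluates them on small hand-made inputs. It rests on several assumptions: gate values are stored as nested lists indexed by position and layer, the JS divergence is computed exactly over a small discrete distribution rather than estimated from sampled sequences, and the budget, cross-entropy value, and $\alpha$ used in the example are arbitrary placeholders.

```python
import math

def gate_budget_loss(batch, n_enc_layers, n_dec_layers, budget):
    """Eq. 2. `batch` is a list of (enc_gates, dec_gates) per (X, Y) pair, where
    enc_gates[t][m] is the gate value of encoder layer m at audio frame t and
    dec_gates[t][m] is the gate value of decoder layer m at text token t."""
    total_open = 0.0
    total_slots = 0
    for enc_gates, dec_gates in batch:
        total_open += sum(sum(row) for row in enc_gates)
        total_open += sum(sum(row) for row in dec_gates)
        total_slots += len(enc_gates) * n_enc_layers + len(dec_gates) * n_dec_layers
    return abs(total_open / total_slots - budget)

def js_divergence(p, q):
    """Eq. 3 for two discrete distributions defined over the same outcomes."""
    m = [0.5 * (a + b) for a, b in zip(p, q)]

    def kl(u, v):
        # Terms with zero probability contribute nothing to the expectation.
        return sum(a * math.log(a / b) for a, b in zip(u, v) if a > 0.0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def total_loss(ce, gate, kd, alpha):
    """Final objective: L_CE + L_g + alpha * L_KD."""
    return ce + gate + alpha * kd

# One (audio, text) pair: 2 audio frames and 1 text token, 2 layers on each side.
batch = [([[1, 0], [1, 1]], [[0, 1]])]
l_g = gate_budget_loss(batch, n_enc_layers=2, n_dec_layers=2, budget=0.5)

teacher = [0.7, 0.2, 0.1]   # placeholder teacher distribution p
student = [0.5, 0.3, 0.2]   # placeholder student distribution q_theta
l_kd = js_divergence(teacher, student)

loss = total_loss(ce=1.0, gate=l_g, kd=l_kd, alpha=2.0)
```

In the toy batch 4 of the 6 gate slots are open, so the loss penalizes the distance between the usage ratio 4/6 and the budget 0.5. The JS term is symmetric in teacher and student and is bounded by $\log 2$, which it reaches only when the two distributions do not overlap.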

3. Experiments

- Settings

Dataset : CommonVoice13, FLEURS
Comparisons : Whisper

- Results

Overall, Multilingual DistilWhisper achieves the best performance

Effect of Training Data Size
- ASR performance improves as the number of trainable examples increases

Gate Activation Analysis
- In the out-of-domain setting, the model tends to rely on the LS modules
- As the training data size grows, the usage and reliability of the LS modules increase
