[Paper Review] Beyond Hard Sharing: Efficient Multi-Task Speech-to-Text Modeling with Supervised Mixture of Experts

feVeRin 2025. 8. 30. 07:41

Beyond Hard Sharing: Efficient Multi-Task Speech-to-Text Modeling with Supervised Mixture of Experts

Hard parameter sharing degrades model performance because of task interference
S-MoE
- Eliminates the gating function by using special guiding tokens that route each task to its designated expert
- Applies S-MoE to a Speech-to-Text model so that it can process mixed-bandwidth input
Paper (INTERSPEECH 2025) : Paper Link

1. Introduction

Speech-to-Text (STT) models are mostly trained on wideband (WB) audio, which provides richer acoustic features, but mobile environments use narrowband (NB) audio
- Because of this sampling rate gap, NB and WB STT models have to be trained separately
- To avoid this, a single model has to learn a shared representation across different inputs and tasks through Multi-Task Learning (MTL)
  1. BUT, existing MTL methods such as hard parameter sharing lose performance to task interference
  2. A modular architecture such as Mixture of Experts (MoE) can allocate specialized parameters to different tasks, but it depends on a gating function for routing

-> The paper therefore proposes S-MoE for better multi-task STT training

S-MoE
- Introduces special guiding tokens that explicitly route each task to a dedicated expert network
- Applies the method to an STT model to process input of different bandwidths

< Overview of S-MoE >

A Supervised Mixture of Experts method that uses special guiding tokens
As a result, it outperforms existing methods

2. Method

- Supervised Mixture of Experts (S-MoE)

In a standard MoE layer, let $E_{i}(x)$ denote the output of the $i$-th expert network for input $x$
- The output $y$ of the S-MoE module is then:
  (Eq. 1) $ y=\sum_{i=0}^{n-1}G'(x)_{i}E_{i}(x)$
- A conventional MoE model depends on a learnable gating network $G$, whereas S-MoE uses a pre-defined gating function $G'$ and so eliminates the training of an additional gating network
  - In the paper, the number of expert networks $n$ is fixed to $2$ for both the encoder and the decoder
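Eq. 1 with a fixed one-hot gate can be sketched in a few lines of plain Python. This is only an illustration under assumptions: `s_moe_forward`, `double`, and `negate` are hypothetical names, the toy experts stand in for the FFN blocks, and vectors are plain lists rather than tensors.

```python
from typing import Callable, Sequence

Vector = list[float]
Expert = Callable[[Vector], Vector]


def s_moe_forward(x: Vector, gate: Sequence[float], experts: Sequence[Expert]) -> Vector:
    """Compute y = sum_i gate[i] * experts[i](x) for a pre-defined gate (Eq. 1)."""
    if len(gate) != len(experts):
        raise ValueError("gate and experts must have the same length")
    y = [0.0] * len(x)
    for g, expert in zip(gate, experts):
        if g == 0:
            continue  # an expert with a zero gate is never evaluated
        out = expert(x)
        y = [acc + g * o for acc, o in zip(y, out)]
    return y


# Two toy experts standing in for the FFN blocks (n = 2, as in the paper).
def double(x: Vector) -> Vector:
    return [2.0 * v for v in x]


def negate(x: Vector) -> Vector:
    return [-v for v in x]


y0 = s_moe_forward([1.0, 2.0], gate=[1, 0], experts=[double, negate])
y1 = s_moe_forward([1.0, 2.0], gate=[0, 1], experts=[double, negate])
```

Because the gate is given rather than predicted, no gating parameters exist to train, and only the selected expert is ever called.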
Gating Function for Encoder S-MoE
- A model trained on NB degrades on WB speech
- To handle both NB and WB signals within a single model, the paper incorporates S-MoE into the encoder
  1. The Feed-Forward Networks (FFN) serve as the specialized experts for each bandwidth, and the remaining components of the encoder block are shared between NB and WB input
  2. Bandwidth information is pre-labeled for every input, so the gating function $G'(x)$ can selectively activate the appropriate expert
    - As in a standard MoE, when $G'(x)_{i}=0$ the corresponding expert $E_{i}(x)$ is not computed
- As a result, WB signals are processed by expert $E_{0}$ and NB signals by expert $E_{1}$, and the gating function of the Encoder S-MoE is:
  (Eq. 2) $G'_{enc}(x)_{0}=\left\{\begin{matrix}
  1, & \text{if}\,\,x\,\,\text{is a WB signal} \\
  0, & \text{if}\,\,x\,\,\text{is an NB signal} \\
  \end{matrix}\right.,\quad
  G'_{enc}(x)_{1}=\left\{\begin{matrix}
  1, & \text{if}\,\,x\,\,\text{is an NB signal} \\
  0, & \text{if}\,\,x\,\,\text{is a WB signal} \\
  \end{matrix}\right.$
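The encoder gate is simply a lookup on the pre-labeled bandwidth. A minimal sketch, assuming the label arrives as the string "WB" or "NB" (the function names and the label format are hypothetical, not taken from the paper):

```python
def encoder_gate(bandwidth: str) -> tuple[int, int]:
    """Return (G'_enc(x)_0, G'_enc(x)_1) for a pre-labeled bandwidth."""
    if bandwidth == "WB":
        return (1, 0)  # WB signal -> expert E_0
    if bandwidth == "NB":
        return (0, 1)  # NB signal -> expert E_1
    raise ValueError(f"unknown bandwidth label: {bandwidth!r}")


def active_expert(gate: tuple[int, int]) -> int:
    """Index of the single expert that a one-hot gate activates."""
    return gate.index(1)


wb_expert = active_expert(encoder_gate("WB"))
nb_expert = active_expert(encoder_gate("NB"))
```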
Gating Function for Decoder S-MoE
- As with the encoder, the paper also applies S-MoE to the decoder to handle the Automatic Speech Recognition (ASR) and Speech Translation (ST) tasks
  - The decoding expert is determined by the task tag prepended to the text input
- According to the target task, the gating function directs ASR input to expert $E_{1}$ and ST input to expert $E_{0}$
- The resulting gating function of the Decoder S-MoE is:
  (Eq. 3) $G'_{dec}(x)_{0}=\left\{\begin{matrix}
  1, & \text{if the task of}\,\,x\,\,\text{is ST} \\
  0, & \text{if the task of}\,\,x\,\,\text{is ASR} \\
  \end{matrix}\right.,\quad
  G'_{dec}(x)_{1}=\left\{\begin{matrix}
  1, & \text{if the task of}\,\,x\,\,\text{is ASR} \\
  0, & \text{if the task of}\,\,x\,\,\text{is ST} \\
  \end{matrix}\right.$
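On the decoder side the gate can be read off the task tag at the front of the text input. The sketch below assumes that `<transcribe>` marks ASR and `<translate>` marks ST (the two task tags this post describes for the token sequence); `decoder_gate` and `TASK_TAG_TO_EXPERT` are hypothetical names.

```python
# Assumed mapping: <transcribe> marks ASR (expert E_1), <translate> marks ST (expert E_0).
TASK_TAG_TO_EXPERT = {"<translate>": 0, "<transcribe>": 1}


def decoder_gate(tokens: list[str]) -> tuple[int, int]:
    """Return (G'_dec(x)_0, G'_dec(x)_1) from the task tag at the front of the text input."""
    if not tokens or tokens[0] not in TASK_TAG_TO_EXPERT:
        raise ValueError("text input must start with a known task tag")
    expert = TASK_TAG_TO_EXPERT[tokens[0]]
    return (1, 0) if expert == 0 else (0, 1)


asr_gate = decoder_gate(["<transcribe>", "<ko>", "<beginning_of_sentence>"])
st_gate = decoder_gate(["<translate>", "<en>", "<beginning_of_sentence>"])
```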

- Embedding Flow of the S-MoE Model

The training process uses special tokens to guide the model
- A task tag, $\text{<transcribe>}$ or $\text{<translate>}$, is inserted at the beginning of each sequence, and a target language tag, $\text{<en>}$ or $\text{<ko>}$, is added
- After the tags, the sequence starts with the $\text{<beginning_of_sentence>}$ token, which is followed by the target text
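The token layout described above can be sketched as follows. The ordering (task tag, then language tag, then the beginning-of-sentence token) is this sketch's reading of the description, and the helper name and the dictionary keys are hypothetical.

```python
TASK_TAGS = {"asr": "<transcribe>", "st": "<translate>"}
LANGUAGE_TAGS = {"en": "<en>", "ko": "<ko>"}
BOS = "<beginning_of_sentence>"


def build_decoder_sequence(task: str, language: str, target_tokens: list[str]) -> list[str]:
    """Prepend the guiding tokens to the target text (assumed order: task, language, BOS)."""
    return [TASK_TAGS[task], LANGUAGE_TAGS[language], BOS, *target_tokens]


sequence = build_decoder_sequence("st", "en", ["hello", "world"])
```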
Training Phase
- Input audio, either NB (8kHz) or WB (16kHz), is first passed to the Transformer encoder
  1. Inside the encoder, the FFN block selected by the gating function of the Encoder S-MoE is used
  2. The gating function of the Encoder S-MoE picks the expert according to the bandwidth of the input signal
    - That is, NB and WB inputs go through different FFN blocks, so the model can effectively capture bandwidth-specific representations
- The decoder uses 2 separate FFN blocks in the Decoder S-MoE to process the shared encoded representation
- Each training batch consists of samples from a single task, which ensures task-specific optimization
  - The model is trained on interleaved batches that alternate between the ASR and ST tasks
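One way to produce single-task batches that alternate between ASR and ST is shown below. The strict one-to-one alternation, and the choice to keep emitting the longer task's batches once the shorter one runs out, are assumptions of this sketch; the post only states that batches are single-task and interleaved.

```python
from itertools import zip_longest
from typing import Iterator


def make_batches(samples: list[str], batch_size: int) -> list[list[str]]:
    """Split one task's samples into consecutive batches."""
    return [samples[i:i + batch_size] for i in range(0, len(samples), batch_size)]


def interleave_tasks(asr: list[str], st: list[str],
                     batch_size: int) -> Iterator[tuple[str, list[str]]]:
    """Yield (task, batch) pairs, alternating ASR and ST; every batch holds one task only."""
    pairs = zip_longest(make_batches(asr, batch_size), make_batches(st, batch_size))
    for asr_batch, st_batch in pairs:
        if asr_batch is not None:
            yield ("ASR", asr_batch)
        if st_batch is not None:
            yield ("ST", st_batch)


schedule = list(interleave_tasks(["a1", "a2", "a3"], ["s1", "s2"], batch_size=2))
```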
Inference Phase
- At inference time, the model can generate ASR and ST output simultaneously
- Setting the batch size to $2$ and assigning the ASR and ST tasks to the two entries yields both the transcription and the translation result in a single inference step
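The batch-of-2 inference setup might be assembled as in the following sketch. It assumes that both entries carry the same utterance and differ only in their decoder prefix; `build_inference_batch` and the dictionary layout are hypothetical, and the model call itself is omitted.

```python
BOS = "<beginning_of_sentence>"


def build_inference_batch(audio: list[float], bandwidth: str,
                          asr_language: str, st_language: str) -> list[dict]:
    """Pair one utterance with an ASR prefix and an ST prefix (batch size 2)."""
    prefixes = [
        ["<transcribe>", f"<{asr_language}>", BOS],  # routed to the ASR expert
        ["<translate>", f"<{st_language}>", BOS],    # routed to the ST expert
    ]
    return [{"audio": audio, "bandwidth": bandwidth, "prefix": p} for p in prefixes]


batch = build_inference_batch([0.0, 0.1, -0.1], "NB", asr_language="ko", st_language="en")
```

The encoder sees the same audio and bandwidth label in both entries, so only the decoder experts differ between the two outputs.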

3. Experiments

- Settings

Dataset : AIHub
Comparisons : Whisper

- Results

Overall, the best performance is achieved when S-MoE is used

Encoder-Decoder S-MoE
- In particular, S-MoE achieves strong performance in the Encoder-Decoder architecture for STT
