[Paper Review] LoRA-Whisper: Parameter-Efficient and Extensible Multilingual ASR

feVeRin 2025. 4. 1. 21:22

LoRA-Whisper: Parameter-Efficient and Extensible Multilingual ASR

Multilingual Automatic Speech Recognition (ASR) needs to mitigate language interference and to incorporate new languages without degrading performance on existing ones
LoRA-Whisper
- Incorporates LoRA matrices into Whisper to mitigate language interference
- Exploits LoRA and the similarity between languages to improve performance on new languages
Paper (Interspeech 2024) : Paper Link

1. Introduction

Automatic Speech Recognition (ASR) aims to transcribe speech into written text
- In particular, Whisper is trained on a large-scale multilingual dataset and can be customized for multilingual speech recognition
- BUT, multilingual ASR has the following two limitations:
  1. Language Interference caused by language overlap, data imbalance, and dialectal accents
    - Language IDs or language-specific modules can address this, but they make the model design more complex
  2. Language Expansion to new languages
    - A naive approach is to fine-tune on the new language, but this causes catastrophic forgetting
    - Continual learning is another option, but it is inefficient and time-consuming

-> The paper therefore proposes LoRA-Whisper, a parameter-efficient and extensible model for multilingual ASR

LoRA-Whisper
- Uses Low-Rank Adaptation (LoRA) to tailor Whisper to specific languages
  - That is, information shared across languages is stored in the Whisper model, while each LoRA matrix captures language-specific information
- When a new language is incorporated, a new LoRA matrix is assigned to it, which avoids degrading performance on existing languages
- Additionally supports LoRA matrix initialization based on the similarity between the new language and the base languages, as well as Mixture of Experts (MoE)

< Overview of LoRA-Whisper >

A multilingual ASR model that combines LoRA with Whisper
As a result, it achieves better ASR performance than existing baselines

2. Background

- Whisper

Whisper is an encoder-decoder Transformer model that can perform multiple speech tasks such as multilingual speech recognition, speech translation, and language identification
- Whisper first takes as input a 30-second, 80-dimensional log-mel-spectrogram $\mathbf{X} = [x_1, x_2, ..., x_T]$
  - $T$ : context length
- The encoder blocks then encode the input speech features into a hidden representation $\mathbf{H}$:
  (Eq. 1) $\mathbf{H} = \text{AudioEncoder}(\mathbf{X})$
- The decoder blocks decode the hidden representation into text tokens $\hat{y}$, recursively conditioning on the previous tokens and a special prompt $p$:
  (Eq. 2) $\hat{y}_t = \text{TextDecoder}(p, \hat{y}_{1:t-1}, \mathbf{H})$
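
The recursion in Eq. 1 and Eq. 2 can be sketched as a greedy decoding loop. This is only a toy illustration: `toy_encoder`, `toy_decoder`, the `EOT` token id, and the prompt ids are hypothetical stand-ins, not Whisper's actual modules or tokenizer.

```python
from typing import Callable, List, Sequence

EOT = 0  # hypothetical end-of-transcript token id (assumption, not Whisper's real id)


def greedy_decode(
    audio_encoder: Callable[[Sequence[Sequence[float]]], List[float]],
    text_decoder: Callable[[List[int], List[int], List[float]], int],
    X: Sequence[Sequence[float]],
    prompt: List[int],
    max_len: int = 16,
) -> List[int]:
    """Encode once (Eq. 1), then emit tokens one at a time (Eq. 2)."""
    H = audio_encoder(X)  # Eq. 1: H = AudioEncoder(X)
    y: List[int] = []
    for _ in range(max_len):
        # Eq. 2: y_t = TextDecoder(p, y_{1:t-1}, H)
        y_t = text_decoder(prompt, y, H)
        if y_t == EOT:
            break
        y.append(y_t)
    return y


def toy_encoder(X: Sequence[Sequence[float]]) -> List[float]:
    # Stand-in encoder: one scalar per frame (the frame mean).
    return [sum(frame) / len(frame) for frame in X]


def toy_decoder(prompt: List[int], prev: List[int], H: List[float]) -> int:
    # Stand-in decoder: one token per encoded frame, then EOT.
    t = len(prev)
    return EOT if t >= len(H) else int(H[t]) + 1


tokens = greedy_decode(toy_encoder, toy_decoder, [[1.0, 3.0], [4.0, 6.0]], prompt=[101, 102])
print(tokens)
```

The point of the sketch is that the encoder runs once per utterance, while the decoder is called repeatedly with the growing token prefix.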

- LoRA

LoRA is commonly used to tailor a Large Language Model (LLM) to a specific domain or downstream task
- Specifically, LoRA freezes the original weights and trains a pair of rank decomposition matrices, which reduces the number of trainable parameters
- That is, given the $i$-th feed-forward layer $f_i(x) = \mathbf{W}_i x + \mathbf{b}_i$, LoRA modifies the forward process as follows:
  (Eq. 3) $f_i(x) = (\mathbf{W}_i + \Delta\mathbf{W}_i) x + \mathbf{b}_i, \,\,\, \Delta\mathbf{W}_i = \mathbf{B}_i \mathbf{A}_i$
  - $\mathbf{W}_i \in \mathbb{R}^{d_1 \times d_2}, \mathbf{b}_i \in \mathbb{R}^{d_1}$ : frozen weight and bias, respectively
  - $\mathbf{B}_i \in \mathbb{R}^{d_1 \times r}, \mathbf{A}_i \in \mathbb{R}^{r \times d_2}$ : trainable low-rank matrices with rank $r \ll \min(d_1, d_2)$
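
A minimal pure-Python sketch of Eq. 3 follows, together with the parameter count that makes LoRA cheap to train. The helper names and the tiny matrices are illustrative assumptions, and the layer size $d_1 = d_2 = 1024$ used in the count is an arbitrary example rather than a value from the paper.

```python
from typing import List

Matrix = List[List[float]]
Vector = List[float]


def matvec(W: Matrix, x: Vector) -> Vector:
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def matmul(B: Matrix, A: Matrix) -> Matrix:
    cols = list(zip(*A))
    return [[sum(b * a for b, a in zip(row, col)) for col in cols] for row in B]


def lora_forward(W: Matrix, b: Vector, B: Matrix, A: Matrix, x: Vector) -> Vector:
    """Eq. 3: f(x) = (W + BA) x + b, where W and b stay frozen."""
    delta_W = matmul(B, A)  # (d1 x r) @ (r x d2) -> d1 x d2
    W_eff = [[w + d for w, d in zip(rw, rd)] for rw, rd in zip(W, delta_W)]
    return [h + bias for h, bias in zip(matvec(W_eff, x), b)]


def lora_trainable_params(d1: int, d2: int, r: int) -> int:
    # B has d1 * r entries and A has r * d2 entries.
    return r * (d1 + d2)


# Tiny example with d1 = 2, d2 = 3, r = 1.
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # frozen weight
b = [0.5, -0.5]                          # frozen bias
B = [[1.0], [2.0]]                       # trainable, d1 x r
A = [[0.0, 0.0, 1.0]]                    # trainable, r x d2
y = lora_forward(W, b, B, A, [1.0, 2.0, 3.0])
print(y)

# Illustrative layer size (assumption): 1024 x 1024 with rank r = 32.
print(lora_trainable_params(1024, 1024, 32), "vs", 1024 * 1024)
```

With these illustrative sizes, the LoRA pair has 16 times fewer entries than the full weight matrix it adapts.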

3. Method

LoRA-Whisper aims to solve the language interference and new language incorporation problems in multilingual ASR

- Problem Statement

Multilingual ASR uses $n$ base languages, and language expansion uses $m$ new languages
- That is, the two sets can be written as $S_1 = \{(\mathbf{X}_i, \mathbf{Y}_i), i \in (1, n)\}, S_2 = \{(\mathbf{X}_j, \mathbf{Y}_j), j \in (n+1, n+m)\}$
  - $\mathbf{X}_i, \mathbf{Y}_i$ : speech and transcription of the $i$-th language, respectively
- Multilingual ASR aims to mitigate language interference within $S_1$ and to improve ASR performance on the base languages
- Language expansion aims to incorporate $S_2$ into the multilingual model without affecting performance on $S_1$, and to improve ASR performance on $S_2$ by exploiting the similarity between $S_1$ and $S_2$

- Multilingual ASR

Applying LoRA to multilingual ASR can mitigate language interference
- First, a language-specific LoRA matrix is appended to the Whisper encoder and decoder for each language
  - When the input is speech in the $k$-th language, the $k$-th LoRA module is activated, and the forward pass goes through Whisper and that LoRA module
- In LoRA-Whisper, information shared across languages resides in the original Whisper model, while language-specific information is stored in the LoRA modules
  - This avoids the language interference problem and also improves performance on specific languages
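
The per-language routing described above can be sketched as a single linear layer that keeps one frozen weight and a dictionary of LoRA pairs. The class name `LoRALinear`, its methods, and the language codes are hypothetical choices for illustration, not the authors' implementation.

```python
from typing import Dict, List, Tuple

Matrix = List[List[float]]
Vector = List[float]


def matvec(W: Matrix, x: Vector) -> Vector:
    return [sum(w * v for w, v in zip(row, x)) for row in W]


class LoRALinear:
    """Frozen shared layer plus one (B, A) LoRA pair per language."""

    def __init__(self, W: Matrix, b: Vector) -> None:
        self.W = W  # shared, frozen
        self.b = b  # shared, frozen
        self.adapters: Dict[str, Tuple[Matrix, Matrix]] = {}

    def add_language(self, lang: str, B: Matrix, A: Matrix) -> None:
        # Adding a language only adds a new pair; existing pairs are untouched.
        self.adapters[lang] = (B, A)

    def forward(self, x: Vector, lang: str) -> Vector:
        B, A = self.adapters[lang]  # activate only the k-th LoRA module
        base = [s + bi for s, bi in zip(matvec(self.W, x), self.b)]
        low_rank = matvec(B, matvec(A, x))  # B (A x), avoids forming B A
        return [u + v for u, v in zip(base, low_rank)]


layer = LoRALinear(W=[[1.0, 0.0], [0.0, 1.0]], b=[0.0, 0.0])
layer.add_language("en", B=[[1.0], [0.0]], A=[[1.0, 1.0]])
layer.add_language("de", B=[[0.0], [1.0]], A=[[1.0, -1.0]])

x = [2.0, 3.0]
out_en = layer.forward(x, "en")
out_de = layer.forward(x, "de")
print(out_en, out_de)
```

Because each language only touches its own pair, the same input produces different outputs per language while the shared weight stays identical.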

- Language Expansion

Beyond mitigating language interference, LoRA also supports language expansion by preventing catastrophic forgetting
- In particular, exploiting the similarity between languages enables effective training on new languages
- To this end, the paper introduces LoRA-warm start and LoRA-MoE
Each method follows the two steps below:
1. Step 1: Find the Most Similar Language
  - To incorporate a new language, $M$ audio segments are first randomly sampled from the new language data
    - These audio segments are processed by the Whisper model for language detection, whose output is a probability distribution over all languages
  - Next, focusing only on the languages used in multilingual ASR, the probabilities of those languages are extracted and normalized:
    - That is, $p_i = [p_{i1}, p_{i2}, ..., p_{in}], \,\,\, i = 1, ..., M$
  - As a result, when a language is incorporated, the most similar language can be found by computing the similarity between the new language and each base language:
    (Eq. 4) $\text{sim}_k = \frac{\sum_{i=1}^{M} \mathbb{I}(k = \arg\max_j p_{ij})}{M}, \,\,\, \text{for} \,\, k = 1, ..., n$
    - $\mathbb{I}$ : indicator function
    - $\text{sim}_k$ : similarity between the $k$-th base language and the new language
2. Step 2: Continual Training on New Languages
  - Once the most similar language is found, base language information is used to facilitate training on the new language
  - For LoRA-warm start, the new LoRA matrix is initialized with the LoRA matrix of the most similar language
  - For LoRA-MoE, two LoRA modules are selected in the forward pass to support training on the new language
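
The two steps above can be sketched for the LoRA-warm start case as follows. The language-detection probabilities are made-up numbers standing in for Whisper's output, and the function names, language codes, and adapter values are hypothetical. LoRA-MoE is not sketched here because its selection rule is not detailed in this review.

```python
import copy
from typing import Dict, List, Sequence, Tuple


def restrict_and_normalize(probs: Dict[str, float], base_langs: Sequence[str]) -> List[float]:
    """Keep only base-language probabilities and renormalize them to sum to 1."""
    kept = [probs.get(lang, 0.0) for lang in base_langs]
    total = sum(kept)
    if total == 0.0:
        return [1.0 / len(base_langs)] * len(base_langs)  # fallback: uniform
    return [p / total for p in kept]


def language_similarity(
    segment_probs: Sequence[Dict[str, float]], base_langs: Sequence[str]
) -> Dict[str, float]:
    """Eq. 4: fraction of the M segments whose argmax is base language k."""
    if not segment_probs:
        raise ValueError("need at least one audio segment")
    counts = {lang: 0 for lang in base_langs}
    for probs in segment_probs:
        p_i = restrict_and_normalize(probs, base_langs)
        k = max(range(len(base_langs)), key=lambda j: p_i[j])
        counts[base_langs[k]] += 1
    M = len(segment_probs)
    return {lang: c / M for lang, c in counts.items()}


def warm_start(adapters: Dict[str, Tuple[list, list]], sim: Dict[str, float]):
    """Initialize the new LoRA pair as a copy of the most similar language's pair."""
    best = max(sim, key=sim.get)
    return best, copy.deepcopy(adapters[best])


base_langs = ["es", "it", "de"]
# Made-up detection outputs for M = 4 segments of a new language.
segment_probs = [
    {"pt": 0.6, "es": 0.3, "it": 0.1},
    {"pt": 0.5, "es": 0.1, "it": 0.3, "de": 0.1},
    {"pt": 0.3, "es": 0.4, "it": 0.2, "de": 0.1},
    {"pt": 0.3, "es": 0.5, "de": 0.2},
]
sim = language_similarity(segment_probs, base_langs)

adapters = {
    "es": ([[1.0]], [[2.0]]),
    "it": ([[3.0]], [[4.0]]),
    "de": ([[5.0]], [[6.0]]),
}
best, new_pair = warm_start(adapters, sim)
print(sim, best)
```

The deep copy matters: the new language's matrices are trained further afterwards, so they must not share storage with the base language's matrices.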

4. Experiments

- Settings

Dataset : FLEURS, MLS
Comparisons : Whisper

- Results

The best performance is achieved with LoRA rank $r = 32$

Multilingual ASR
- LoRA-Whisper achieves better multilingual ASR performance than Whisper with fewer trainable parameters

Language Expansion
- It also achieves strong WER on new language data

Ablation Study
- The best performance is obtained when the new LoRA matrix is initialized from the LoRA matrix of a similar language
