[Paper Review] M2R-Whisper: Multi-Stage and Multi-Scale Retrieval Augmentation for Enhancing Whisper

feVeRin 2025. 6. 18. 17:06

M2R-Whisper: Multi-Stage and Multi-Scale Retrieval Augmentation for Enhancing Whisper

Whisper is limited in accurately recognizing diverse subdialects
M2R-Whisper
- Introduces In-Context Learning and retrieval-augmented techniques into Whisper
- Applies sentence-level in-context learning in the pre-processing stage and token-level $k$-Nearest Neighbor in the post-processing stage
Paper (ICASSP 2025) : Paper Link

1. Introduction

Whisper performs well in Automatic Speech Recognition (ASR), but it is difficult to use in low-resource language settings
- In-Context Learning (ICL) can support low-resource adaptation without fine-tuning
  - BUT, existing approaches are limited to isolated word-level ICL
- Besides ICL, retrieval-augmented methods can also improve ASR performance without parameter updates
  1. In particular, $k$-Nearest Neighbor ($k$NN) with Connectionist Temporal Classification (CTC) pseudo labels is used to build speech-text key-value pairs,
    - and those labels are retrieved during decoding to refine the output distribution
  2. BUT, this approach is constrained by pseudo-label quality and is unstable when correcting errors other than substitution errors

-> Therefore, M2R-Whisper, which integrates ICL and $k$NN, is proposed to improve ASR adaptation

M2R-Whisper
- Extends word-level ICL to sentence-level ICL to improve Whisper's in-context learning capability
  - This acts as the pre-processing retrieval mechanism of the model
- Additionally integrates post-processing token-level $k$NN retrieval to form a multi-stage, multi-scale retrieval-augmented ASR system

< Overview of M2R-Whisper >

An ASR model that integrates sentence-level ICL and token-level $k$NN into Whisper
As a result, it achieves better ASR performance than existing methods

2. Method

M2R-Whisper combines pre-processing sentence-level ICL with post-processing token-level $k$NN to improve the performance of the ASR system

Overview

- Pre-Processing Sentence-Level ICL

First, the paper extends word-level ICL to the sentence level and uses it as a pre-processing retrieval mechanism
- To do so, a sentence-level datastore is built from the training set $S$, and the top-$k$ most similar audio-text pairs are retrieved to prompt Whisper
- Datastore Construction
  1. Given the Whisper model and an input audio $X$ from $S$, the encoder output embeddings are extracted and mean-pooled over frames to obtain the key $f(X)$
  2. The ground-truth label $Y$ is used as the value, and the sentence-level datastore $D_{s}$ is built as:
    (Eq. 1) $D_{s}=\{(f(X),Y)|X\in S\}$
- Prompt Retrieval
  1. At test time, the query $f(X)$ is derived from the test audio and used to retrieve the top-$k$ nearest audio-text pairs from the datastore
  2. The retrieved pairs are used to prompt the Whisper model so that it can harness contextual information
- Performing ICL for Whisper
  1. Whisper provides a special token $\text{prefix}$ to accept partial text of the input audio
    - This is how it manages long audio inputs over 30s
  2. Accordingly, the paper concatenates the retrieved prompt audio with the current test audio and uses the ground-truth text labels as the special token $\text{prefix}$ to boost ICL capability
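
The three steps above (datastore construction, prompt retrieval, ICL input assembly) can be sketched roughly as follows. This is a minimal NumPy illustration, not the paper's implementation: the encoder outputs are toy arrays standing in for real Whisper encoder embeddings, the function names are hypothetical, and both the L2 distance used for sentence-level retrieval and the space-joined prefix text are assumptions that the review does not specify.

```python
import numpy as np

def sentence_key(encoder_frames):
    """Mean-pool encoder output frames (T, D) into one key f(X) of shape (D,)."""
    return encoder_frames.mean(axis=0)

def build_sentence_datastore(encoder_outputs, transcripts):
    """Build D_s = {(f(X), Y)}: stacked keys (N, D) plus the list of labels Y."""
    keys = np.stack([sentence_key(e) for e in encoder_outputs])
    return keys, list(transcripts)

def retrieve_top_k(query, keys, k):
    """Indices of the k keys nearest to the query (L2 distance is an assumption)."""
    dists = np.linalg.norm(keys - query, axis=1)
    return np.argsort(dists)[:k]

def build_icl_input(test_wave, prompt_waves, prompt_texts):
    """Put retrieved prompt audio before the test audio; join labels as prefix text."""
    audio = np.concatenate(list(prompt_waves) + [test_wave])
    prefix = " ".join(prompt_texts)  # joining with a space is an assumption
    return audio, prefix

# Toy stand-ins: 3 training utterances with constant "encoder outputs" (D = 4).
train_enc = [np.full((5, 4), 0.0), np.full((7, 4), 1.0), np.full((6, 4), 5.0)]
train_txt = ["text_a", "text_b", "text_c"]
train_wav = [np.zeros(10), np.ones(20), np.full(15, 2.0)]
keys, values = build_sentence_datastore(train_enc, train_txt)

# Test utterance: its mean-pooled key is closest to utterance 1, then utterance 0.
test_enc = np.full((8, 4), 0.9)
test_wav = np.full(12, 3.0)
idx = retrieve_top_k(sentence_key(test_enc), keys, k=2)
audio, prefix = build_icl_input(
    test_wav, [train_wav[i] for i in idx], [values[i] for i in idx]
)
```

In a real system the concatenated `audio` would be fed to Whisper and `prefix` would be passed as the decoder prefix, so that the model only has to continue the transcript for the final (test) segment.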

M2R-Whisper Framework

- Post-Processing Token-Level $k$NN

Additionally, a token-level datastore that uses ground-truth tokens instead of frame-level CTC pseudo labels is introduced
- Datastore Construction
  1. For each input audio $X$ in the training set $S$, the token-level datastore $D_{t}$ is:
    (Eq. 2) $D_{t}=(K,V)=\{(g(x_{i}),y_{i})|X\in S\}$
    - $y_{i}$ : ground-truth token
  2. Here $g(x_{i})$ is the $i$-th intermediate embedding extracted from Whisper, i.e. the input to the Feed-Forward Network (FFN) of the final decoding layer after layer normalization
- Candidate Retrieval
  1. During decoding, the paper extracts the intermediate embedding $g(x)$ as the query and retrieves the $k$ nearest neighbors $\mathcal{N}$ at each step
  2. The $k$NN distribution is computed by aggregating the probability of each vocabulary unit over the retrieved neighbors $\mathcal{N}$:
    (Eq. 3) $P_{kNN}(y|x)\propto \sum_{(K_{i},V_{i})\in\mathcal{N},V_{i}=y}\exp(-d(K_{i},g(x))/\tau)$
    - $K_{i},V_{i}$ : the $i$-th key and value, respectively, $\tau$ : temperature, $d(\cdot, \cdot)$ : $L2$ distance
  3. The final prediction is obtained by interpolating the Whisper output distribution with the $k$NN distribution:
    (Eq. 4) $\tilde{P}(y|x)=\lambda P_{kNN}(y|x)+(1-\lambda)P(y|x)$
    - $\lambda$ : hyperparameter
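
To make Eq. 3 and Eq. 4 concrete, here is a small NumPy sketch of one decoding step. It is only an illustration under assumptions: the keys are toy 2-D vectors rather than Whisper decoder embeddings, the function names and the values of `k`, `tau`, and `lam` are made up for the example, and a brute-force search replaces whatever index structure a real system would use.

```python
import numpy as np

def knn_distribution(query, keys, values, vocab_size, k, tau):
    """Eq. 3: sum exp(-d(K_i, g(x)) / tau) over the k nearest neighbors, per token."""
    dists = np.linalg.norm(keys - query, axis=1)   # L2 distance d(K_i, g(x))
    nearest = np.argsort(dists)[:k]                # neighbor set N
    weights = np.exp(-dists[nearest] / tau)
    p_knn = np.zeros(vocab_size)
    np.add.at(p_knn, values[nearest], weights)     # accumulate weights where V_i = y
    return p_knn / p_knn.sum()                     # normalize the proportionality

def interpolate(p_model, p_knn, lam):
    """Eq. 4: lambda * P_kNN(y|x) + (1 - lambda) * P(y|x)."""
    return lam * p_knn + (1.0 - lam) * p_model

# Toy token-level datastore D_t = (K, V): 4 keys, vocabulary of 3 tokens.
keys = np.array([[0.0, 0.0], [0.0, 1.0], [3.0, 4.0], [10.0, 10.0]])
values = np.array([2, 2, 1, 0])

query = np.array([0.0, 0.0])                       # g(x) at the current step
p_knn = knn_distribution(query, keys, values, vocab_size=3, k=3, tau=1.0)

p_model = np.array([0.5, 0.3, 0.2])                # model alone prefers token 0
p_final = interpolate(p_model, p_knn, lam=0.5)     # neighbors shift the choice to token 2
```

Tokens with no retrieved neighbor get zero $k$NN probability, so their final probability is just the model probability scaled by $(1-\lambda)$. With very large distances or a very small $\tau$ the weights can underflow to zero, so a practical version would likely work in log space.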

3. Experiments

- Settings

Dataset : AISHELL-1, KeSpeech
Comparisons : Whisper, $k$NN-CTC, $k$NN-Whisper, Prompt-Whisper

Dataset Details

- Results

Overall, M2R-Whisper achieves the best performance

Model performance comparison

As the number of prompts increases, CER decreases while RTF increases

CER and RTF by number of prompts

M2R-Whisper is effective at reducing Substitution (S) and Insertion (I) errors

Substitution (S), Deletion (D), Insertion (I) Error
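
For reference, the S / D / I breakdown behind CER comes from a minimum-edit-distance alignment between the reference and the hypothesis, with CER = (S + D + I) / N for a reference of N characters. The sketch below is a generic illustration of that standard definition, not the paper's scoring script; the function names are hypothetical, and ties between equally cheap alignments are broken arbitrarily, so the split between S, D, and I can differ from other tools even when the total matches.

```python
def edit_counts(ref, hyp):
    """Return (S, D, I) from a minimum-edit-distance alignment of hyp against ref."""
    n, m = len(ref), len(hyp)
    # cost[i][j] = (total_edits, S, D, I) for ref[:i] versus hyp[:j]
    cost = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = (0, 0, 0, 0)
    for i in range(1, n + 1):
        cost[i][0] = (i, 0, i, 0)          # every reference character deleted
    for j in range(1, m + 1):
        cost[0][j] = (j, 0, 0, j)          # every hypothesis character inserted
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            t, s, d, ins = cost[i - 1][j - 1]
            if ref[i - 1] == hyp[j - 1]:
                diag = (t, s, d, ins)              # match, no edit
            else:
                diag = (t + 1, s + 1, d, ins)      # substitution
            t, s, d, ins = cost[i - 1][j]
            delete = (t + 1, s, d + 1, ins)        # reference character missing
            t, s, d, ins = cost[i][j - 1]
            insert = (t + 1, s, d, ins + 1)        # extra hypothesis character
            cost[i][j] = min(diag, delete, insert) # fewest total edits wins
    _, s, d, ins = cost[n][m]
    return s, d, ins

def cer(ref, hyp):
    """Character Error Rate: (S + D + I) / number of reference characters."""
    s, d, ins = edit_counts(ref, hyp)
    return (s + d + ins) / len(ref)

sub_case = edit_counts("abcd", "abxd")   # one substituted character
del_case = edit_counts("abcd", "abd")    # one deleted character
ins_case = edit_counts("abc", "abxc")    # one inserted character
```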
