[Paper Review] WhisperX: Time-Accurate Speech Transcription of Long-Form Audio

feVeRin 2025. 3. 18. 21:54

WhisperX: Time-Accurate Speech Transcription of Long-Form Audio

Weakly-supervised speech recognition models predict inaccurate timestamps for each utterance and do not provide word-level timestamps out of the box
In particular, their sequential nature makes batched inference difficult, because long audio is handled through buffered transcription
WhisperX
- A time-accurate speech recognition model with word-level timestamps
- Uses Voice Activity Detection and forced phoneme alignment to improve long-form transcription
Paper (INTERSPEECH 2023) : Paper Link

1. Introduction

Weakly supervised/unsupervised training methods perform well on tasks such as speech recognition and separation
- In particular, Whisper achieves strong speech transcription performance by combining weakly supervised pretraining with 680,000 hours of speech training data covering 96 languages
- Real-world applications, however, need to transcribe long-form audio such as podcasts and videos
  1. BUT, most Automatic Speech Recognition (ASR) models are trained on short audio segments, and the memory constraints of the transformer architecture make long input audio hard to process
    - Whisper, for example, operates on 30-second audio segments
  2. In addition, Whisper uses a buffered transcription approach to avoid overlapping/incomplete audio, shifting each window based on the predicted timestamps
    - As a result, timestamp inaccuracies in one window accumulate in subsequent windows, which makes the output prone to drifting
- Aligning the speech transcript at the word/phoneme level, as in forced alignment, can improve the ASR model

-> This motivates WhisperX, which combines Whisper with word-level timestamps to transcribe long-form audio

WhisperX
- Pre-segments the input audio with an external Voice Activity Detection (VAD) model
- Cuts & merges the resulting VAD segments to enable batched Whisper transcription
- Performs forced alignment with an external phoneme model to provide accurate word-level timestamps

< Overview of WhisperX >

Improves the original Whisper with VAD and word-level forced alignment
As a result, it achieves better performance than the original on long-form audio

2. Method

- Voice Activity Detection

Voice Activity Detection (VAD) aims to identify the regions of an audio stream that contain speech
- WhisperX uses VAD to pre-segment the input audio
  1. VAD is cheaper than ASR and avoids unnecessary forward passes over long regions without speech
  2. Slicing the audio into chunks whose boundaries fall in regions without speech minimizes boundary effects and enables parallelized transcription
  3. The speech boundaries provided by the VAD model constrain the word-level alignment task to local segments and remove the reliance on Whisper timestamps
    - Whisper timestamps are unreliable
- In general, VAD can be formulated as a sequence labelling task
  1. Let the input audio waveform be represented as a sequence of acoustic feature vectors extracted per time step, $\mathbf{A}=\{a_{1},a_{2},\ldots,a_{T}\}$, and let the output be a binary label sequence $\mathbf{y}=\{y_{1},y_{2},\ldots,y_{T}\}$
    - $y_{t}=1$ if there is speech at time step $t$, and $y_{t}=0$ otherwise
  2. The VAD model $\Omega_{V}:\mathbf{A}\rightarrow \mathbf{y}$ is instantiated as a neural network, and its output predictions $y_{t}\in[0,1]$ are post-processed with a binarize step
    - The binarize step consists of a smoothing stage (onset/offset thresholds) and a decision stage (min. duration on/off)
  3. The binary predictions are then represented as a sequence of active speech segments $\mathbf{s}=\{s_{1},s_{2},\ldots,s_{N}\}$, each with start/end indices
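
The binarize step above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name `binarize`, the default threshold values, and the frame duration `frame_sec` are all assumptions. The smoothing stage is modelled as hysteresis thresholding (speech starts when the score rises to `onset` and ends when it falls below `offset`), and the decision stage fills short gaps and drops short segments.

```python
from typing import List, Tuple

def binarize(scores: List[float], frame_sec: float = 0.02,
             onset: float = 0.5, offset: float = 0.35,
             min_on: float = 0.1, min_off: float = 0.1) -> List[Tuple[float, float]]:
    """Turn per-frame speech scores y_t in [0, 1] into (start, end) segments in seconds.

    All default values are illustrative assumptions.
    """
    segments = []
    active, start = False, 0
    # Smoothing stage: hysteresis with separate onset/offset thresholds.
    for t, p in enumerate(scores):
        if not active and p >= onset:
            active, start = True, t
        elif active and p < offset:
            segments.append((start * frame_sec, t * frame_sec))
            active = False
    if active:  # speech still active at the end of the audio
        segments.append((start * frame_sec, len(scores) * frame_sec))

    # Decision stage, part 1: fill non-speech gaps shorter than min_off.
    merged: List[Tuple[float, float]] = []
    for seg in segments:
        if merged and seg[0] - merged[-1][1] < min_off:
            merged[-1] = (merged[-1][0], seg[1])
        else:
            merged.append(seg)
    # Decision stage, part 2: drop speech segments shorter than min_on.
    return [(s, e) for s, e in merged if e - s >= min_on]
```

For example, with one-second frames, `binarize([0.1, 0.9, 0.8, 0.4, 0.2, 0.9, 0.9, 0.1], frame_sec=1.0, min_on=1.0, min_off=0.5)` keeps two separate segments, while a larger `min_off` of 2.0 fuses them into one.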

- VAD Cut & Merge

The active speech segments $\mathbf{s}$ have arbitrary lengths and can be much shorter or longer than the maximum input duration of the ASR model
- Longer segments cannot be transcribed in a single forward pass
  - The active speech segment length must therefore not exceed the maximum input duration of the ASR model
- The paper therefore applies a Min-Cut operation in the smoothing stage of the binary post-processing, which places an upper bound on the active speech segment duration
  1. Longer speech segments are cut at the point of minimum voice activation score
  2. So that the newly divided segments are not exceedingly short and keep sufficient context, the cut point is restricted to lie between $\frac{1}{2}|\mathcal{A}_{\text{train}}|$ and $|\mathcal{A}_{\text{train}}|$
    - $|\mathcal{A}_{\text{train}}|$ : maximum duration of the input audio (30 seconds for Whisper)
- Once the duration upper bound is set, short segments have to be considered next
  1. Transcribing brief speech segments removes the benefit of broader context
    - Transcribing many shorter segments also increases the number of forward passes and thus the total transcription time
  2. The paper therefore introduces a Merge operation after Min-Cut
    - Neighbouring segments are merged when their aggregate temporal span is below a maximal duration threshold $\tau$, where $\tau \leq |\mathcal{A}_{\text{train}}|$
  3. Empirically, $\tau=|\mathcal{A}_{\text{train}}|$ is optimal: it maximizes context during transcription and yields a segment duration distribution close to the one observed during training
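
A rough sketch of Cut & Merge, under stated assumptions: the helper names (`min_cut`, `merge`, `cut_and_merge`) are hypothetical, `scores` holds the per-frame VAD scores for the whole audio, and segments are `(start, end)` pairs in seconds. Here a segment is merged with its predecessor when the combined span does not exceed `tau`; the review words the condition as "below $\tau$", so the treatment of exact equality is a choice made for this sketch.

```python
from typing import List, Sequence, Tuple

Segment = Tuple[float, float]  # (start, end) in seconds

def min_cut(seg: Segment, scores: Sequence[float],
            frame_sec: float, max_dur: float) -> List[Segment]:
    """Split a segment longer than max_dur at the lowest-score frame located
    between max_dur/2 and max_dur after its start, then recurse on the rest."""
    start, end = seg
    if end - start <= max_dur:
        return [seg]
    lo = int(round((start + max_dur / 2) / frame_sec))
    hi = min(int(round((start + max_dur) / frame_sec)), len(scores) - 1)
    cut_frame = min(range(lo, hi + 1), key=lambda t: scores[t])
    cut = cut_frame * frame_sec
    return [(start, cut)] + min_cut((cut, end), scores, frame_sec, max_dur)

def merge(segments: List[Segment], tau: float) -> List[Segment]:
    """Greedily merge neighbouring segments while the merged span stays within tau."""
    merged: List[Segment] = []
    for start, end in segments:
        if merged and end - merged[-1][0] <= tau:
            merged[-1] = (merged[-1][0], end)
        else:
            merged.append((start, end))
    return merged

def cut_and_merge(segments: List[Segment], scores: Sequence[float],
                  frame_sec: float = 0.02, max_dur: float = 30.0,
                  tau: float = 30.0) -> List[Segment]:
    """Apply Min-Cut to every segment, then Merge the resulting pieces."""
    pieces = [p for seg in segments for p in min_cut(seg, scores, frame_sec, max_dur)]
    return merge(pieces, tau)
```

With `max_dur = tau = 30.0`, every output segment is at most 30 seconds long, and short neighbours are packed together so that each forward pass sees as much context as possible.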

- Whisper Transcription

The resulting speech segments have a duration approximately equal to the model input size ($|s_{i}|\approx |\mathcal{A}_{\text{train}}|\,\, \forall i\in N$)
- Because their boundaries do not lie on active speech, the segments can be transcribed efficiently in parallel with Whisper $\Omega_{W}$, which outputs the text for each audio segment ($\Omega_{W} : \mathbf{s}\rightarrow \mathcal{T}$)
- Parallel transcription has to be performed without conditioning on the previous text
  1. Otherwise, causal conditioning would break the independence assumption between the samples in a batch
  2. In practice, conditioning on the previous text also makes the model more prone to hallucination and repetition
- In addition, the paper uses Whisper's no-timestamp decoding method
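
The batching idea can be sketched like this. It is only an illustration: `make_batches` and `transcribe_segments` are hypothetical helpers, and `transcribe_batch` is a placeholder callable standing in for a Whisper forward pass that decodes each row independently (no previous-text conditioning, no timestamp tokens). It is not a real library API. The 16 kHz sample rate and 30-second window match Whisper's input format.

```python
from typing import Callable, List, Sequence, Tuple
import numpy as np

SAMPLE_RATE = 16_000  # Whisper operates on 16 kHz audio
MAX_SEC = 30.0        # |A_train| for Whisper

def make_batches(audio: np.ndarray, segments: Sequence[Tuple[float, float]],
                 batch_size: int = 8, sample_rate: int = SAMPLE_RATE,
                 max_sec: float = MAX_SEC) -> List[np.ndarray]:
    """Slice the waveform into segments, zero-pad each to max_sec, and stack into batches."""
    n = int(max_sec * sample_rate)
    chunks = []
    for start, end in segments:
        clip = audio[int(start * sample_rate): int(end * sample_rate)][:n]
        chunks.append(np.pad(clip, (0, n - len(clip))))
    return [np.stack(chunks[i:i + batch_size]) for i in range(0, len(chunks), batch_size)]

def transcribe_segments(audio: np.ndarray, segments: Sequence[Tuple[float, float]],
                        transcribe_batch: Callable[[np.ndarray], List[str]],
                        batch_size: int = 8, sample_rate: int = SAMPLE_RATE,
                        max_sec: float = MAX_SEC) -> List[Tuple[Tuple[float, float], str]]:
    """Run the (hypothetical) batched model and pair each segment with its text.

    Rows within a batch are independent, so their order carries no context.
    """
    texts: List[str] = []
    for batch in make_batches(audio, segments, batch_size, sample_rate, max_sec):
        texts.extend(transcribe_batch(batch))
    return list(zip(segments, texts))
```

Because no row depends on the text of an earlier row, the batches could in principle be processed in any order or on several devices.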

- Forced Phoneme Alignment

For each audio segment $s_{i}$ and its text transcription $\mathcal{T}_{i}=[w_{0},w_{1},\ldots,w_{m}]$, a sequence of words, WhisperX aims to estimate the start/end time of each word
- To do so, the paper uses a Phoneme Recognition model trained to classify the smallest unit of speech that distinguishes one word from another
  1. Let $\mathcal{C}=\{c_{1},c_{2},\ldots,c_{K}\}$ be the set of phoneme classes of the model
  2. Given an input audio segment $S$, the phoneme classifier outputs a logits matrix $L\in\mathbb{R}^{K\times T}$
    - $T$ : number of output frames, which depends on the temporal resolution of the phoneme model
- For each segment $s_{i}\in\mathbf{s}$ and its text $\mathcal{T}_{i}$:
  1. Extract the unique set of phoneme classes in the segment text $\mathcal{T}_{i}$ that are common to the phoneme model, $\mathcal{C}_{\mathcal{T}_{i}}\subset \mathcal{C}$
  2. Perform phoneme classification on the input segment $s_{i}$, restricted to the classes in $\mathcal{C}_{\mathcal{T}_{i}}$
  3. Apply Dynamic Time Warping (DTW) to the resulting logits matrix $L_{i}\in\mathbb{R}^{|\mathcal{C}_{\mathcal{T}_{i}}|\times T}$ to obtain the optimal temporal path of the phonemes in $\mathcal{T}_{i}$
  4. Obtain the start/end time of each word $w_{j}$ in $\mathcal{T}_{i}$ from the start time of its first phoneme and the end time of its last phoneme
- Transcript phonemes that are not in the phoneme model dictionary $\mathcal{C}$ are assigned the timestamp of the next nearest phoneme in the transcript
- The for loop over segments above can be batch processed in parallel, which enables fast transcription and word alignment of long-form audio
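
The alignment steps can be illustrated with a simplified DTW-style dynamic program. This is a sketch under assumptions, not the paper's code: `align` and `word_times` are hypothetical names, the input is a matrix of per-frame log-probabilities, every frame is assigned to exactly one transcript phoneme (no blank or silence class), and out-of-dictionary phonemes are not handled.

```python
from typing import List, Tuple
import numpy as np

def align(log_probs: np.ndarray, phoneme_ids: List[int]) -> List[int]:
    """Monotonic alignment of T frames to the transcript's phoneme sequence.

    log_probs: (K, T) per-frame log-probabilities over K phoneme classes.
    phoneme_ids: class index of each transcript phoneme, in order.
    Returns, for each frame, the position of its phoneme within phoneme_ids.
    """
    T, M = log_probs.shape[1], len(phoneme_ids)
    assert 0 < M <= T, "need at least one frame per phoneme"
    emit = log_probs[phoneme_ids, :]  # (M, T): keep only the transcript's classes
    score = np.full((M, T), -np.inf)
    score[0, 0] = emit[0, 0]
    for t in range(1, T):
        for m in range(min(M, t + 1)):
            stay = score[m, t - 1]
            move = score[m - 1, t - 1] if m > 0 else -np.inf
            score[m, t] = max(stay, move) + emit[m, t]
    # Backtrack from the last phoneme at the last frame.
    m, path = M - 1, [M - 1]
    for t in range(T - 1, 0, -1):
        if m > 0 and score[m - 1, t - 1] >= score[m, t - 1]:
            m -= 1
        path.append(m)
    return path[::-1]

def word_times(path: List[int], word_lengths: List[int],
               frame_sec: float) -> List[Tuple[float, float]]:
    """Word start = first frame of its first phoneme, word end = end of the last
    frame of its last phoneme. word_lengths[j] is the phoneme count of word j."""
    times, first = [], 0
    for n in word_lengths:
        last = first + n - 1
        frames = [t for t, m in enumerate(path) if first <= m <= last]
        times.append((frames[0] * frame_sec, (frames[-1] + 1) * frame_sec))
        first = last + 1
    return times
```

Restricting `emit` to the rows listed in `phoneme_ids` mirrors step 2 above: only classes that occur in the segment's transcript take part in the alignment.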

- Multi-Lingual Transcription and Alignment

WhisperX can also be applied to multilingual transcription
- The VAD model has to be robust across languages, and the alignment phoneme model has to be trained on the target language
- A multilingual phoneme recognition model is also an option, as it may generalize to languages unseen during training
  - This requires an additional mapping from language-independent phonemes to the phonemes of the target language

- Translation

Whisper additionally offers a translate mode that produces translated transcriptions
- Batched VAD-based transcription can also be used in the translation setting
- BUT, phoneme alignment is not possible because there is no phonetic audio-linguistic alignment between the speech and the translated transcript

- Word-Level Timestamps without Phoneme Recognition

The paper also explores extracting word-level timestamps directly from Whisper, without an external phoneme model, to remove the phoneme mapping and reduce inference overhead
- In practice, the alignment overhead is small, roughly $<10\%$ in speed
- Inferring timestamps from cross-attention scores, on the other hand, under-performs external phoneme alignment and is prone to timestamp inaccuracies

3. Experiments

- Settings

Dataset : AMI Meeting, Switchboard-1 Telephone, TEDLIUM-3, Kincaid46
Comparisons : Whisper, Wav2Vec 2.0

- Results

Word Segmentation Performance
- Overall, WhisperX performs best
- In particular, its low insertion error rate (IER) indicates that VAD Cut & Merge prevents hallucination

Long-Form Audio Transcription, Word Segmentation

Effect of VAD Chunking
- Batched transcription without VAD chunking degrades transcription quality because of boundary effects
- The Cut & Merge threshold $\tau$ works best when it equals the input duration Whisper was trained on, $|\mathcal{A}_{\text{train}}|=30$ seconds

Effect of Chosen Whisper and Alignment Models
- Larger Whisper models improve both precision and recall
- A bigger phoneme model, in contrast, has little effect
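
To make the precision/recall numbers concrete, here is one possible way to score word segmentation. The review does not spell out the matching rule, so everything here is an assumption: a predicted word counts as a true positive when an unmatched reference word has the same text and both its start and end lie within a tolerance `collar` (set to 0.2 seconds purely for illustration). The function name `segmentation_pr` is hypothetical.

```python
from typing import List, Tuple

Word = Tuple[str, float, float]  # (text, start, end) in seconds

def segmentation_pr(pred: List[Word], ref: List[Word],
                    collar: float = 0.2) -> Tuple[float, float]:
    """Precision and recall of predicted word segments against a reference.

    Matching rule (an assumption): same text, start and end both within
    `collar` seconds, each reference word matched at most once.
    """
    used = set()
    tp = 0
    for text, ps, pe in pred:
        for i, (rtext, rs, re_) in enumerate(ref):
            if i in used or rtext != text:
                continue
            if abs(ps - rs) <= collar and abs(pe - re_) <= collar:
                used.add(i)
                tp += 1
                break
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(ref) if ref else 0.0
    return precision, recall
```

Under this rule, hallucinated or repeated words lower precision, while missed or badly timed words lower recall.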
