[Paper Review] Textually Pretrained Speech Language Models

feVeRin 2024. 3. 31. 09:10

Textually Pretrained Speech Language Models

Speech language models process and generate acoustic data alone, without textual supervision
Textually Warm Initialized Speech Transformer (TWIST)
- Trains a speech language model from a warm start provided by a pretrained textual language model
- Presents the largest speech language model in terms of both parameter count and training data
Paper (NeurIPS 2023) : Paper Link

1. Introduction

Speech carries more information than its textual content alone, yet most spoken language understanding systems are restricted to the textual form
- In particular, despite steady progress in speech language modeling and speech synthesis, text remains the most dominant language modality on the web
  - This constrains the construction of SpeechLMs, in contrast to text language models (TextLMs)
- Meanwhile, large textual LMs are trained on massive text corpora and therefore support a wide range of tasks effectively
  - This suggests using a TextLM to improve a SpeechLM
- The two kinds of LM operate at completely different levels of granularity, but because speech and text are closely related, transferring a model across these modalities can be reasonable

-> Hence the paper proposes the Textually Warm Initialized Speech Transformer (TWIST), which initializes a SpeechLM from a pretrained textual LM

TWIST
- A simple approach that leverages a textual LM and yields consistent improvements in both subjective and quantitative evaluations
- Presents the largest SpeechLM, with 13B parameters and 150k hours of speech training data
- Additionally releases a StoryCloze benchmark for evaluating the modeling of long contextual spoken sentences

< Overview of TWIST >

TWIST is a speech transformer in which a SpeechLM is warm-initialized from a TextLM
It trains the largest SpeechLM in terms of parameter count and training data
As a result, it shows strong performance and context-capturing ability

2. Using Textual LMs to Improve SpeechLMs

- Background

TWIST follows the Generative Spoken Language Modeling (GSLM) framework
- The GSLM pipeline consists of three main modules: a speech tokenizer, a SpeechLM, and a vocoder
  - Each module is trained separately
- If the language model is bypassed and the quantized tokens are fed directly to the vocoder module, speech resynthesis is possible
Speech Tokenizers
- The speech tokenizer encodes raw speech into a discrete representation
  - The usual approach is to encode speech into a continuous representation and then quantize that representation to obtain a discrete token sequence
- Let $\mathcal{X}\subset \mathbb{R}$ be the domain of audio samples; a raw signal is then represented as a sample sequence $x =(x_{1},...,x_{T})$
  - $x_{t} \in \mathcal{X}, \,\,\, \forall \, 1 \leq t \leq T$
- The encoder network $f$ takes a speech utterance as input and outputs a sequence of spectral representations sampled at a low frequency, $f(x)=(v_{1},...,v_{T'})$
  1. $T'$ is determined by the frame rate of the encoder, and no assumption is made about the architecture of the encoder network $f$
  2. Various encoders can therefore be used, such as Contrastive Predictive Coding, wav2vec, and HuBERT
    - The paper uses HuBERT
- Since the representations learned by the encoder network are generally continuous, the $k$-means algorithm is applied to the model's output to obtain discrete tokens $z=(z_{1},...,z_{T'})$
  - Each element $z_{i}$ of $z$ is a positive integer with $z_{i} \in \{1,..., K\}, \,\, 1 \leq i \leq T'$
  - $K$ : number of discrete tokens in the vocabulary $\mathcal{Z}=\{1,...,K\}$
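The quantization step above can be sketched in a few lines. This is a minimal pure-Python illustration under stated assumptions, not the paper's pipeline: the 2-D `features` stand in for HuBERT frame outputs, and `fit_kmeans`, `quantize`, and the choice of $K=2$ are hypothetical names and values picked only for the example.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest_centroid(frame, centroids):
    """Return the 1-based index of the centroid closest to `frame`."""
    distances = [squared_distance(frame, c) for c in centroids]
    return distances.index(min(distances)) + 1

def fit_kmeans(frames, k, iterations=20, seed=0):
    """Plain Lloyd's k-means; returns k centroids."""
    rng = random.Random(seed)
    centroids = [list(f) for f in rng.sample(frames, k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for frame in frames:
            clusters[nearest_centroid(frame, centroids) - 1].append(frame)
        for j, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster is empty
                dim = len(members[0])
                centroids[j] = [sum(m[d] for m in members) / len(members)
                                for d in range(dim)]
    return centroids

def quantize(frames, centroids):
    """Map continuous frames (v_1..v_T') to discrete tokens (z_1..z_T') in {1..K}."""
    return [nearest_centroid(f, centroids) for f in frames]

# Toy "encoder output": two well-separated groups of 2-D frames (assumed data).
features = [[0.0, 0.1], [0.1, 0.0], [5.0, 5.1], [5.1, 5.0], [0.05, 0.05]]
centroids = fit_kmeans(features, k=2)
tokens = quantize(features, centroids)
print(tokens)  # frames from the same group share a token id
```

Which group receives id 1 and which receives id 2 depends on the random initialization, so only the grouping is meaningful, not the label itself.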
Language Models
- A language model learns the joint probability $p(w_{1},...,w_{n})$ of a token sequence
  - Each token $w_{i}$ belongs to a vocabulary $\mathcal{W}$ defined by the tokenizer
- Applying the chain rule, the joint probability of the sequence can be written as a product of conditional probabilities:
  (Eq. 1) $p(w_{1},...,w_{n})=\prod_{i=1}^{n}p(w_{i}|w_{i-1},...,w_{1})$
- A neural LM parameterized by $\theta$ aims to model the probability $p_{\theta}(w_{i}|c(w_{i-1},...,w_{1}))$
  - The network parameters $\theta$ are learned by minimizing the negative log likelihood between the predicted and true distributions:
  (Eq. 2) $\ell(\theta,w)=-\sum_{i=1}^{n}\log p_{\theta}(w_{i}|c(w_{i-1},...,w_{1}))$
  - $c$ : context built from the previous tokens, $\theta$ : usually initialized with values sampled from a pre-defined distribution such as a centered Gaussian
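As a small worked example of Eq. 1 and Eq. 2, the sketch below computes the negative log likelihood of a three-token sequence from its per-token conditional probabilities, and the corresponding perplexity (the exponential of the average per-token NLL, the PPL metric reported in the experiments). The probabilities in `probs` and the helper names `sequence_nll` and `perplexity` are assumptions made for illustration; a real model would produce these probabilities from its softmax output.

```python
import math

def sequence_nll(conditional_probs):
    """Eq. 2: -sum_i log p(w_i | w_{i-1}, ..., w_1) for one sequence."""
    return -sum(math.log(p) for p in conditional_probs)

def perplexity(conditional_probs):
    """Exponential of the average per-token negative log likelihood."""
    return math.exp(sequence_nll(conditional_probs) / len(conditional_probs))

# Hypothetical probabilities a model assigned to each observed token.
probs = [0.5, 0.25, 0.5]

nll = sequence_nll(probs)   # equals log(16)
joint = math.exp(-nll)      # Eq. 1: product of the conditionals, 1/16
ppl = perplexity(probs)     # 16 ** (1/3)

print(f"NLL={nll:.4f}  joint={joint:.4f}  PPL={ppl:.4f}")
```

Minimizing the NLL is the same as maximizing the joint probability of Eq. 1, which is why lower perplexity indicates a better fit to the token sequence.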
Speech Language Models (Speech LMs)
- A SpeechLM is trained on the discrete speech tokens $z$ extracted by the speech tokenizer
  - By using $z$, the SpeechLM can model spoken data without access to textual transcriptions
- This modeling framework can also capture prosodic features, speaker identity, natural dialogue, and so on
Token-to-Speech Module
- The token-to-speech module converts discrete tokens into a raw waveform
  - Earlier work mostly used WaveGlow on top of Tacotron2
- Meanwhile, a HiFi-GAN-based token-based vocoder supports efficient, high-quality synthesis
  - The paper therefore adopts HiFi-GAN as the token-to-speech module

Generative Spoken Language Modeling Pipeline

- Textually Warm-Initialized Speech Transformer Language Models

TWIST trains a SpeechLM that is initialized from a pretrained TextLM such as OPT or LLaMA
- To do so, TWIST
  1. First replaces the original text vocabulary $\mathcal{W}$ with the speech token set $\mathcal{Z}$ and sets the tokenizer to a speech-based tokenizer
  2. Then replaces the text lookup table with a randomly initialized embedding table for the speech tokens
    - The rest of the network is left unchanged at initialization time
  3. Finally trains the entire SpeechLM on speech data
- Speech tokens operate on 20~40ms windows, whereas text tokenizers span longer concepts such as sub-words, so initializing a speech model from a textual model might seem ill-suited
  - BUT, in practice, with TWIST the SpeechLM does benefit from textual LM initialization
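The three initialization steps above can be sketched with a toy parameter dictionary. This is only an illustration under stated assumptions: a real implementation would operate on OPT or LLaMA weights in a deep learning framework, whereas here `make_toy_text_lm`, `warm_init_speech_lm`, the `"embedding"`/`"blocks"` keys, the Gaussian scale 0.02, and the sizes (1000 text tokens, 500 speech tokens) are all hypothetical. How the output projection is handled is not covered by this sketch.

```python
import copy
import random

def make_toy_text_lm(vocab_size, dim, seed=0):
    """Stand-in for a pretrained TextLM: a lookup table plus 'transformer' weights."""
    rng = random.Random(seed)
    return {
        "embedding": [[rng.gauss(0.0, 0.02) for _ in range(dim)]
                      for _ in range(vocab_size)],
        "blocks": [[rng.gauss(0.0, 0.02) for _ in range(dim)]
                   for _ in range(2)],
    }

def warm_init_speech_lm(text_lm, num_speech_tokens, seed=1):
    """Copy all pretrained weights, then swap in a fresh speech-token lookup table."""
    rng = random.Random(seed)
    speech_lm = copy.deepcopy(text_lm)          # rest of the network is unchanged
    dim = len(text_lm["embedding"][0])
    speech_lm["embedding"] = [                  # randomly initialized table for Z
        [rng.gauss(0.0, 0.02) for _ in range(dim)]
        for _ in range(num_speech_tokens)
    ]
    return speech_lm

text_lm = make_toy_text_lm(vocab_size=1000, dim=8)
speech_lm = warm_init_speech_lm(text_lm, num_speech_tokens=500)

print(len(text_lm["embedding"]), "->", len(speech_lm["embedding"]))
print("blocks preserved:", speech_lm["blocks"] == text_lm["blocks"])
```

After this step every parameter, including the copied blocks, would be trained on speech tokens. A cold-initialized baseline would instead draw `"blocks"` from the random initializer as well.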

3. Experiments

- Settings

Dataset : LibriSpeech (LS), LibriLight (LL), VoxPopuli, Spotify Podcasts, People dataset
Comparisons : HuBERT

- Results

SpeechLMs Benefit from Warm Initialization using TextLMs
- TWIST with warm initialization is compared against COLD-INIT with cold initialization, across different token frequencies and numbers of tokens
- TWIST performs better on every metric
- In particular, speech tokens with a larger downsampling factor improve the sWUGGY and sBLIMP results

Warm/Cold Initialization Performance by Number of Tokens and Downsampling Factor (Frequency)

Scaling Improves SpeechLMs
- Looking at how model and dataset scaling affect overall performance:
- A SpeechLM initialized with TWIST consistently outperforms one using COLD-INIT
- Increasing the dataset and model size also improves model performance
  - TWIST trained on only 10% of the dataset performs comparably to COLD-INIT trained on 100% of the dataset
- As a result, pretraining can improve sample efficiency on downstream tasks
TWIST Converges Faster
- Looking at how textual pretraining affects model convergence:
- A model using TWIST reaches the same perplexity as COLD-INIT in only $1/4$ of the steps

(a) Performance by Model Size (b) Performance by Training Step

Not All Warm Initializations are Equally Important
- When a different pretrained LM, ImageGPT, is used to initialize the SpeechLM:
- Unlike textual initialization, this approach performs even worse than COLD-INIT
Speech Large Language Models
- Initializing from LLaMA 7B/13B yields the large-scale SpeechLMs TWIST-7B/13B
- Building large-scale SpeechLMs brings further performance gains
  - ~8%/10% improvement in PPL and ~1.7%/2.5% improvement in sWUGGY

Spoken StoryCloze
- The Spoken StoryCloze benchmark is used to evaluate the contextual understanding of SpeechLMs
- TWIST performs better on continuation coherence (TSC) than on fine-grained relations (SSC)
- In particular, on the TSC benchmark there is a gap of about 15% relative to human performance

Human Evaluation
- Each model generates a speech continuation of ~10 seconds from a 3-second prompt
- In terms of MOS, TWIST-7B outperforms TWIST-1.3B and COLD-INIT-1.3B

Looking at example generations for pre-defined prompts
- COLD-INIT-1.3B makes grammatical errors and stays confined to the given topic
- In contrast, TWIST-7B can produce semantically richer continuations
