[Paper Review] Speak, Read and Prompt: High-Fidelity Text-to-Speech with Minimal Supervision

feVeRin 2025. 1. 8. 16:31

Speak, Read and Prompt: High-Fidelity Text-to-Speech with Minimal Supervision

A multi-speaker text-to-speech model that can be trained with minimal supervision is needed
SPEAR-TTS
- Casts text-to-speech as a sequence-to-sequence task by combining two discrete speech representations: text to high-level semantic tokens (Reading), and semantic tokens to low-level acoustic tokens (Speaking)
- In particular, the Speaking module is trained on abundant audio-only data, and pretraining is combined with backtranslation to reduce the Reading component's dependence on parallel data
- Additionally, example prompting is introduced to control speaker identity, so the model generalizes without explicit speaker representations or labels
Paper (ACL 2023) : Paper Link

1. Introduction

Training a Text-to-Speech (TTS) system requires a substantial amount of parallel data
- BUT, high-quality TTS datasets covering diverse accents, demographics, etc. are very scarce, whereas audio-only data is easy to collect from online sources, podcasts, radio, etc.
  - Therefore, realistic TTS needs to exploit audio-only data while reducing its dependence on parallel data
- Recent textless modeling can be used for this purpose
  1. In particular, AudioLM models audio with two kinds of discrete tokens: high-level semantic tokens and low-level acoustic tokens
  2. That is, TTS can be treated as a sequence-to-sequence (seq2seq) task that translates text into semantic tokens and then semantic tokens into acoustic tokens
- With this approach, the supervision needed to learn the mapping from text to the intermediate semantic token representation (Reading) can be decoupled from the supervision needed to learn to generate speech (Speaking)
  1. The Reading stage relies on parallel text-audio data, but the audio tokens needed to train the Speaking component are produced by self-supervised audio models, so unlabeled data can be used
  2. As a result, the quality and diversity of the generated speech can be improved independently of the available parallel data

-> The paper therefore proposes SPEAR-TTS, which reduces the training supervision of a TTS system by using seq2seq modeling and audio-only data

SPEAR-TTS
- Combines BART-style pretraining with backtranslation to reduce the parallel supervision needed to train SPEAR-TTS
  - Casting each stage as a seq2seq problem allows standard Transformer models to be used
- For additional voice control, example prompting from textual language models is adopted
  1. That is, the speaking model is conditioned on an audio clip that represents the target voice, so that generation is steered toward the example voice
  2. As a result, a controllable multi-speaker TTS system can be built even from single-speaker parallel data

< Overall of SPEAR-TTS >

A multi-speaker TTS model that reduces training supervision through seq2seq modeling
As a result, it achieves better synthesis quality than prior work with only 15 minutes of parallel data

2. Discrete Speech Representations

We start from the two self-supervised audio representations used in AudioLM
- The two representations sit at opposite ends of the reconstruction quality-bitrate trade-off
- Acoustic tokens have a high bitrate, which enables high-fidelity audio generation, whereas semantic tokens have a low bitrate, which makes long-span coherence easier to capture

- Semantic Tokens

Semantic tokens aim to provide coarse, high-level conditioning for generating acoustic tokens
- Therefore, linguistic content should be salient, while para-linguistic information such as speaker identity and acoustic details should be removed
- To obtain such a representation, a self-supervised speech representation model such as w2v-BERT can be trained
- After training, $k$-means clustering is applied to the mean-variance normalized outputs of a specific layer
  - The paper uses the centroid indices as the discrete tokens
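The tokenization step described above (normalize the layer outputs, then take the index of the nearest $k$-means centroid) can be sketched as follows. This is only an illustration under stated assumptions: the frames are toy 2-D vectors rather than real w2v-BERT features, the centroids are hand-picked stand-ins for what $k$-means would produce, and all function names are hypothetical.

```python
import math

def fit_normalizer(frames):
    """Per-dimension mean and standard deviation over all frames."""
    n, dim = len(frames), len(frames[0])
    mean = [sum(f[d] for f in frames) / n for d in range(dim)]
    # Fall back to 1.0 for constant dimensions to avoid division by zero.
    std = [math.sqrt(sum((f[d] - mean[d]) ** 2 for f in frames) / n) or 1.0
           for d in range(dim)]
    return mean, std

def normalize(frame, mean, std):
    """Mean-variance normalize a single frame."""
    return [(x - m) / s for x, m, s in zip(frame, mean, std)]

def nearest_centroid(vec, centroids):
    """Index of the closest centroid under squared Euclidean distance."""
    dists = [sum((v - c) ** 2 for v, c in zip(vec, cen)) for cen in centroids]
    return dists.index(min(dists))

def semantic_tokens(frames, mean, std, centroids):
    """Map each frame to the index of its nearest centroid."""
    return [nearest_centroid(normalize(f, mean, std), centroids) for f in frames]

# Toy 2-D "layer outputs" (hypothetical values, not real model features).
frames = [[0.0, 0.0], [0.2, 0.0], [10.0, 10.0], [9.8, 10.0]]
mean, std = fit_normalizer(frames)
# Hypothetical centroids in normalized space, standing in for a k-means fit.
centroids = [[-1.0, -1.0], [1.0, 1.0]]
tokens = semantic_tokens(frames, mean, std, centroids)
```

In this toy setup the first two frames fall into cluster 0 and the last two into cluster 1, so the utterance becomes a short sequence of integer tokens.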

- Acoustic Tokens

Acoustic tokens are a discrete audio representation that provides high-fidelity reconstruction of acoustic details
- For this, the SoundStream codec can be used, which compresses speech into a few discrete units with residual quantizers and reconstructs it
- In particular, to represent the hierarchy of residual quantizers in a single sequence, tokens from the different levels are interleaved and flattened
  - As a result, SoundStream converts audio into acoustic tokens and resynthesizes audio from acoustic tokens
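The interleaving of residual quantizer levels can be sketched as below. The frame-major order (all levels of frame 0, then all levels of frame 1, and so on) is an assumption made for illustration, and the function names and token values are hypothetical.

```python
def flatten_levels(codes):
    """codes[t][q] is the token of quantizer level q at frame t.

    Returns one flat sequence in which the levels of each frame are
    interleaved: frame 0 level 0, frame 0 level 1, ..., frame 1 level 0, ...
    """
    return [tok for frame in codes for tok in frame]

def unflatten_levels(flat, num_levels):
    """Inverse of flatten_levels: regroup a flat sequence into frames."""
    if len(flat) % num_levels != 0:
        raise ValueError("sequence length must be a multiple of num_levels")
    return [flat[i:i + num_levels] for i in range(0, len(flat), num_levels)]

# Two frames, three quantizer levels (toy token values).
codes = [[11, 21, 31], [12, 22, 32]]
flat = flatten_levels(codes)
restored = unflatten_levels(flat, num_levels=3)
```

Flattening multiplies the sequence length by the number of quantizer levels, which is one reason acoustic token sequences are much longer than semantic ones.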

3. SPEAR-TTS Overview

SPEAR-TTS extends AudioLM by enabling text as a form of conditioning
- Structurally, SPEAR-TTS consists of two stages:
  1. First Stage $\mathcal{S}_{1}$ : translates the text input into a sequence of discrete semantic tokens
  2. Second Stage $\mathcal{S}_{2}$ : maps semantic tokens to acoustic tokens, which are decoded into speech by the SoundStream decoder
- As a result, $\mathcal{S}_{1}$ learns to map text to the internal representation provided by semantic tokens (Reading), and $\mathcal{S}_{2}$ generates speech from this intermediate representation (Speaking)
- Using semantic tokens as the intermediate representation has the following advantages:
  1. Semantic tokens provide a high-level representation of speech
    - That is, learning the mapping from text transcripts to semantic tokens is easier than learning a direct mapping from text to acoustic tokens
  2. Both semantic and acoustic tokens are obtained from self-supervised models, so $\mathcal{S}_{2}$ can be trained on audio-only data
    - Audio-only data is more abundant than parallel data, which makes $\mathcal{S}_{2}$ easier to train
- Therefore, by separating $\mathcal{S}_{1}$ and $\mathcal{S}_{2}$, $\mathcal{S}_{1}$ can be pretrained with a denoising pretext task that operates on the succinct semantic tokens, while $\mathcal{S}_{2}$ is trained on audio-only data

4. $\mathcal{S}_{1}$: Improving Supervision Efficiency

The first stage $\mathcal{S}_{1}$ maps tokenized text to semantic tokens and can be trained on parallel text-semantic token data
- To this end, the paper takes a text-audio TTS dataset and extracts semantic tokens from the audio
- As a result, $\mathcal{S}_{1}$ reduces to a seq2seq task that can be implemented with an encoder-decoder or decoder-only Transformer architecture
- BUT, training a Transformer seq2seq model requires a substantial amount of parallel data
  - Therefore, SPEAR-TTS introduces target-domain pretraining and backtranslation

- Pretraining

The paper pretrains an encoder-decoder Transformer on a BART-like denoising task
- The model is given a corrupted version of the original semantic token sequence and is trained to reconstruct the original, uncorrupted sequence
  - This pretraining requires no parallel data and is carried out on a large audio-only dataset
- In general, corruption methods include random substitution, deletion, and masking of individual tokens or entire token spans
  - In SPEAR-TTS, deleting individual tokens independently at random works well
- After pretraining the model $\mathcal{P}$, it must be finetuned for the $\mathcal{S}_{1}$ task
  - For this, the upper layers of the encoder and all decoder parameters except those of the decoder-encoder cross-attention layers are frozen, and the lower layers of the encoder are updated
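The deletion-based corruption used for denoising pretraining can be sketched as below: each token is dropped independently with some probability, and the model's target is the untouched sequence. The deletion probability used here is a hypothetical value chosen for illustration, and the function names are not from the paper.

```python
import random

def delete_tokens(tokens, p_delete, rng):
    """Drop each token independently with probability p_delete."""
    return [t for t in tokens if rng.random() >= p_delete]

def make_denoising_pair(tokens, p_delete, seed=0):
    """Return (corrupted input, original target) for one training example."""
    rng = random.Random(seed)
    corrupted = delete_tokens(tokens, p_delete, rng)
    return corrupted, list(tokens)

# Toy semantic token sequence; p_delete=0.5 is an assumed value.
semantic = [4, 4, 9, 1, 7, 7, 3, 2]
corrupted, target = make_denoising_pair(semantic, p_delete=0.5, seed=0)
```

Because deletion only removes tokens, the corrupted input is always a subsequence of the target, and no text is involved at any point, which is why this step can run on audio-only data.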

- Backtranslation

The same text can be rendered as many different audio utterances by varying accent, prosody, emotional content, etc.
- This one-to-many relationship makes the text-to-speech problem highly asymmetric
  - With backtranslation, a speech-to-text model is trained on the available parallel data and then used to generate synthetic parallel data from an audio-only corpus
- In particular, thanks to the two-stage architecture, backtranslation in SPEAR-TTS can be implemented as translation between semantic tokens and text
  1. This avoids processing raw audio or long acoustic token sequences altogether, which reduces computational complexity
  2. The same semantic token-level pretraining can be reused when training the backward-direction model
- To obtain the backtranslation model, the encoder of the pretrained model $\mathcal{P}$ is frozen and only the decoder is finetuned
  1. This model is then used to transcribe audio-only data, and the synthetically generated parallel data is used to train the first stage of the TTS system
    - The first stage is obtained by finetuning another copy of $\mathcal{P}$
  2. After finetuning on the synthetic data, finetuning continues on the original parallel data
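The data flow of the backtranslation procedure can be sketched as follows, purely as an illustration. The backward model is replaced by a toy lookup table, and `build_synthetic_pairs` and `finetuning_schedule` are hypothetical helper names; only the ordering (synthetic data first, original parallel data last) comes from the description above.

```python
def build_synthetic_pairs(audio_only_semantic, backward_model):
    """Transcribe semantic-token sequences to get (text, semantic) pairs."""
    return [(backward_model(sem), sem) for sem in audio_only_semantic]

def finetuning_schedule(synthetic_pairs, parallel_pairs):
    """Forward model: finetune on synthetic data first, parallel data last."""
    return [("synthetic", synthetic_pairs), ("parallel", parallel_pairs)]

# Toy stand-in for the finetuned backward model (hypothetical lookup table).
toy_transcripts = {(3, 3, 7): "hi", (5, 1): "ok"}

def toy_backward_model(sem):
    return toy_transcripts[tuple(sem)]

# Semantic tokens extracted from audio-only data (toy values).
audio_only = [[3, 3, 7], [5, 1]]
# The small original parallel set (toy values).
parallel = [("yes", [2, 4])]

synthetic = build_synthetic_pairs(audio_only, toy_backward_model)
schedule = finetuning_schedule(synthetic, parallel)
```

Note that the backward model only ever sees semantic tokens, never raw audio or acoustic tokens, which is where the computational saving comes from.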

5. $\mathcal{S}_{2}$: Controlling the Generation Process

The second stage model $\mathcal{S}_{2}$ maps semantic tokens to acoustic tokens
- To train it, pairs of semantic and acoustic token sequences are extracted from each utterance in an audio-only dataset
  - A Transformer model is then trained to perform seq2seq translation between the two token sequences
- The second stage generates utterances with randomly varying tempo, recording conditions, etc., reproducing the distribution of characteristics observed in the training data
  - Since $\mathcal{S}_{1}$ and $\mathcal{S}_{2}$ are trained independently, speech diversity is preserved even when $\mathcal{S}_{1}$ is trained on a single-speaker dataset
- To control speaker voice characteristics, SPEAR-TTS combines two properties of AudioLM:
  1. Whenever the speech prefix is represented solely by semantic tokens, AudioLM generates continuations by sampling a different random voice each time
  2. On the other hand, when the conditioning also includes acoustic tokens, AudioLM maintains the voice characteristics captured by those acoustic tokens when generating the continuation
- The paper therefore explicitly incorporates this ability during training
  1. First, during training, two non-overlapping windows are randomly selected from each training example, and semantic and acoustic token sequences are computed for each; one window serves as the prompt and the other as the target
  2. The sequences are then concatenated in the order (a) semantic tokens of the prompt, (b) semantic tokens of the target, (c) acoustic tokens of the prompt, (d) acoustic tokens of the target
  3. When training $\mathcal{S}_{2}$, (a), (b), (c) are used as the prefix, and the model learns to generate the target acoustic tokens (d) while preserving the speaker identity captured by the prompt's acoustic tokens
    - At inference time, (a), (b), (c) are provided as input and (d) is generated autoregressively
  4. Additionally, special separator tokens are added to inform the model of the expected discontinuity
    - This prevents boundary artifacts
- Speech samples generated by $\mathcal{S}_{2}$ may contain background noise, so two methods are considered for controlling the noise level of the synthesized speech at inference time:
  1. For prompted generation, prompts containing cleaner speech can be selected
  2. With stochastic sampling, multiple sequences are generated for the same input, and the least noisy sample is selected with a no-reference audio quality metric
    - For this, the paper uses a MOS estimator such as DNSMOS
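The construction of a prompted $\mathcal{S}_{2}$ training example, i.e. prefix (a), (b), (c) and target (d), can be sketched as below. Several things here are assumptions for illustration only: semantic and acoustic tokens are treated as having the same frame rate, a separator is placed after each of the three prefix segments, and `SEP`, `pick_two_windows`, and `build_s2_example` are hypothetical names.

```python
import random

SEP = "<sep>"  # hypothetical separator symbol

def pick_two_windows(num_frames, window, rng):
    """Return (prompt_start, target_start) of two non-overlapping windows."""
    if num_frames < 2 * window:
        raise ValueError("utterance too short for two windows")
    slack = num_frames - 2 * window
    gap_before = rng.randint(0, slack)
    gap_between = rng.randint(0, slack - gap_before)
    first = gap_before
    second = gap_before + window + gap_between
    # Randomly decide which window acts as the prompt.
    return (first, second) if rng.random() < 0.5 else (second, first)

def build_s2_example(sem, ac, prompt_start, target_start, window):
    """Return (prefix, target): prefix = (a) + (b) + (c), target = (d)."""
    p = slice(prompt_start, prompt_start + window)
    t = slice(target_start, target_start + window)
    prefix = sem[p] + [SEP] + sem[t] + [SEP] + ac[p] + [SEP]
    return prefix, ac[t]

# Toy utterance: 8 frames of semantic ("s*") and acoustic ("a*") tokens.
sem = [f"s{i}" for i in range(8)]
ac = [f"a{i}" for i in range(8)]

prefix, target = build_s2_example(sem, ac, prompt_start=0, target_start=4, window=2)
# prefix: s0 s1 <sep> s4 s5 <sep> a0 a1 <sep>; target: a4 a5

random_starts = pick_two_windows(len(sem), 2, random.Random(0))
```

At inference time the same prefix layout would be used, with (b) coming from the text-to-semantic stage and (d) generated by the model instead of being read from the utterance.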

Controlling Generation with Example Prompting
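The sampling-based noise control described in Section 5 (generate several candidates for the same input and keep the one a no-reference quality estimator scores highest) can be sketched as below. The sampler and the quality function are toy stand-ins, not a real synthesis model or MOS estimator, and all names are hypothetical.

```python
def select_best(candidates, quality_fn):
    """Return the candidate with the highest estimated quality."""
    if not candidates:
        raise ValueError("need at least one candidate")
    return max(candidates, key=quality_fn)

def sample_and_select(sample_fn, quality_fn, num_samples):
    """Draw num_samples candidates and keep the best-scoring one."""
    candidates = [sample_fn(i) for i in range(num_samples)]
    return select_best(candidates, quality_fn)

# Toy stand-ins: each "sample" carries a made-up noise level.
toy_noise = [0.30, 0.05, 0.20]

def toy_sample(i):
    return {"id": i, "noise": toy_noise[i]}

def toy_quality(candidate):
    # Hypothetical score: higher means cleaner audio.
    return 1.0 - candidate["noise"]

best = sample_and_select(toy_sample, toy_quality, num_samples=3)
```

The cost of this approach grows linearly with the number of samples, so in practice the sample count trades synthesis time against the chance of finding a clean candidate.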

6. Experiments

- Settings

Dataset : LibriLight, LJSpeech, LibriTTS
Comparisons : FastSpeech2, YourTTS, VALL-E

- Results

Intelligibility and Supervision Efficiency
- The following training settings can be considered for $\mathcal{S}_{1}$:
  1. (a) : training from scratch on parallel data
  2. (b) : finetuning the pretrained checkpoint $\mathcal{P}$ on parallel data
  3. (c) : finetuning the pretrained checkpoint $\mathcal{P}$ to obtain the backtranslation model, then training the forward model from scratch on synthetically generated data
  4. (d) : similar to (c), but both the backward and forward models are obtained by finetuning $\mathcal{P}$
- In terms of intelligibility, as the amount of parallel data decreases, (a) shows a high error rate, whereas the pretraining in (b) lets SPEAR-TTS maintain a low character error rate (CER)
- The backtranslation in (c) and (d) also keeps the CER low as the amount of parallel data decreases
  - That is, using a fixed decoder appears to handle the noisy nature of the synthetic training data obtained through backtranslation effectively
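For reference, CER can be sketched as below, assuming the usual definition: character-level edit distance between a reference and a hypothesis transcript, divided by the reference length. How the hypothesis transcripts are obtained in the paper is not covered here, and the function names are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,              # delete r
                           cur[j - 1] + 1,           # insert h
                           prev[j - 1] + (r != h)))  # substitute or match
        prev = cur
    return prev[-1]

def cer(ref, hyp):
    """Character error rate relative to the reference length."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)

score = cer("hello", "hallo")  # one substitution over five characters
```

Because the denominator is the reference length, CER can exceed 1.0 when the hypothesis is much longer than the reference.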

Prompted Generation
- In the zero-shot scenario, SPEAR-TTS achieves a high top-1 accuracy of 92.4%

Compared with other zero-shot models, SPEAR-TTS also achieves high similarity with only 15 minutes of parallel data

Subjective Evaluation
- In terms of MOS, SPEAR-TTS performs best

For prompted generation as well, SPEAR-TTS achieves the highest MOS
