[Paper Review] ProsoSpeech: Enhancing Prosody with Quantized Vector Pre-training in Text-to-Speech

feVeRin 2024. 2. 20. 11:56

Prosody modeling in Text-to-Speech faces several difficulties
- The extracted pitch contains inevitable errors, which hurts prosody modeling
- Prosody attributes such as pitch, duration, and energy depend on one another
- Because prosody is highly variable, it is hard to fully shape the prosody distribution
ProsoSpeech
- Introduces quantized latent vectors pre-trained on low-quality text and speech data
- Applies a word-level Prosody Encoder that quantizes the low-frequency band of speech and compresses the prosody attributes into Latent Prosody Vectors (LPV)
- Uses an LPV predictor that predicts the LPVs given a word sequence
Paper (ICASSP 2022) : Paper Link

1. Introduction

Text-to-Speech (TTS) has advanced considerably, but controlling prosody is still difficult
- Prosody prediction-based methods extract prosody attributes such as pitch, duration, and energy, and predict them in predictor modules conditioned on the input linguistic features
- These existing prosody modeling approaches have several limitations
  1. An external tool is needed to extract the pitch contour
    - The extracted pitch contains inevitable errors such as wrong voiced/unvoiced (v/uv) decisions and inaccurate $F0$
    - As a result, the extracted pitch degrades prosody prediction and harms the optimization of the TTS model
  2. The prosody attributes (pitch, duration, energy) are modeled separately
    - The prosody attributes depend on one another and produce prosody jointly
    - Modeling them separately loses these relationships and yields unnatural prosody
  3. Prosody is highly variable and differs from speaker to speaker and from word to word
    - High-quality TTS data is quite limited, so it is hard to shape the full distribution of prosody

-> The paper therefore proposes ProsoSpeech, which improves prosody modeling with quantized latent vectors pre-trained on low-quality TTS data

ProsoSpeech
- Introduces a word-level prosody encoder that avoids pitch extraction errors and accounts for the dependency between prosody attributes
  - The low-frequency band of speech is quantized into word-level Latent Prosody Vectors (LPV) according to the word boundaries
  - A $k$-means cluster-based codebook initialization stabilizes vector quantization and prevents index collapse
- To model prosody, introduces an autoregressive LPV predictor conditioned on the word-level text sequence
- To shape the prosody distribution, the LPV predictor is pre-trained on a low-quality dataset and fine-tuned on a high-quality dataset
  - Expressive speech is finally generated from the predicted LPVs

< Overview of ProsoSpeech >

Introduces quantized latent vectors pre-trained on unpaired, low-quality text and speech data
Uses a word-level prosody encoder that quantizes the low-frequency band and extracts the prosody attributes into LPVs
Models natural prosody with the LPV predictor and synthesizes high-quality speech

2. Method

ProsoSpeech builds on FastSpeech and adds a Word Encoder, a Prosody Encoder, and an autoregressive LPV Predictor
- During training,
  1. The input text sequence is converted into a phoneme sequence and a word sequence, which are encoded into linguistic features by the Phoneme Encoder and the Word Encoder, respectively
  2. The low-frequency band of the ground-truth mel-spectrogram is then encoded into quantized LPVs by the Prosody Encoder, conditioned on the linguistic features
  3. Finally, the linguistic features and the LPVs are passed to the Decoder to generate the predicted mel-spectrogram, and the model is optimized with Mean Squared Error (MSE) and SSIM losses
- In this process, the prosody-disentangled representation (LPV) is obtained by disentangling prosody from speech
  1. To predict the LPV sequence, an autoregressive LPV Predictor is trained with the word sequence as the condition
  2. In addition, a large-scale text and audio corpus is used to pre-train the LPV Predictor
- During inference,
  - No ground-truth mel-spectrogram is available as a reference, so the LPVs are predicted by the LPV Predictor and used to generate expressive speech

- Prosody Encoder

The Prosody Encoder disentangles prosody from speech using a word-level vector quantization bottleneck
- The Prosody Encoder has 2 levels, each of which is a convolution stack with ReLU activation and Layer Normalization
  - The first level compresses the mel-spectrogram into word-level hidden states according to the word boundaries
  - The second level post-processes the word-level hidden states
- Each hidden state is passed to an EMA-based vector quantization layer to obtain the word-level LPV sequence
  - Timbre (speaker identity) and speech content are already provided by the speaker embedding and the linguistic encoders (phoneme/word encoder), respectively,
  - so the vector quantization bottleneck leaves the LPVs with only speaker- and content-independent prosody information
- In addition, ProsoSpeech uses only the low-frequency band of the mel-spectrogram as input, which eases disentanglement
  - Compared with the full band, the low-frequency band contains almost complete prosody and less timbre/content information
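
The data flow through the Prosody Encoder can be sketched in a few lines. This is only a minimal illustration, assuming that mean pooling stands in for the two convolution levels and that a plain nearest-neighbor lookup stands in for the EMA-based vector quantization layer; the function names, the number of low-frequency bins, and the toy values are all hypothetical and not taken from the paper.

```python
from typing import List, Sequence, Tuple

Frame = List[float]


def low_band(mel: Sequence[Frame], n_bins: int) -> List[Frame]:
    """Keep only the first n_bins (lowest-frequency) mel bins of every frame."""
    return [list(frame[:n_bins]) for frame in mel]


def pool_by_words(frames: Sequence[Frame],
                  boundaries: Sequence[Tuple[int, int]]) -> List[Frame]:
    """Average the frames inside each [start, end) word span.

    Mean pooling is a stand-in (assumption) for the learned convolution stacks.
    """
    pooled = []
    for start, end in boundaries:
        span = frames[start:end]
        dim = len(span[0])
        pooled.append([sum(f[d] for f in span) / len(span) for d in range(dim)])
    return pooled


def quantize(vectors: Sequence[Frame], codebook: Sequence[Frame]) -> List[int]:
    """Return, for each vector, the index of its nearest codebook entry."""
    def sq_dist(a: Frame, b: Frame) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return [min(range(len(codebook)), key=lambda k: sq_dist(v, codebook[k]))
            for v in vectors]


# Toy mel-spectrogram: 5 frames x 3 bins; the third bin plays the "high band".
mel = [
    [0.0, 0.0, 9.0],
    [0.2, 0.0, 9.0],
    [1.0, 1.0, 9.0],
    [1.0, 0.8, 9.0],
    [1.0, 1.2, 9.0],
]
word_boundaries = [(0, 2), (2, 5)]    # two words: frames 0-1 and frames 2-4
codebook = [[0.0, 0.0], [1.0, 1.0]]   # hypothetical 2-entry codebook

word_states = pool_by_words(low_band(mel, n_bins=2), word_boundaries)
lpv_indices = quantize(word_states, codebook)  # one codebook index per word
```

The output has exactly one index per word, which is what lets the LPV sequence line up with the word sequence later on.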
In practice, the Prosody Encoder needs many training steps before it can extract prosody information from word-level mel-spectrogram clips
- In the early stage of training, the hidden states before vector quantization are therefore noisy and meaningless
  - In this situation the Prosody Encoder can suffer from index collapse
  - That is, only a few embedding vectors lie close to the encoder outputs, so the model uses only a limited number of vectors from the codebook $e$
  - As a result, index collapse severely limits the expressive ability of the encoder
- To address this, a $k$-means cluster-based centroid initialization is introduced:
  1. For the first $20k$ steps, the vector quantization layer is removed so that the Prosody Encoder extracts prosody information without a bottleneck
  2. After $20k$ steps, the codebook of the vector quantization layer is initialized with the $k$-means cluster centers
  3. After initialization, the vector quantization layer is added back as the prosody bottleneck
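
The initialization in the steps above can be sketched as follows. This is a minimal sketch under stated assumptions: it uses a plain Lloyd-style $k$-means with a deterministic strided initialization, and `kmeans`, `codebook_usage`, and the toy hidden states are hypothetical names and values chosen for illustration, not the paper's implementation.

```python
from typing import List, Sequence

Vector = List[float]


def sq_dist(a: Vector, b: Vector) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(v: Vector, centers: Sequence[Vector]) -> int:
    """Index of the center closest to v."""
    return min(range(len(centers)), key=lambda k: sq_dist(v, centers[k]))


def kmeans(points: Sequence[Vector], k: int, n_iters: int = 20) -> List[Vector]:
    """Plain Lloyd iterations; initial centers are evenly strided points (assumption)."""
    centers = [list(points[i * len(points) // k]) for i in range(k)]
    for _ in range(n_iters):
        clusters: List[List[Vector]] = [[] for _ in range(k)]
        for p in points:
            clusters[nearest(p, centers)].append(p)
        for j, members in enumerate(clusters):
            if members:  # an empty cluster keeps its previous center
                dim = len(members[0])
                centers[j] = [sum(m[d] for m in members) / len(members)
                              for d in range(dim)]
    return centers


def codebook_usage(points: Sequence[Vector], codebook: Sequence[Vector]) -> float:
    """Fraction of codebook entries selected by at least one point."""
    return len({nearest(p, codebook) for p in points}) / len(codebook)


# Hidden states collected during the warm-up phase without quantization (toy values).
hidden = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0],
          [10.0, 10.0], [10.0, 11.0], [11.0, 10.0], [11.0, 11.0]]

# A codebook that ignores the data: every state maps to the first entry.
naive_codebook = [[5.0, 5.0], [100.0, 100.0]]
naive_usage = codebook_usage(hidden, naive_codebook)

# Codebook initialized from the k-means cluster centers of the hidden states.
kmeans_codebook = kmeans(hidden, k=2)
kmeans_usage = codebook_usage(hidden, kmeans_codebook)
```

In this toy case the naive codebook uses only half of its entries, a small-scale picture of index collapse, while the $k$-means codebook places one entry on each cluster and uses all of them.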

- Latent Prosody Vector Predictor (LPV Predictor)

Since the Prosody Encoder provides a prosody representation, prosody can be modeled from the LPV sequence
- The LPV Predictor adopts a self-attention-based autoregressive architecture
  - Its role is to predict the word-level LPV sequence from the text input
- The LPV sequence has the same length as the word sequence, so word-level context features are used as the condition
  - This condition is encoded by the context encoder of the LPV predictor
- The LPV predictor is trained in teacher-forcing mode and predicts the LPVs autoregressively at inference
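
The difference between teacher forcing and autoregressive prediction can be illustrated with a toy example. It is a sketch only: `BOS`, `toy_step`, and the integer "context features" are hypothetical stand-ins for the start token, the trained self-attention predictor, and the context encoder output.

```python
from typing import Callable, List, Sequence

BOS = -1  # hypothetical start-of-sequence LPV index


def teacher_forcing_inputs(target_lpvs: Sequence[int]) -> List[int]:
    """Training-time decoder inputs: ground-truth LPVs shifted right by one."""
    return [BOS] + list(target_lpvs[:-1])


def autoregressive_decode(word_context: Sequence[int],
                          step_fn: Callable[[int, Sequence[int]], int]) -> List[int]:
    """Inference: predict one LPV per word, feeding back earlier predictions."""
    history = [BOS]
    for ctx in word_context:
        history.append(step_fn(ctx, history))
    return history[1:]


def toy_step(ctx: int, history: Sequence[int]) -> int:
    """Stand-in for the trained predictor (assumption): mixes context and previous LPV."""
    prev = 0 if history[-1] == BOS else history[-1]
    return (ctx + prev) % 4


word_context = [1, 2, 3]                      # one context feature per word
predicted = autoregressive_decode(word_context, toy_step)
shifted = teacher_forcing_inputs(predicted)   # what training would feed in
```

Both sequences have one entry per word. During training every step sees the ground-truth history at once, whereas at inference each step has to wait for the previous prediction.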

- Pre-training and Fine-tuning

The LPV predictor can model the prosody representation, but its accuracy can suffer for the following reasons
- The text training data in TTS datasets is insufficient, so context understanding is limited and the connection between prosody and text is hard to capture
- The speech/prosody training data is not large enough, so the prosody space becomes sparse and the prosody distribution is hard to estimate
To address this, a pre-training method that uses both additional pure text data and low-quality speech data is introduced
- For text pre-training, the context encoder of the LPV predictor is trained with BERT-like mask prediction using a masking probability of 0.15
- For low-quality audio pre-training, the LPV predictor is pre-trained on LPV sequences encoded from noisy audio
- After these pre-training stages, the LPV predictor is fine-tuned on the high-quality TTS dataset
- The final training pipeline therefore consists of
  1. Training the FastSpeech-based TTS model,
  2. Pre-training the context encoder with unpaired text,
  3. Pre-training the LPV predictor with low-quality speech,
  4. Fine-tuning the LPV predictor on the high-quality TTS dataset
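
The masking used in the text pre-training stage can be sketched as below. This is a simplified illustration: it only replaces a word with a mask token with probability 0.15 and records the original word as the prediction target. The `MASK` token, the function name, and the example sentence are hypothetical, and the additional random/keep replacement rules of the original BERT recipe are left out.

```python
import random
from typing import List, Optional, Sequence, Tuple

MASK = "[MASK]"  # hypothetical mask token


def mask_words(words: Sequence[str],
               mask_prob: float = 0.15,
               seed: int = 0) -> Tuple[List[str], List[Optional[str]]]:
    """Return (inputs, targets); targets are None at positions that were not masked."""
    rng = random.Random(seed)
    inputs: List[str] = []
    targets: List[Optional[str]] = []
    for word in words:
        if rng.random() < mask_prob:
            inputs.append(MASK)
            targets.append(word)   # the context encoder must recover this word
        else:
            inputs.append(word)
            targets.append(None)   # no loss at unmasked positions
    return inputs, targets


sentence = "the quick brown fox jumps over the lazy dog".split()
masked_inputs, mask_targets = mask_words(sentence, mask_prob=0.15, seed=0)
```

The loss is computed only at positions whose target is not `None`, so unpaired text alone is enough to train the context encoder.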

3. Experiments

- Settings

Dataset : Mandarin dataset (Internal)
Comparisons : FastSpeech, FastSpeech2

- Results

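In terms of MOS, ProsoSpeech achieves better synthesis quality than the existing TTS models
- In terms of pitch accuracy, audio generated by ProsoSpeech has the smallest $D_{pit}$, which suggests that ProsoSpeech models prosody effectively
- In terms of duration accuracy, the ProsoSpeech predictions are also the closest to the ground-truth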
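
This review does not spell out how $D_{pit}$ is computed. As a hedged illustration only, assuming it is a dynamic time warping (DTW) distance between the pitch contours of ground-truth and synthesized speech, a minimal DTW sketch looks like this; the function name and the toy contours are hypothetical, and the per-frame cost is an arbitrary choice (absolute difference).

```python
from typing import List, Sequence


def dtw_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Accumulated absolute-difference cost along the best monotonic alignment."""
    n, m = len(a), len(b)
    inf = float("inf")
    cost: List[List[float]] = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of: step in a only, step in b only, step in both.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


reference = [1.0, 2.0, 3.0]           # toy ground-truth pitch contour
stretched = [1.0, 2.0, 2.0, 3.0]      # same shape, different timing
shifted = [2.0, 3.0, 4.0]             # same timing, higher pitch

same_shape = dtw_distance(reference, stretched)
higher_pitch = dtw_distance(reference, shifted)
```

A contour that differs only in timing gets zero distance, while a contour with a different pitch level does not, which is why an alignment-based distance suits utterances whose durations differ.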

The ablation study on ProsoSpeech shows that
- Both text and speech pre-training improve pitch/duration accuracy
- The $k$-means initialization of the prosody encoder is effective for improving prosody
- Using a codebook smaller than the codebook size of 128 can degrade performance
