[Paper Review] GST-BERT-TTS: Prosody Prediction without Accentual Labels for Multi-Speaker TTS using BERT with Global Style Tokens

feVeRin 2025. 9. 1. 08:22

GST-BERT-TTS: Prosody Prediction without Accentual Labels for Multi-Speaker TTS using BERT with Global Style Tokens

In Text-to-Speech, prosody prediction is important for pitch-accent languages
GST-BERT-TTS
- Integrates the speaker-specific style embedding from Global Style Tokens into BERT's token embeddings
- Predicts speaker-aware fundamental frequency even in an accent-label-free setting, and extends $f_{0}$-BERT to improve speech expressiveness
Paper (INTERSPEECH 2025) : Paper Link

1. Introduction

Text-to-Speech (TTS) models such as FastSpeech2 typically convert text into a phoneme sequence and then generate a speech waveform through an acoustic model and a vocoder
- Incorporating BERT into a TTS model can improve text analysis and prosody prediction
  1. In pitch-accent languages such as Japanese in particular, prosody strongly affects intelligibility and naturalness
  2. Accent labels or hand-crafted rules can be applied for this purpose, but they do not effectively capture context-dependent accent changes
- Meanwhile, $f_{0}$-BERT can predict mora-level fundamental frequency $f_{0}$ from text input
  - BUT, that model focuses on a single speaker, so speaker-aware prosody modeling is difficult
- Alternatively, one-hot speaker ID embeddings can be used to handle speaker variability, as in YourTTS, but they are hard to generalize to unseen speakers
  1. Here, using Global Style Tokens (GST) yields a speaker-agnostic representation
  2. BUT, if the GST embedding is not properly controlled, the output can exhibit overly flat prosody

-> Therefore, the paper proposes GST-BERT-TTS, which integrates GST and BERT for speaker-aware prosody prediction

GST-BERT-TTS
- Integrates the speaker-specific style embedding from GST into BERT's token embeddings
- Additionally predicts $\log f_{0}$, energy, and duration to improve speech expressiveness

< Overview of GST-BERT-TTS >

A speaker-aware multi-speaker TTS model that combines GST and BERT
As a result, it achieves better performance than existing approaches

2. Method

GST-BERT-TTS extends the $\log f_{0}$-BERT framework by integrating the speaker-specific style embedding extracted by GST into the BERT embedding layer
1. Training TTS Module
  - First, the TTS model is trained on a multi-speaker speech dataset, and the GST module extracts a style embedding from the mel-spectrogram
    - This embedding is then added to the hidden representation
  - The variance adaptor predicts $\log f_{0}$, energy, and duration and regulates the length of the hidden states
    - The variance adaptor incorporates GST and thus receives speaker-adapted prosodic information
2. Training BERT Module
  - BERT is trained on text input to output the corresponding prosodic parameters ($\log f_{0}$, energy, duration)
  - For speaker-aware prosody prediction, the GST style embedding is incorporated into the BERT token embeddings
3. Integrated Inference
  - At inference, the trained BERT model predicts the prosodic parameters, which are fed into the variance adaptor to perform multi-speaker prosody generation without explicit accent labels
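
The length regulation and the inference-time override in the steps above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the names `length_regulate` and `variance_adaptor`, the use of integer frame counts as durations, and the plain-list representation of hidden states are all assumptions made for the example. Only the duration path is shown.

```python
from typing import List, Optional, Sequence

Vector = List[float]


def length_regulate(hidden: Sequence[Vector], durations: Sequence[int]) -> List[Vector]:
    """Repeat the i-th hidden vector durations[i] times (assumed frame counts)."""
    if len(hidden) != len(durations):
        raise ValueError("hidden and durations must have the same length")
    expanded: List[Vector] = []
    for vec, dur in zip(hidden, durations):
        if dur < 0:
            raise ValueError("durations must be non-negative")
        # Copy each repeat so that frames do not alias the same list object.
        expanded.extend(list(vec) for _ in range(dur))
    return expanded


def variance_adaptor(
    hidden: Sequence[Vector],
    default_durations: Sequence[int],
    bert_durations: Optional[Sequence[int]] = None,
) -> List[Vector]:
    """Use BERT-predicted durations when given, else the TTS model's own defaults."""
    durations = default_durations if bert_durations is None else bert_durations
    return length_regulate(hidden, durations)


# Hypothetical toy input: two token-level hidden vectors.
hidden_states = [[0.1, 0.2], [0.3, 0.4]]
frames = variance_adaptor(hidden_states, default_durations=[1, 1], bert_durations=[2, 3])
print(len(frames))
```

Passing `bert_durations=None` falls back to the defaults, which mirrors running the TTS module on its own.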

- Integration of GST into BERT Embeddings

For speaker-aware prosody prediction, the paper incorporates the GST-based style embedding into the BERT token embeddings
- Because the GST embedding and the BERT embedding differ in dimension, a linear layer is applied to transform the GST embedding $\mathbf{s}$ into a compatible dimension:
  (Eq. 1) $\mathbf{s}'=W_{s}\mathbf{s},\,\,\,\mathbf{s}'\in\mathbb{R}^{d_{e}}$
  - $W_{s}\in\mathbb{R}^{d_{e}\times d_{s}}$ : trainable weight matrix
- This transformation ensures that the GST embedding is properly aligned with the BERT token embeddings, so that BERT can exploit speaker-specific prosody information
- The transformed style embedding $\mathbf{s}'$ is then added to each token embedding $\mathbf{e}_{BERT}$ and fed into BERT:
  (Eq. 2) $\mathbf{e}'_{BERT}=\mathbf{e}_{BERT}+\mathbf{s}'$
- The style embedding $\mathbf{s}'$ is applied uniformly across the entire sequence, so that every input token shares the same speaker-adapted prosodic representation
  - In particular, the transformed GST embedding is applied before BERT's text processing, so that speaker-dependent features influence all subsequent predictions
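
Eq. 1 and Eq. 2 amount to one matrix-vector product followed by a broadcast addition over the sequence. The sketch below shows this with plain Python lists. The function names `project_style` and `add_style`, and the toy dimensions ($d_{s}=3$, $d_{e}=2$), are assumptions for illustration. A real implementation would use a trainable layer in a deep learning framework.

```python
from typing import List, Sequence

Vector = List[float]
Matrix = List[List[float]]


def project_style(W_s: Matrix, s: Sequence[float]) -> Vector:
    """Eq. 1: s' = W_s s, mapping a d_s-dim GST vector to d_e dims."""
    for row in W_s:
        if len(row) != len(s):
            raise ValueError("each row of W_s must have length d_s")
    return [sum(w * x for w, x in zip(row, s)) for row in W_s]


def add_style(token_embeddings: Sequence[Sequence[float]], s_proj: Sequence[float]) -> Matrix:
    """Eq. 2: e' = e + s', with the same s' added to every token embedding."""
    out: Matrix = []
    for emb in token_embeddings:
        if len(emb) != len(s_proj):
            raise ValueError("token embedding and s' must both have length d_e")
        out.append([e + sp for e, sp in zip(emb, s_proj)])
    return out


# Hypothetical toy values: d_s = 3, d_e = 2, sequence of two tokens.
W_s = [[1.0, 0.0, 0.0],
       [0.0, 1.0, 1.0]]
s = [0.5, 1.0, -1.0]
tokens = [[1.0, 2.0], [3.0, 4.0]]

s_proj = project_style(W_s, s)
styled = add_style(tokens, s_proj)
print(s_proj, styled)
```

Because the same `s_proj` is added at every position, the offset carries speaker information but no token-specific information, which matches the uniform application described above.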

- Prosody Parameter Prediction

The BERT encoder outputs a hidden representation $\mathbf{h}_{BERT}$ for each token
- Each character of the input sequence has a predicted prosodic parameter $\hat{\mathbf{y}}$
  - This includes $\log f_{0}$, energy, and duration
- The prediction is obtained through a dedicated linear layer:
  (Eq. 3) $\hat{\mathbf{y}}=\mathbf{W}\mathbf{h}_{BERT}$
  - $\mathbf{W}\in\mathbb{R}^{d_{p}\times d_{e}}$ : trainable weight that maps the BERT hidden representation to the prosodic parameter space
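
A minimal sketch of the per-token prediction head in Eq. 3 follows. It assumes $d_{p}=3$ with the output ordered as ($\log f_{0}$, energy, duration) and no bias term, as Eq. 3 is written. The name `predict_prosody`, the field ordering, and the toy weights are assumptions of this example and are not taken from the paper.

```python
from typing import Dict, List, Sequence

# Assumed ordering of the d_p = 3 output dimensions.
PROSODY_FIELDS = ("log_f0", "energy", "duration")


def predict_prosody(
    W: Sequence[Sequence[float]],
    hidden_states: Sequence[Sequence[float]],
) -> List[Dict[str, float]]:
    """Eq. 3: y_hat = W h for each token-level hidden state h."""
    if len(W) != len(PROSODY_FIELDS):
        raise ValueError("W must have one row per prosodic parameter")
    predictions: List[Dict[str, float]] = []
    for h in hidden_states:
        if any(len(row) != len(h) for row in W):
            raise ValueError("each row of W must have length d_e")
        y_hat = [sum(w * x for w, x in zip(row, h)) for row in W]
        predictions.append(dict(zip(PROSODY_FIELDS, y_hat)))
    return predictions


# Hypothetical toy weights with d_e = 2 and a single token.
W = [[1.0, 0.0],
     [0.0, 1.0],
     [1.0, 1.0]]
preds = predict_prosody(W, [[2.0, 3.0]])
print(preds)
```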

- Loss Function

GST-BERT-TTS uses a loss function that combines token classification with prosody parameter prediction:
(Eq. 4) $\mathcal{L}=\alpha_{t}\mathcal{L}_{token}+\alpha_{f}\mathcal{L}_{\log f_{0}}+\alpha_{e}\mathcal{L}_{energy}+\alpha_{d}\mathcal{L}_{duration}$
- $\alpha_{t},\alpha_{f},\alpha_{e},\alpha_{d}$ : weighting coefficients of the respective loss terms
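
The weighted sum in Eq. 4 can be sketched as below. The review does not state the form of the individual terms, so the `mse` helper for the prosody terms is an assumption, as are the function names, the dictionary keys, and the toy weight values.

```python
from typing import Dict, Sequence

LOSS_KEYS = ("token", "log_f0", "energy", "duration")


def mse(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared error (an assumed choice for the prosody terms)."""
    if len(pred) != len(target) or len(pred) == 0:
        raise ValueError("pred and target must be non-empty and equally long")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def combined_loss(losses: Dict[str, float], weights: Dict[str, float]) -> float:
    """Eq. 4: L = a_t*L_token + a_f*L_logf0 + a_e*L_energy + a_d*L_duration."""
    return sum(weights[key] * losses[key] for key in LOSS_KEYS)


# Hypothetical per-term losses and weights.
losses = {
    "token": 1.0,  # assumed to be computed elsewhere, e.g. by a classification loss
    "log_f0": mse([1.0, 3.0], [1.0, 1.0]),
    "energy": 4.0,
    "duration": 8.0,
}
weights = {"token": 1.0, "log_f0": 0.5, "energy": 0.25, "duration": 0.125}
total = combined_loss(losses, weights)
print(total)
```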

- Style Embedding Extraction and Application

The GST embedding is pre-trained in the TTS module
- The speaker-specific embedding is obtained as the average of all utterance-level GST vectors of that speaker:
  (Eq. 5) $\mathbf{s}^{(i)}=\frac{1}{N_{i}}\sum_{j=1}^{N_{i}}\mathbf{s}_{j}^{(i)}$
  - $N_{i}$ : number of utterances of speaker $i$
- At inference, the predicted prosody parameters ($\log f_{0}$, energy, duration) are applied to the variance adaptor, replacing the default parameters generated within the TTS model
- Meanwhile, because BERT is pre-trained on large-scale textual data, it can bring rich linguistic context to prosody prediction
  1. In particular, by integrating GST and BERT, the paper supports speaker-aware prosody prediction while also exploiting the contextual knowledge embedded in BERT
  2. This allows the model to generalize better to diverse linguistic inputs in multi-speaker scenarios with complex prosodic variation
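
The speaker-level averaging in Eq. 5 is an element-wise mean over one speaker's utterance-level GST vectors. A minimal sketch, with the function name `speaker_embedding` and the toy vectors as assumptions:

```python
from typing import List, Sequence


def speaker_embedding(utterance_vectors: Sequence[Sequence[float]]) -> List[float]:
    """Eq. 5: average the N_i utterance-level GST vectors of one speaker."""
    n = len(utterance_vectors)
    if n == 0:
        raise ValueError("at least one utterance vector is required")
    dim = len(utterance_vectors[0])
    if any(len(vec) != dim for vec in utterance_vectors):
        raise ValueError("all utterance vectors must share the same dimension")
    return [sum(vec[k] for vec in utterance_vectors) / n for k in range(dim)]


# Hypothetical GST vectors from two utterances of the same speaker.
s_i = speaker_embedding([[1.0, 2.0], [3.0, 6.0]])
print(s_i)
```

The averaged vector stands in for the speaker at inference, so no reference utterance is needed for each input sentence.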

3. Experiments

- Settings

Dataset : Japanese Multi-Speaker Corpus
Comparisons : $f_{0}$-BERT

- Results

GST-BERT-TTS shows strong prosody prediction performance

GST-BERT-TTS is also the best in terms of accent correctness

It also performs well in terms of MOS
