[Paper Review] PL-TTS: A Generalizable Prompt-based Diffusion TTS Augmented by Large Language Model

feVeRin 2024. 10. 12. 11:32

PL-TTS: A Generalizable Prompt-based Diffusion TTS Augmented by Large Language Model

A text style description can be used for style-controlled Text-to-Speech
PL-TTS
- Combines prompts embedded by a Large Language Model with a diffusion-based Text-to-Speech model
- Additionally fine-tunes the Large Language Model and the diffusion framework to improve synthesis quality and style controllability
Paper (INTERSPEECH 2024) : Paper Link

1. Introduction

For controllable expressive Text-to-Speech (TTS), fundamental frequency or style tokens can be used, but the user has to select a suitable reference speech, which is time-consuming and not user-friendly
- Prompt-based TTS, which guides generation with a text prompt, can instead achieve good controllability
  - For example, PromptTTS and InstructTTS take style and content description prompts as additional input to synthesize expressive speech
- The prompt usually describes desired characteristics such as speaker gender, tone, and speed
  1. That is, a pre-trained natural language processor handles the text description, and its output embedding is passed to the speech synthesizer to perform expressive TTS
    - BUT, PromptStyle, PromptTTS++, etc. still rely on BERT to process the text style prompt
  2. Replacing the BERT-based style predictor with a Large Language Model (LLM) can therefore make prompt processing more robust

-> The paper therefore proposes PL-TTS, which improves prompt-based TTS with an LLM

PL-TTS
- Adopts a diffusion-based TTS model as a joint learning framework that combines gradient-based and generator-based learning objectives
  - The framework mitigates the drop in synthesis quality caused by diverse style conditions
- Fine-tunes the Llama2 model with prompt tuning to extract style embeddings, and uses the LLM's semantic understanding capability to improve generalizability to unseen data

< Overview of PL-TTS >

A prompt-based TTS model that combines an LLM with a diffusion-based TTS framework
As a result, it achieves better synthesis quality and controllability than existing methods

2. Method

Denoising Diffusion Probabilistic Models (DDPM), refined on the basis of stochastic calculus, perform well at modeling complex data distributions
- The diffusion/reverse processes of a diffusion model are defined by the model itself, and the data distribution is learned through a denoising neural network $\theta$
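
To make the two processes concrete, here is a minimal 1-D sketch using the SDE pair that this section writes out later as Eq. 1 and Eq. 2. The linear schedule `BETA_0`/`BETA_1`, the Gaussian toy data, and the closed-form `toy_score` (which stands in for the trained denoising network) are all assumptions made for illustration; nothing here is PL-TTS code.

```python
import numpy as np

BETA_0, BETA_1 = 0.05, 20.0  # assumed linear noise schedule, not taken from the post


def beta(t):
    """Noise level beta_t for t in [0, 1]."""
    return BETA_0 + (BETA_1 - BETA_0) * t


def int_beta(t):
    """Closed-form integral of beta_s ds from 0 to t."""
    return BETA_0 * t + 0.5 * (BETA_1 - BETA_0) * t ** 2


def forward_sample(x0, t, rng):
    """Draw X_t given X_0 under dX = -0.5 * X * beta dt + sqrt(beta) dW."""
    decay = np.exp(-0.5 * int_beta(t))   # shrinks the data term
    lam = 1.0 - np.exp(-int_beta(t))     # variance of the injected noise
    return decay * x0 + np.sqrt(lam) * rng.standard_normal(x0.shape)


def toy_score(x, t, mean0, std0):
    """Exact score of X_t when X_0 ~ N(mean0, std0^2); stands in for the network."""
    decay = np.exp(-0.5 * int_beta(t))
    var_t = (std0 * decay) ** 2 + 1.0 - decay ** 2
    return -(x - mean0 * decay) / var_t


def reverse_sample(n_samples, n_steps, mean0, std0, rng):
    """Discretized reverse SDE: start from N(0, 1) at t=1 and walk back to t=0."""
    x = rng.standard_normal(n_samples)
    for i in range(n_steps, 0, -1):
        t = i / n_steps
        b = beta(t)
        drift = 0.5 * x + toy_score(x, t, mean0, std0)
        noise = rng.standard_normal(n_samples)
        x = x + (b / n_steps) * drift + np.sqrt(b / n_steps) * noise
    return x


rng = np.random.default_rng(0)
x0 = rng.normal(2.0, 0.5, size=4000)                 # toy "data"
x1 = forward_sample(x0, 1.0, rng)                    # roughly standard Gaussian
x0_hat = reverse_sample(4000, 200, 2.0, 0.5, rng)    # roughly N(2.0, 0.5^2) again
print(x1.mean(), x1.std(), x0_hat.mean(), x0_hat.std())
```

In a real TTS model the score is not available in closed form, which is exactly the part the denoising network has to learn.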

- Multi-task Framework

Diffusion parameterization techniques fall into 2 broad categories:
1. Gradient-based Method
  - Like DDPM, learns the gradient of the data log-density to obtain the intrinsic structure of the data distribution
  - That is, it focuses on the micro-mechanism of data variation and supports granular control over the generation process
    - To do so, it fine-tunes the gradient information and progressively approaches the true distribution
2. Generator-based Method
  - Like DiffGAN-TTS, directly learns the transformation function from noisy data to clean data and predicts the clean data $x_{0}$ directly
  - That is, it focuses on reconstructing the overall data structure and can generate higher-quality samples
PL-TTS therefore builds on GradTTS and integrates both the gradient-based and the generator-based diffusion parameterization techniques above
- First, the denoising model of GradTTS learns the gradient of the log-density from data and predicts a variable in $\epsilon$ space, so it is equivalent to a gradient-based model
  1. It uses the U-Net of WaveGrad for mel-spectrogram prediction and refines the diffusion model with a Stochastic Differential Equation (SDE)
  2. That is, the gradual noisification process of GradTTS is:
    (Eq. 1) $dX_{t} = -\frac{1}{2}X_{t}\beta_{t}dt +\sqrt{\beta_{t}}dW_{t}$
    - $t$ : time step, $X_{t}$ : state at time $t$
    - $\beta_{t}$ : noise level at time $t$, $dW_{t}$ : increment of the Wiener process
  3. For sampling, GradTTS uses a discrete version of the reverse SDE:
    (Eq. 2) $X_{t-\frac{1}{N}}=X_{t}+\frac{\beta_{t}}{N}\left(\frac{1}{2}X_{t}+\nabla_{X_{t}}\log p_{t}(X_{t})\right)+\sqrt{\frac{\beta_{t}}{N}}z_{t}$
    - Here the data $X_{0}$ is generated from standard Gaussian noise $X_{T}$
    - $N$ : number of steps of the discretized reverse process, $z_{t}$ : standard Gaussian noise
- In GradTTS $T$ is set to 1, so the step size is $\frac{1}{N}$ and $t\in\{\frac{1}{N},\frac{2}{N},...,1\}$; let $\mu$ be the phoneme-related Gaussian mean conditioned on speaker $s$
  1. The gradient-based loss is then:
    (Eq. 3) $\mathcal{L}_{diff}=\mathbb{E}_{X_{0},t}\left[\lambda_{t}\mathbb{E}_{\xi_{t}}\left|\left| \epsilon_{\theta}(X_{t},t,\mu,s)+\frac{\xi_{t}}{\sqrt{\lambda_{t}}} \right|\right|^{2}\right]$
    - $X_{0}$ : target mel-spectrogram sample, $t$ : sampled at discrete intervals from a uniform distribution on $[0,T]$
  2. Here $\xi_{t}\sim \mathcal{N}(0,I)$, and $\lambda_{t}=1-e^{-\int_{0}^{t}\beta_{s}ds}$ modulates the noise level according to the time step $t$
- BUT, this gradient-centric approach is limited in capturing the overarching consistency and structural characteristics of the synthesized data
  1. A loss based on the generative principle is therefore introduced
    - In particular, the Structural Similarity Index (SSIM) is useful for capturing structural and textural information
    - The SSIM metric has a range of $[0,1]$, where $1$ means impeccable perceptual quality
  2. Following ProDiff, the paper adds the following loss to training:
    (Eq. 4) $\mathcal{L}_{SSIM}=1-\text{SSIM}(\hat{X}_{0},X_{0})$
    - $\hat{X}_{0}$ : generated mel-spectrogram, $X_{0}$ : target mel-spectrogram
  3. The final training objective of PL-TTS is then:
    (Eq. 5) $\mathcal{L}=\gamma_{1}\mathcal{L}_{dur}+\gamma_{2}\mathcal{L}_{diff}+ \gamma_{3}\mathcal{L}_{SSIM}+\gamma_{4}\mathcal{L}_{prior}+\gamma_{5}\mathcal{L}_{1}$
    - $\mathcal{L}_{1}$ : $L1$ loss, $\mathcal{L}_{prior}$ : prior loss of GradTTS
    - $\gamma_{1}=\gamma_{2}=\gamma_{3}=\gamma_{4}=\gamma_{5}=1$ : weights
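
The loss terms above can be sketched in NumPy as follows. This is only an illustration under stated assumptions: the score-matching term is taken in the GradTTS form $\xi_{t}/\sqrt{\lambda_{t}}$, the linear $\beta$ schedule is assumed, `ssim_global` is a simplified single-window SSIM (standard SSIM averages over local windows), and the random arrays stand in for mel-spectrograms and network outputs. All function names are hypothetical.

```python
import numpy as np


def lambda_t(t, beta_0=0.05, beta_1=20.0):
    """lambda_t = 1 - exp(-int_0^t beta_s ds) for an assumed linear beta schedule."""
    return 1.0 - np.exp(-(beta_0 * t + 0.5 * (beta_1 - beta_0) * t ** 2))


def diffusion_loss(eps_pred, xi, lam):
    """Eq. 3 for one sampled t: lambda_t * || eps_theta + xi / sqrt(lambda_t) ||^2."""
    return lam * np.sum((eps_pred + xi / np.sqrt(lam)) ** 2)


def ssim_global(x, y, data_range=1.0):
    """Single-window SSIM over the whole array (a simplification of windowed SSIM)."""
    c1, c2 = (0.01 * data_range) ** 2, (0.03 * data_range) ** 2
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + c1) * (2 * cov + c2)) / ((mx ** 2 + my ** 2 + c1) * (vx + vy + c2))


def total_loss(l_dur, l_diff, l_ssim, l_prior, l_1, gammas=(1.0, 1.0, 1.0, 1.0, 1.0)):
    """Eq. 5: weighted sum of the five terms, all weights 1 by default."""
    terms = (l_dur, l_diff, l_ssim, l_prior, l_1)
    return sum(g * l for g, l in zip(gammas, terms))


rng = np.random.default_rng(0)
x0 = rng.random((80, 50))                 # stand-in target "mel-spectrogram" in [0, 1]
lam = lambda_t(0.5)
xi = rng.standard_normal(x0.shape)
x_t = np.sqrt(1.0 - lam) * x0 + np.sqrt(lam) * xi   # noisy input the network would see

eps_perfect = -xi / np.sqrt(lam)          # output of an ideal gradient-based denoiser
x0_hat = np.clip(x0 + 0.05 * rng.standard_normal(x0.shape), 0.0, 1.0)  # fake generator output

l_diff = diffusion_loss(eps_perfect, xi, lam)
l_ssim = 1.0 - ssim_global(x0_hat, x0)    # Eq. 4
l_1 = np.abs(x0_hat - x0).mean()
loss = total_loss(0.0, l_diff, l_ssim, 0.0, l_1)   # duration and prior terms set to 0 here
print(l_diff, l_ssim, l_1, loss)
```

The point of combining the terms is visible even in this toy: `l_diff` only checks the predicted gradient at one noise level, while `l_ssim` and `l_1` compare the reconstructed $\hat{X}_{0}$ with the target as a whole.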

- Large Language Model Enhancement

Style prompt processing aims to extract the style, speaking rate, etc. that are described in natural language
- Typically, a pre-trained natural language processing model such as BERT or RoBERTa is used to extract style features from the style prompt
  1. A $[\text{CLS}]$ token is prepended to the input sequence of the pre-trained BERT
  2. The hidden vector corresponding to the $[\text{CLS}]$ token is then used as the style representation that guides speech synthesis
- BUT, BERT has limited generalizability when prompt data is scarce, so the paper replaces it with the LLM Llama2
  1. First, during fine-tuning, Llama2 is prompt-tuned with LoRA on an auxiliary classification task so that it predicts style attributes such as gender, pitch, speaking rate, and volume from the style prompt
  2. After fine-tuning, the style prompt is processed into Llama2's standardized prompt format, as shown in the figure below, and converted to word embeddings
  3. These are then passed to the fine-tuned model to obtain the last token of the last hidden layer
    - This last token is used as the embedding representation of the style prompt
- As a result, this approach improves the style encoder's handling of unseen data and thus the generalizability of PL-TTS
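
The difference between the BERT-style $[\text{CLS}]$ readout and the last-token readout described above can be sketched with plain arrays. The hidden states below are fake numbers rather than Llama2 outputs, the function names are hypothetical, and right padding is assumed; in a causal LM the last real token is the only position that has attended to the whole prompt, which is why it is a natural place to read the embedding from.

```python
import numpy as np


def cls_pooling(hidden):
    """BERT-style: take the hidden vector at position 0, where [CLS] sits."""
    return hidden[:, 0, :]


def last_token_pooling(hidden, attention_mask):
    """Decoder-style: take the hidden vector of the last non-padding token.

    Assumes right padding, i.e. the mask is 1 for real tokens and 0 afterwards.
    """
    last_idx = attention_mask.sum(axis=1) - 1            # shape (batch,)
    return hidden[np.arange(hidden.shape[0]), last_idx, :]


# Fake last-layer hidden states: batch of 2 prompts, 5 positions, 4 dimensions.
hidden = np.arange(2 * 5 * 4, dtype=float).reshape(2, 5, 4)
mask = np.array([[1, 1, 1, 0, 0],     # prompt with 3 real tokens
                 [1, 1, 1, 1, 1]])    # prompt with 5 real tokens

style_emb = last_token_pooling(hidden, mask)   # rows taken from positions 2 and 4
print(style_emb.shape)
```

With left padding the last real token would simply be position `-1` for every row, so the mask arithmetic is only needed for the right-padded case assumed here.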

3. Experiments

- Settings

Dataset : LibriTTS, PromptSpeech
Comparisons : PromptTTS, GradTTS
Variants
- P1 : GradTTS+BERT+$\mathcal{L}_{1}$+$\mathcal{L}_{SSIM}$
- P2 : GradTTS+Llama2
- P3 : GradTTS+Llama2 (with Style Embedding)
- P4 (Proposed) : GradTTS+Llama2 (with Style Embedding)+$\mathcal{L}_{1}$+$\mathcal{L}_{SSIM}$

- Results

Overall, PL-TTS achieves the best performance

In terms of generalizability to unseen data, PL-TTS with Llama2 is also the best
