[Paper Review] PromptTTS++: Controlling Speaker Identity in Prompt-based Text-to-Speech using Natural Language Descriptions

feVeRin 2024. 4. 11. 09:10

PromptTTS++: Controlling Speaker Identity in Prompt-based Text-to-Speech using Natural Language Descriptions

Prompt-based text-to-speech can control speaker identity using natural language descriptions
PromptTTS++
- To control speaker identity, introduces a speaker prompt that describes voice characteristics and is designed to be independent of speaking style
- Uses a diffusion-based acoustic model to model diverse speaker factors
- Unlike previous methods that rely on style prompts carrying only limited information about speaker individuality, it uses the additional speaker prompt to effectively learn the mapping to acoustic features for diverse speakers
Paper (ICASSP 2024) : Paper Link

1. Introduction

In Text-to-Speech (TTS), controllable TTS that uses natural language descriptions such as prompts to control speaking style and similar attributes is attracting a lot of attention
- It provides an intuitive user interface, and flexibility can be improved by exploiting the strong language understanding capability of Large Language Models (LLMs)
- BUT, existing prompt-based TTS models lack controllability over speaker identity
  - For example, PromptTTS uses style prompts about gender, pitch, emotion, and so on, but these style prompts relate mainly to the prosody of an utterance, which makes it hard to finely control speaker identity
- As a result, speech cannot be generated for new speakers outside those in the training data, so controllability is severely limited

-> Therefore, the paper proposes PromptTTS++, a prompt-based TTS model that can control speaker identity through text prompts

PromptTTS++
- Built on PromptTTS, it introduces a speaker prompt that describes speaker identity through natural language descriptions designed to be independent of the style prompt
- Uses a Mixture Density Network (MDN) based on a Gaussian Mixture Model (GMM) to model the style/speaker embedding extracted by the Global Style Token (GST)-based reference encoder
  - This lets the model learn more diverse and richer speaker representations conditioned on the text prompt information
  - Additionally, a diffusion-based acoustic model is used to improve synthesis quality

< Overview of PromptTTS++ >

Introduces a speaker prompt, designed to be independent of speaking style, to control speaker identity
Effectively learns representations for diverse speakers using a diffusion-based acoustic model and an MDN
As a result, achieves better quality and prompt-to-speech consistency than existing prompt-based TTS models

2. Method

PromptTTS++ consists of a reference encoder, a prompt encoder, and an acoustic model
- With these, it generates output speech from the given input text (content prompt) and the speaker/style prompts

- Reference Encoder

To uncover latent acoustic variation related to the speaker, a GST-based reference encoder is used to extract a style embedding from the speech signal
- The GST-based reference encoder takes a log-scale mel-spectrogram as input and outputs a fixed-dimensional style embedding
- This style embedding is a weighted combination of learned style tokens and is used as a conditional feature for the acoustic model
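The "weighted combination of learned style tokens" above can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: it assumes a single-head scaled dot-product attention, and `ref_vector` (a summary of the reference mel-spectrogram) and the token bank sizes are hypothetical stand-ins. GST-style encoders typically produce the query with a convolutional and recurrent stack and use multi-head attention.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def style_embedding(ref_vector, style_tokens):
    """Combine learned style tokens into one fixed-size style embedding.

    ref_vector:   (d,) hypothetical summary of the reference log-mel-spectrogram.
    style_tokens: (num_tokens, d) token bank (learned in a real model).
    """
    d = style_tokens.shape[1]
    scores = style_tokens @ ref_vector / np.sqrt(d)  # one score per token
    weights = softmax(scores)                        # attention weights, sum to 1
    return weights @ style_tokens, weights           # weighted combination of tokens

rng = np.random.default_rng(0)
tokens = rng.standard_normal((10, 16))  # assumed sizes: 10 tokens, 16 dims
ref = rng.standard_normal(16)
emb, w = style_embedding(ref, tokens)
```

Whatever the length of the reference utterance, the output has the same dimensionality as one token, which is what makes it usable as a fixed-size conditioning feature.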

- Prompt Encoder

The prompt encoder predicts the style embedding from the input prompts
- Previous methods such as PromptTTS adopt BERT as a pre-trained language model to extract the prompt embedding
- The proposed PromptTTS++ likewise adopts BERT as the fundamental building block of the prompt encoder
  - To do so, the speaker and style prompts are concatenated into one, BERT is used to obtain the prompt embedding, and then three linear layers are applied

During training, a cosine similarity loss could be used to minimize the difference between the embeddings predicted by the reference encoder and the prompt encoder
- BUT, using cosine similarity severely limits the capability to generate diverse speakers
- Therefore, PromptTTS++ uses a GMM-based MDN to model the conditional distribution of the embedding given the prompt information
  - To do so, an MDN layer is added after the BERT-based stack above, and its output is used as the parameters of the GMM
- With the MDN, PromptTTS++ can learn diverse speaker characteristics as a probability distribution and generate new speakers by sampling from that distribution
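The idea of an MDN head whose output parameterizes a GMM can be sketched as follows. This is a minimal NumPy illustration under assumptions not stated in the source: diagonal covariances, a flat output vector laid out as [mixture logits, means, log standard deviations], and hypothetical function names. In a real model, `raw` would come from the prompt encoder and `ref_embedding` from the reference encoder.

```python
import numpy as np

def split_mdn_output(raw, num_mix, dim):
    """Interpret a flat MDN output vector as diagonal-GMM parameters."""
    logits = raw[:num_mix]
    means = raw[num_mix:num_mix + num_mix * dim].reshape(num_mix, dim)
    log_std = raw[num_mix + num_mix * dim:].reshape(num_mix, dim)
    log_pi = logits - np.logaddexp.reduce(logits)  # log-softmax over components
    return log_pi, means, log_std

def gmm_nll(target, log_pi, means, log_std):
    """Negative log-likelihood of `target` under the mixture."""
    z = (target - means) / np.exp(log_std)
    log_comp = -0.5 * np.sum(z ** 2 + 2.0 * log_std + np.log(2.0 * np.pi), axis=1)
    return -np.logaddexp.reduce(log_pi + log_comp)  # log-sum-exp over components

def sample_gmm(log_pi, means, log_std, rng):
    """Draw one embedding: pick a component, then sample from its Gaussian."""
    k = rng.choice(len(log_pi), p=np.exp(log_pi))
    return means[k] + np.exp(log_std[k]) * rng.standard_normal(means.shape[1])

rng = np.random.default_rng(0)
num_mix, dim = 4, 8                                   # assumed sizes
raw = rng.standard_normal(num_mix + 2 * num_mix * dim)  # stand-in for the MDN layer output
log_pi, means, log_std = split_mdn_output(raw, num_mix, dim)
ref_embedding = rng.standard_normal(dim)              # stand-in for the reference embedding
loss = gmm_nll(ref_embedding, log_pi, means, log_std)
new_speaker = sample_gmm(log_pi, means, log_std, rng)
```

A cosine similarity loss pulls the prediction toward a single point per prompt, whereas the likelihood objective lets the same prompt cover a region of embedding space, so repeated sampling yields different but prompt-consistent voices.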

- Acoustic Model

The acoustic model generates a mel-spectrogram from the content prompt and the style embedding
- The acoustic model consists of a content encoder, a variance adaptor, and a diffusion decoder
  1. Structurally, the content encoder is based on Conformer
  2. The variance adaptor is the same as in FastSpeech2, except that it does not use an energy predictor
    - The variance adaptor consists of a duration predictor and a pitch predictor, where the pitch predictor predicts the logarithmic fundamental frequency $\log\text{-}F0$ and a voiced/unvoiced flag (V/UV)
  3. The diffusion decoder uses DiffSinger, a mel-spectrogram generation model based on the denoising diffusion probabilistic model
    - During training, the diffusion decoder learns to remove the noise added to the mel-spectrogram
- Empirically, the Transformer-based decoder of the original PromptTTS was found to yield low synthesis quality
  - In contrast, PromptTTS++ can greatly improve the naturalness of synthesized speech through the diffusion decoder
  - Likewise, an MDN layer is added to the duration predictor to improve naturalness
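The denoising training of the diffusion decoder can be sketched generically. The snippet below is a standard DDPM forward-noising step, not DiffSinger's actual code: the linear beta schedule, the number of steps, the mel-spectrogram shape, and the mean-squared noise-prediction loss (the simplified, reweighted form of the variational bound) are all assumptions, and the zero array is a stand-in for the output of a real denoising network.

```python
import numpy as np

def make_schedule(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Linear noise schedule (assumed) and cumulative signal-retention factors."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    alphas_cumprod = np.cumprod(1.0 - betas)
    return betas, alphas_cumprod

def q_sample(x0, t, alphas_cumprod, noise):
    """Forward process: corrupt the clean mel-spectrogram x0 to step t in closed form."""
    a = alphas_cumprod[t]
    return np.sqrt(a) * x0 + np.sqrt(1.0 - a) * noise

def noise_prediction_loss(predicted_noise, noise):
    """Simplified training objective: how well the added noise was predicted."""
    return np.mean((predicted_noise - noise) ** 2)

rng = np.random.default_rng(0)
betas, alphas_cumprod = make_schedule()
mel = rng.standard_normal((80, 120))         # assumed shape: 80 mel bins x 120 frames
noise = rng.standard_normal(mel.shape)
t = 500
noisy_mel = q_sample(mel, t, alphas_cumprod, noise)
stand_in_prediction = np.zeros_like(noise)   # a real decoder would predict this from noisy_mel, t, and conditioning
loss = noise_prediction_loss(stand_in_prediction, noise)
```

At inference time the process runs in reverse: starting from pure noise, the decoder removes the predicted noise step by step, conditioned on the content encoder and variance adaptor outputs.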

- Training

PromptTTS++ is trained by minimizing the following loss function:
(Eq. 1) $L=L_{dec}+L_{dur}+L_{pitch}+L_{style}$
- $L_{dec}, L_{dur}, L_{pitch}, L_{style}$ : losses for the diffusion decoder, duration predictor, pitch predictor, and prompt encoder, respectively
- For $L_{dur}$ and $L_{style}$, a negative log-likelihood loss is used
- For $L_{dec}$ and $L_{pitch}$, the weighted variational lower bound and the $L1$ loss are used, respectively
- $L_{pitch}$ contains two $L1$ losses, one for $\log\text{-}F0$ and one for V/UV
So that the prompt encoder is trained separately from the rest of the model, a stop-gradient operation is applied to the output of the reference encoder when optimizing $L_{style}$
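As an illustration, two of the terms in (Eq. 1) can be written out explicitly. The notation here is an assumption and not taken from the paper: $K$ is the number of mixture components, $p$ is the concatenated speaker/style prompt, $e_{ref}$ is the reference encoder output, $\mathrm{sg}[\cdot]$ denotes stop-gradient, a diagonal covariance is assumed, and hats mark predicted values.

```latex
\begin{aligned}
L_{style} &= -\log \sum_{k=1}^{K} \pi_k(p)\,
  \mathcal{N}\!\left(\mathrm{sg}[e_{ref}];\ \mu_k(p),\ \mathrm{diag}\!\left(\sigma_k^2(p)\right)\right) \\
L_{pitch} &= \left\lVert \log\text{-}F0 - \widehat{\log\text{-}F0} \right\rVert_1
  + \left\lVert \mathrm{V/UV} - \widehat{\mathrm{V/UV}} \right\rVert_1
\end{aligned}
```

Because of $\mathrm{sg}[\cdot]$, the gradient of $L_{style}$ updates only the prompt encoder parameters behind $\pi_k, \mu_k, \sigma_k$, and the reference encoder is shaped solely by the remaining terms of (Eq. 1).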

3. Experiments

- Settings

Dataset : LibriTTS-R (with Text Prompt)
Comparisons : PromptTTS

- Results

In terms of naturalness, the proposed PromptTTS++ outperforms PromptTTS
- PromptTTS++ likewise achieves better performance in terms of prompt-to-speech consistency
- In fact, the synthesized speech of PromptTTS++ is at nearly the same level as the ground truth

Looking at the t-SNE results for the style embeddings
- PromptTTS++ produces distinctly clustered embeddings, as in panel (a) of the figure
- In contrast, without the speaker prompt, only two clusters corresponding to gender are formed, as in panel (b), so different speakers become indistinguishable
