[Paper Review] ZET-Speech: Zero-Shot Adaptive Emotion-Controllable Text-to-Speech with Diffusion and Style-based Models

feVeRin 2024. 3. 25. 10:13

ZET-Speech: Zero-Shot Adaptive Emotion-Controllable Text-to-Speech Synthesis with Diffusion and Style-based Models

Emotional Text-to-Speech can synthesize natural and emotional speech
However, existing methods target only seen speakers and do not generalize to unseen speakers
ZET-Speech
- Performs any-speaker zero-shot adaptive text-to-speech using only a short speech segment and a target emotion label
- Introduces domain adversarial learning and guidance methods for the diffusion model so that the zero-shot adaptive model can synthesize emotional speech
Paper (INTERSPEECH 2023) : Paper Link

1. Introduction

Text-to-Speech (TTS) should extend beyond the seen speakers provided during training to unseen speakers
- To this end, zero-shot adaptive TTS models based on style-based generators and diffusion models have been proposed
- However, existing zero-shot adaptive TTS struggles to synthesize speech with a desired emotion
  - Typical zero-shot adaptive TTS does not consider emotion control, and conversely, emotional TTS does not consider the zero-shot setting

-> ZET-Speech is therefore proposed to enable emotional TTS in the zero-shot setting

ZET-Speech
- Zero-shot adaptive emotion-controllable TTS can use emotional speech from multiple speakers with emotion labels at training time, but only neutral speech from the target speaker at inference time
- The style vector of the style-based generator is highly entangled with both emotion features and speaker identity
- To address this,
  1. Domain adversarial training is introduced to disentangle emotional content from the style vector
  2. At inference, a guidance method is applied to synthesize emotional speech of the target speaker according to the emotion label

< Overview of ZET-Speech >

A zero-shot adaptive emotion-controllable TTS model that generates emotional speech for any speaker given an emotion condition and a reference speech
Emotional expression is improved with domain adversarial training on the style vector and a guidance method on the diffusion model
As a result, it achieves better zero-shot emotional TTS performance than existing models

2. Method

Given text (phonemes) $x$, reference speech $\mathbf{Y}$, and emotion label $e$, zero-shot adaptive emotion-controllable TTS generates the mel-spectrogram of emotional speech $\hat{\mathbf{Y}}$ with the target emotion
- To do so, ZET-Speech builds on a style-based generator and a diffusion model
- ZET-Speech adds two approaches to improve performance
  1. Domain adversarial training on the style-based generator during training
  2. A guidance method on the diffusion model during inference

- Preliminary: Style-based Generator and Diffusion

The style-based generator consists of two parts: a transformer encoder $f_{\theta}$ and a mel-style encoder $h_{\psi}$
- The mel-style encoder $h_{\psi}$ takes the mel-spectrogram of the reference speech $\mathbf{Y}$ as input and outputs the style vector $s=h_{\psi}(\mathbf{Y})$
  - The style vector $s$ embeds the reference speech of any speaker into a latent space
- The transformer encoder $f_{\theta}$ takes the phoneme sequence $x$ and the style vector $s$ as input and outputs the reconstructed mel-spectrogram $\mu = f_{\theta}(x,s)$
A diffusion model conditioned on the generator output $\mu$ can then synthesize high-fidelity speech
- Given $\mu$, the reverse diffusion process generates speech by denoising noise drawn from the Gaussian distribution $\mathbf{Y}_{T}\sim\mathcal{N}(\mu,I)$
- The differential equation of the reverse diffusion is:
  (Eq. 1) $d\mathbf{Y}_{t}=\left( \frac{1}{2}(\mu-\mathbf{Y}_{t})-\nabla_{\mathbf{Y}_{t}}\log p(\mathbf{Y}_{t})\right)\beta_{t}dt$
- A neural network $\epsilon_{\phi}(\mathbf{Y}_{t},t,\mu,s)$ estimates the score function $\nabla_{\mathbf{Y}_{t}}\log p_{t}(\mathbf{Y}_{t})$ of the noisy data distribution at timestep $t$
  - $\mathbf{Y}_{t}$ : noisy data, $\mu$ : noise prior, $s$ : style vector
- For emotion-controllable TTS, the emotion vector $e$ is additionally fed to both the hierarchical transformer encoder $f_{\theta}(x, s, e)$ and the score estimator $\epsilon_{\phi}(\mathbf{Y}_{t},t,\mu,s,e)$
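
As a rough illustration of (Eq. 1), the sketch below takes Euler steps backward in time, starting from $\mathbf{Y}_{T}\sim\mathcal{N}(\mu,I)$. It is only a minimal NumPy sketch under stated assumptions: `score_fn` is a hypothetical stand-in for the trained score estimator $\epsilon_{\phi}$, and the linear $\beta_{t}$ schedule, its endpoints, and the step count are illustrative choices, not values taken from the paper.

```python
import numpy as np


def reverse_step(y_t, mu, score, beta_t, h):
    """One Euler step of (Eq. 1), moving from time t to time t - h."""
    drift = (0.5 * (mu - y_t) - score) * beta_t
    # Time runs backward during sampling, so the drift is subtracted.
    return y_t - h * drift


def sample(mu, score_fn, n_steps=50, beta_min=0.05, beta_max=20.0, seed=0):
    """Integrate (Eq. 1) from t = 1 down to t = 0.

    score_fn(y, t) is a hypothetical placeholder for the score estimator.
    beta_min and beta_max define an assumed linear noise schedule.
    """
    rng = np.random.default_rng(seed)
    y = mu + rng.standard_normal(mu.shape)  # Y_T ~ N(mu, I)
    h = 1.0 / n_steps
    for i in range(n_steps):
        t = 1.0 - i * h
        beta_t = beta_min + (beta_max - beta_min) * t
        y = reverse_step(y, mu, score_fn(y, t), beta_t, h)
    return y
```

In a real system `mu` would be the generator output $f_{\theta}(x,s)$ and `score_fn` would wrap the network; here any callable with the same shape contract can be plugged in to trace the update.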

- Domain Adversarial Training

Emotional information is commonly incorporated into a TTS model by adding an emotion vector to the style vector and passing the combined vector to the style-based generator
- However, this is not effective because the style vector $s$ already contains both speaker identity and emotional features
  - Naively combining the emotion vector with the style vector does not disentangle the emotional features, which leads to sub-optimal performance
- Domain Adversarial Training (DAT) is therefore introduced so that the mel-style encoder disentangles the emotional features of the reference speech from the style vector
  1. An emotion classifier $g_{\lambda}$ is trained to predict the emotion label of the speech from the style vector $s$, i.e., it models $p(e|s;\lambda)$
  2. A gradient reversal layer is inserted before the emotion classifier to scale the gradient by the hyperparameter $-\alpha$
  3. As a result, given the emotion classification loss $\mathcal{L}_{e}=-\log p(\hat{e}|s;\lambda)$ for emotion label $\hat{e}$, the gradient that reaches the mel-style encoder parameters $\psi$ is $-\alpha\frac{\partial \mathcal{L}_{e}}{\partial \psi}$
- With DAT, the style vector that the mel-style encoder produces from speech carries disentangled information
  - That is, it contains less emotional information
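
The gradient reversal step can be traced by hand on a toy model. The sketch below is a minimal NumPy example, assuming a linear map `W_enc` as a stand-in for the mel-style encoder $h_{\psi}$ and a linear softmax classifier `W_cls` as a stand-in for $g_{\lambda}$; both are hypothetical simplifications. The classifier receives the ordinary gradient of $\mathcal{L}_{e}$, while the encoder receives it scaled by $-\alpha$.

```python
import numpy as np


def softmax(z):
    z = z - z.max()  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()


def dat_gradients(W_enc, W_cls, y, label, alpha):
    """Return (loss, grad for classifier, grad for encoder) for one sample.

    W_enc: toy linear "mel-style encoder" (assumption), s = W_enc @ y
    W_cls: toy linear emotion classifier (assumption), logits = W_cls @ s
    """
    s = W_enc @ y                      # style vector
    probs = softmax(W_cls @ s)         # p(e | s; lambda)
    loss = -np.log(probs[label])       # emotion classification loss L_e

    d_logits = probs.copy()
    d_logits[label] -= 1.0             # dL_e / dlogits for softmax + NLL
    grad_cls = np.outer(d_logits, s)   # classifier: ordinary gradient

    d_s = W_cls.T @ d_logits           # dL_e / ds
    d_s_reversed = -alpha * d_s        # gradient reversal layer
    grad_enc = np.outer(d_s_reversed, y)  # encoder sees -alpha * dL_e/dW_enc
    return loss, grad_cls, grad_enc
```

Taking a gradient descent step with `grad_cls` makes the classifier better at predicting emotion, while a descent step with `grad_enc` pushes the encoder to make that prediction harder, which is the adversarial effect described above.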

- Guidance Methods on the Diffusion Model

Thanks to DAT on the style vector, emotional speech conditioned on the provided emotion label can be synthesized
- A diffusion model is introduced so that the TTS model generates more emotional speech for the given emotion
  - Guidance methods can be considered so that the diffusion model generates better-conditioned data
  - Classifier guidance, Classifier-free guidance
- Guidance methods are combined with the diffusion model of ZET-Speech so that it generates emotional speech controlled by the given condition
- For classifier guidance,
  1. An emotion-unconditional score estimator $\epsilon_{\phi}(\mathbf{Y}_{t},t,\mu,s)$ is trained first, and then an emotion classifier $p(e| \mathbf{Y}_{t})$ is trained on noisy mel-spectrograms
  2. The classifier gradient is then added to the sampling process:
    (Eq. 2) $d\mathbf{Y}_{t}=\left(\frac{1}{2}(\mu-\mathbf{Y}_{t})-\nabla_{\mathbf{Y}_{t}}\log p(\mathbf{Y}_{t}|e)\right)\beta_{t}dt, \,\,\, \nabla_{\mathbf{Y}_{t}}\log p(\mathbf{Y}_{t}|e)=\gamma\nabla_{\mathbf{Y}_{t}}\log p(e| \mathbf{Y}_{t})+\nabla_{\mathbf{Y}_{t}}\log p(\mathbf{Y}_{t})$
    - $e$ : emotion label, $\gamma$ : hyperparameter controlling the guidance strength
- For classifier-free guidance, the emotion embedding is randomly replaced with a null embedding $\varnothing$ during training
  1. Note that $\mu=f_{\theta}(x, s,e)$ also contains emotional information
    - Therefore $\mu_{\varnothing}$ must also be passed in order to estimate the unconditional noise
  2. Accordingly, the null embedding is passed to the transformer encoder to produce $\mu_{\varnothing} =f_{\theta}(x,s,\varnothing)$
  3. The conditional and unconditional score estimates are combined to sample with classifier-free guidance:
    (Eq. 3) $\hat{\epsilon}_{\phi}(\mathbf{Y}_{t},t,\mu,s,e)=\epsilon_{\phi}(\mathbf{Y}_{t},t,\mu,s,e)+\gamma(\epsilon_{\phi}(\mathbf{Y}_{t},t,\mu,s,e)-\epsilon_{\phi}(\mathbf{Y}_{t},t,\mu_{\varnothing},s,\varnothing))$
    - $\gamma$ : hyperparameter controlling the guidance strength
  4. Finally, the differential equation of (Eq. 1) is modified to use the score estimate of (Eq. 3):
    (Eq. 4) $d\mathbf{Y}_{t}=\left(\frac{1}{2}(\mu-\mathbf{Y}_{t})-\hat{\epsilon}_{\phi}(\mathbf{Y}_{t},t,\mu,s,e)\right)\beta_{t}dt$
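
The two guidance rules reduce to simple array arithmetic once the network outputs are available. The sketch below is a minimal NumPy illustration of the combinations in (Eq. 2) and (Eq. 3) and of one Euler step of (Eq. 4). The function names are hypothetical, and the inputs stand in for outputs of the trained score estimator and noisy-mel emotion classifier, which are not implemented here.

```python
import numpy as np


def classifier_guided_score(score_uncond, classifier_grad, gamma):
    """(Eq. 2): gamma * grad log p(e | Y_t) + grad log p(Y_t)."""
    return gamma * classifier_grad + score_uncond


def classifier_free_estimate(eps_cond, eps_uncond, gamma):
    """(Eq. 3): eps_cond + gamma * (eps_cond - eps_uncond).

    eps_cond   stands for eps_phi(Y_t, t, mu, s, e)
    eps_uncond stands for eps_phi(Y_t, t, mu_null, s, null)
    """
    return eps_cond + gamma * (eps_cond - eps_uncond)


def guided_step(y_t, mu, eps_hat, beta_t, h):
    """One Euler step of (Eq. 4), moving from time t to time t - h."""
    return y_t - h * (0.5 * (mu - y_t) - eps_hat) * beta_t
```

With `gamma = 0`, `classifier_free_estimate` returns the conditional estimate unchanged and `classifier_guided_score` returns the unconditional score, so in both cases `gamma` only controls how far the estimate is pushed toward the emotion condition.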

3. Experiments

- Settings

Dataset : Korean Speech Dataset, LibriTTS
Comparisons : Grad-StyleSpeech

- Results

Comparing quantitative results for seen speakers on the Korean Speech Dataset, ZET-Speech performs best
- In particular, DAT greatly improves the model's ability to generate emotional speech
  - That is, without DAT, sensitivity to the emotion condition is low
- Both Classifier Guidance (CG) and Classifier-Free Guidance (CFG) improve performance
  - CG scores higher than CFG, but CFG has the advantage that it needs no additional classifier and can be combined with other methods such as global style tokens

ZET-Speech also performs best for unseen speakers

ZET-Speech also performs well on LibriTTS

In the subjective evaluation with MOS,
- As in the results above, ZET-Speech is more effective at conveying emotion
- In terms of MOS, using CFG produces more natural and emotional speech

Effect of Domain Adversarial Training
- Style vectors of speakers sampled from the training set are plotted with t-SNE
- Style vectors from the model trained without DAT form clear clusters, showing that they are entangled with emotion
- Style vectors trained with DAT are evenly distributed across emotions, showing successful disentanglement

Spectrum Visualization
- Mel-spectrograms of speech synthesized with different emotions are compared
- Happy and Angry generally show higher pitch than Sad
  - That is, ZET-Speech can reflect distinct variations across different emotional conditions

Guidance Scale
- The guidance scale affects both synthesis quality and emotional expressiveness
- Increasing the guidance scale raises both Emotion Classification Accuracy (ECA) and Character Error Rate (CER)
  - That is, a higher guidance scale improves emotional expressiveness but degrades quality
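
For reference, the two metrics named above can be computed as in the sketch below. It assumes the common definitions: CER as character-level edit distance divided by reference length, and ECA as the fraction of utterances whose predicted emotion matches the target. The transcripts and predicted labels would come from an ASR system and an emotion classifier that are not specified here, so the inputs are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (row-by-row DP)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref, hyp):
    """Character Error Rate: edit distance over reference length."""
    if not ref:
        raise ValueError("reference transcript must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


def eca(predicted, target):
    """Emotion Classification Accuracy: fraction of matching labels."""
    if len(predicted) != len(target) or not target:
        raise ValueError("label lists must be non-empty and equal in length")
    return sum(p == t for p, t in zip(predicted, target)) / len(target)
```

Lower CER means the synthesized speech is more intelligible, and higher ECA means the intended emotion is more recognizable, which is why both rising together with the guidance scale indicates a trade-off.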
