[Paper 리뷰] E3-VITS: Emotional End-to-End TTS with Cross-Speaker Style Transfer

티스토리 뷰

Paper/TTS

[Paper 리뷰] E3-VITS: Emotional End-to-End TTS with Cross-Speaker Style Transfer

feVeRin 2025. 5. 23. 13:22

E3-VITS: Emotional End-to-End TTS with Cross-Speaker Style Transfer

Emotional text-to-speech를 위해서는 high labeling cost가 필요함
E3-VITS
- Disjoint dataset을 활용해 cross-speaker emotion transfer를 지원
- Cross-speaker emotion transfer quality를 향상하는 batch-permuted style perturbation을 도입
논문 (ICML 2023) : Paper Link

1. Introduction

Human-like Text-to-Speech (TTS)를 위해서는 emotion modeling을 향상해야 함
- Naive approach로써 labeled emotional speech corpora를 활용하는 것을 고려할 수 있음
  - BUT, speaker-emotion pair에 대한 large-scale corpora가 필요함
- Period VITS와 같은 cross-speaker emotion transfer model 역시 predefiend categorical label에 의존함
  - 이때 reference speech에서 emotional feature를 추출하는 reference encoder를 도입할 수도 있음
- 한편으로 categorical label과 reference encoder를 combine 하기 위해 language model을 활용할 수 있음
  - BUT, 대부분 sequential training과 vocoder fine-tuning이 필요한 two-stage model에 의존함

-> 그래서 end-to-end emotional TTS model인 E3-VITS를 제안

E3-VITS
- Text representation을 speaker, emotion feature와 disentangle 하는 Domain Adversarial Training (DAT)를 도입
- Cross-speaker emotion transfer를 향상하기 위해 Batch-Permuted Style Perturbation을 적용

< Overall of E3-VITS >

Reference speech와 textual emotional description을 모두 활용하는 end-to-end emotional TTS model
결과적으로 기존보다 우수한 성능을 달성

2. Method

E3-VITS는 VITS architecture를 기반으로 구축됨
- 이때 emotion control을 위해 style embedding과 style encoder를 도입하고, entanlgement를 방지하기 위해 domain classifier를 활용함
- Cross-speaker emotion transfer quality를 향상하기 위해 batch-permuted style perturbation으로 adversarial network training을 modify 함
- 추가적으로 stable duration prediction을 위해 stochastic duration predictor를 FastSpeech2의 deterministic duration predictor로 replace 함
  - 이를 통해 robustness와 accurate phoneme duration prediction을 보장함

- Style Embedding

E3-VITS는 emotional information을 style embedding을 통해 incorporate 함
- 해당 style embedding은 speaker embedding과 동일한 256-dimension을 가지고 style encoder를 통해 얻어짐
  - Style embedding은 speaker embedding과 함께 각 module에 input 됨
- E3-VITS는 각 training iteration에서 reference speech, style tag로부터 style embedding을 equal probability로 randomly select 함

- Domain Adversarial Training

Text representation $h_{text}$에서 speaker information을 separate 하기 위해, 논문은 SANE-TTS와 같이 speaker reversal classifier를 integrate 하여 Domain Adversarial Training (DAT)를 적용함
- 추가적으로 emotional information을 text representation에서 remove 하기 위해 style tag를 classify 하는 style reversal classifier를 도입함
- 이때 fine-grained label인 style tag의 complexity를 avoid 하기 위해 general class인 emotional category를 coarse-grained label로 채택하여 expressive speech를 생성함
  1. Speaker, style classifier는 fully connected layer와 gradient reversal layer로 구성되고 hidden text representation $h_{text}$를 input으로 사용함
  2. 한편으로 speaker information의 potential leakage를 방지하기 위해, style encoder의 style embedding을 input으로 사용하는 speaker classifier를 추가함

- Batch-Permuted Style Perturbation

논문은 dataset에서 unpaired speaker, emotion으로 synthesize 된 speech의 quality를 향상하기 위해 Batch-Permuted Style Perturbation을 도입함
- 먼저 VITS의 voice conversion process를 따라 batch 내의 latent variable $z$의 style embedding을 perturb 함
  1. 그러면 unpaired set의 style-perturbed latent variable $\tilde{z}$는:
    (Eq. 1) $\tilde{z}=f(f^{-1}(z|s,e)|s,\tilde{e})$
    - $f$ : normalizing flow, $z$ : paired set의 latent variable, $s,e$ : 각각 speaker, style embedding, $\tilde{e}$ : training set에서 $s$의 speaker와 pair 되지 않은 style embedding
  2. 각 batch마다 randomly permuted paired style embedding에서 unpaired style embedding이 생성됨
    - Stop gradient operator는 other module에 대한 flow의 영향을 limit 하기 위해 forward process 이후에 적용됨
  3. $\tilde{z}$를 얻은 다음, $z$가 $\tilde{z}$에 concatenate 되어 decoder에 전달됨
    - Discriminator에서는 paired ground-truth audio sample이 unpaired set의 generated audio sample을 위한 ground-truth로써 사용됨
- 결과적으로 E3-VITS의 discriminator adversarial loss $\mathcal{L}_{adv}^{D}$와 generator adversarial loss $\mathcal{L}_{adv}^{G}$는:
  (Eq. 2) $\mathcal{L}_{adv}^{D}=\mathbb{E}_{y,z}\left[(D(y)-1)^{2}+D(G(z))^{2}\right]+\mathbb{E}_{y,\tilde{z}} \left[(D(y)-1)^{2}+D(G(\tilde{z}))^{2}\right]$
  (Eq. 3) $\mathcal{L}_{adv}^{G}=\mathbb{E}_{z}\left[(D(G(z))-1)^{2}\right]+\mathbb{E}_{\tilde{z}}\left[(D(G(\tilde{z}))-1)^{2}\right]$
  - $G$ : decoder, $D$ : discriminator, $y$ : ground-truth waveform
- 여기서 feature matching loss는 VITS의 adversarial training에서 사용되므로, 논문은 style perturbation을 고려하기 위해 modified feature matching loss $\mathcal{L}_{fm}^{G}$를 고려함:
  (Eq. 4) $\mathcal{L}_{fm}^{G}=\mathbb{E}_{y,z}\left[\sum_{l=1}^{T}\frac{1}{N_{l}}|| D^{l}(y)-D^{l}(G(z))||\right]+\mathbb{E}_{y,\tilde{z}}\left[\sum_{l=1}^{T}\frac{1}{N_{l}}|| D^{l}(y)-D^{l}(G(\tilde{z}))||\right]$
  - $T$ : discriminator의 total layer 수, $D^{l}(x)$ : input $x$이 주어졌을 때 discriminator $l$-th layer의 $N_{l}$ feature에 대한 output feature map

- Objective Function

Emotional feature를 학습하기 위해 3가지 style loss로 구성된 loss를 활용함
- 여기서 style loss $\mathcal{L}_{style}$는 다음과 같음:
  (Eq. 5) $\mathcal{L}_{style}=\alpha\mathcal{L}_{s}^{emb}+\beta\mathcal{L}_{s}^{rec}+\gamma\mathcal{L}_{s}^{con}$
  - $\mathcal{L}_{s}^{emb}, \mathcal{L}_{s}^{rec},\mathcal{L}_{s}^{con}$ : 각각 style embedding loss, style reconstruction loss, style contrastive loss
  - $\alpha=\beta=45, \gamma=0.45$
- Weight의 ratio은 $Y$ setting에 따라 consistent 하게 setting 되고, $\mathcal{L}_{s}^{con}$의 temperature는 stable training을 위해 $1.0$으로 fix 됨
- 결과적으로 얻어지는 discriminator의 training objective $\mathcal{L}_{total}^{D}$와 generator의 training objective $\mathcal{L}_{total}^{G}$는:
  (Eq. 6) $\mathcal{L}_{total}^{D}=\mathcal{L}_{adv}^{D}$
  (Eq. 7) $\mathcal{L}_{total}^{G}=\mathcal{L}_{recon}+\mathcal{L}_{kl}+\mathcal{L}_{dur}+\mathcal{L}_{adv}^{G}+\mathcal{L}_{fm}^{G}+\mathcal{L}_{style}+\lambda \mathcal{L}_{dat}$
  - $\mathcal{L}_{recon},\mathcal{L}_{kl}, \mathcal{L}_{dur}$ : VITS를 따름
  - $\mathcal{L}_{dat}$ : DAT의 classification loss, $\lambda$ : scale factor

3. Experiments

- Settings

Dataset : FSNR0
Comparisons : Tacotron2

- Results

전체적으로 E3-VITS의 성능이 가장 우수함

RTF 측면에서도 $11.4\times$의 inference speed 향상을 보임

'Paper > TTS' 카테고리의 다른 글

[Paper 리뷰] ZCS-CDiff: A Zero-Shot Code-Switching TTS System with Conformer-Based Diffusion Model (0)	2025.05.28
[Paper 리뷰] MB-iSTFT-VITS: Lightweight and High-Fidelity End-to-End Text-to-Speech with Multi-Band Generation and Inverse Short-Time Fourier Transform (0)	2025.05.27
[Paper 리뷰] InstantSpeech: Instant Synchronous Text-to-Speech Synthesis for LLM-driven Voice Chatbots (0)	2025.05.20
[Paper 리뷰] DPI-TTS: Directional Patch Interaction for Fast-Converging and Style Temporal Modeling in Text-to-Speech (0)	2025.05.15
[Paper 리뷰] Factorized-VITS: Decoding Prosody and Text in End-to-End Speech Synthesis without External or Secondary Aligner (0)	2025.05.13

최근에 올라온 글

최근에 달린 댓글

« 2025/11 »
일	월	화	수	목	금	토
						1
2	3	4	5	6	7	8
9	10	11	12	13	14	15
16	17	18	19	20	21	22
23	24	25	26	27	28	29
30

Total

Today

Yesterday

Let IT Begin

티스토리 뷰

[Paper 리뷰] E3-VITS: Emotional End-to-End TTS with Cross-Speaker Style Transfer

E3-VITS: Emotional End-to-End TTS with Cross-Speaker Style Transfer

1. Introduction

2. Method

- Style Embedding

- Domain Adversarial Training

- Batch-Permuted Style Perturbation

- Objective Function

3. Experiments

- Settings

- Results

'Paper > TTS' 카테고리의 다른 글

티스토리툴바