[Paper Review] E3-TTS: Easy End-to-End Diffusion-based Text to Speech

feVeRin 2025. 6. 26. 17:02

E3-TTS: Easy End-to-End Diffusion-based Text to Speech

An end-to-end diffusion-based Text-to-Speech model can produce high-fidelity speech
E3-TTS
- Takes plain text as input and generates the waveform through an iterative refinement process
- In particular, it does not rely on intermediate representations such as spectrogram features or alignment information
Paper (ASRU 2023) : Paper Link

1. Introduction

Introducing diffusion models into Text-to-Speech (TTS) systems, as in WaveGrad and DiffWave, yields high-fidelity speech
- TTS models typically use a two-stage process: a generator that produces an intermediate representation, and a vocoder that predicts audio from that intermediate representation
  - Here, the TTS model converts text into input units such as phonemes or graphemes
- Meanwhile, generating audio from text end-to-end requires modeling the strong temporal dependency of the waveform
  1. To this end, alignment information can be introduced, which provides a mapping between individual input units such as phonemes and the output samples of the generated audio
  2. However, extracting alignment information requires an external aligner or a complex pipeline

-> The paper therefore proposes E3-TTS, an end-to-end diffusion-based TTS model that takes text directly as input

E3-TTS
- Extracts information from the input text with a pre-trained BERT model
- Applies a diffusion process based on a U-Net structure that attends to the BERT representation and predicts audio

< Overview of E3-TTS >

A diffusion-based TTS model that uses BERT to generate audio from text end-to-end
As a result, it achieves better performance than prior methods

2. Method

E3-TTS consists of the following two modules:
- A pre-trained BERT model that extracts information from the text
- A diffusion U-Net model that attends to the BERT output and predicts the raw waveform by iteratively refining a noisy waveform

- BERT Model

First, E3-TTS uses the text representation provided by a pre-trained BERT model
- The BERT model takes subwords as input and does not use phonemes, graphemes, etc.
- This makes it possible to fully leverage a pre-trained text language model trained only on text data in multiple languages, which simplifies the process
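
To make "subword input" concrete, here is a minimal sketch of greedy longest-match subword splitting in the WordPiece style used by the original BERT. The tokenizer actually used in the paper is not stated in this review, so the algorithm choice, the `wordpiece_tokenize` helper, and the tiny `TOY_VOCAB` are all assumptions for illustration only.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Split one lowercase word into subwords by greedy longest-match-first."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink from the right.
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces carry a '##' prefix
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk]  # no piece matched: the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary (a real BERT vocabulary has tens of thousands of entries).
TOY_VOCAB = {"text", "to", "speech", "synth", "##esis", "##es"}

tokens = [
    piece
    for word in "text to speech synthesis".split()
    for piece in wordpiece_tokenize(word, TOY_VOCAB)
]
print(tokens)
```

No phoneme or grapheme conversion step appears anywhere in this path, which is the simplification the bullets above describe.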

- Diffusion

E3-TTS builds on Score Matching and Diffusion Probabilistic Models
- First, the score function is defined as the gradient of the log conditional distribution $\log p(y|x)$ with respect to the output $y$:
  (Eq. 1) $s(y|x)=\nabla_{y}\log p(y|x)$
  - $y$ : waveform, $x$ : conditioning signal
- The score network $s(\tilde{y}|x,\bar{\alpha})$ is then trained to predict the scaled derivative by minimizing the distance between the model prediction and the ground-truth $\epsilon$:
  (Eq. 2) $\mathbb{E}_{\bar{\alpha},\epsilon}\left[\left\lVert \epsilon_{\theta}\left(\tilde{y},x,\sqrt{\bar{\alpha}}\right)-\epsilon\right\rVert_{2}\right]$
  - $\epsilon \sim \mathcal{N}(0,I)$ : noise term for the reparameterization trick, $\bar{\alpha}$ : noise level
- $\tilde{y}$ is obtained from the clean waveform $y_{0}$ as:
  (Eq. 3) $\tilde{y}=\sqrt{\bar{\alpha}}y_{0}+\sqrt{1-\bar{\alpha}}\epsilon$
- During training, $\bar{\alpha}$ is sampled from the interval $[\bar{\alpha}_{n},\bar{\alpha}_{n+1}]$ given by a pre-defined linear schedule of $\beta$:
  (Eq. 4) $\bar{\alpha}_{n}:=\prod_{s=1}^{n}(1-\beta_{s})$
- At each iteration, the updated waveform is estimated by the following stochastic process:
  (Eq. 5) $y_{n-1}=\frac{1}{\sqrt{\alpha_{n}}}\left(y_{n}-\frac{\beta_{n}}{\sqrt{1-\bar{\alpha}_{n}}}\epsilon_{\theta}\left(y_{n},x,\sqrt{\bar{\alpha}_{n}}\right)\right)+\sigma_{n}z$
  - $\alpha_{n}=1-\beta_{n}$ (consistent with Eq. 4), $z\sim\mathcal{N}(0,I)$, $\sigma_{n}$ : standard deviation of the noise added at step $n$
- The paper introduces a KL loss to aid convergence and to scale the magnitude of the $\epsilon$ loss
  1. Additionally, the model predicts the variance $\omega(\bar{\alpha})$ of the $L2$ loss as a function of the timestep
  2. In particular, the KL loss adjusts the loss weights across the different sampled timesteps:
    (Eq. 6) $\mathbb{E}_{\bar{\alpha},\epsilon}\left[\frac{1}{\omega(\bar{\alpha})}\left\lVert \epsilon_{\theta}\left(\tilde{y},x,\sqrt{\bar{\alpha}}\right)-\epsilon\right\rVert_{2}+\ln\left(\omega(\bar{\alpha})\right)\right]$
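
The equations above can be sketched in NumPy as follows. This is only an illustration under stated assumptions, not the paper's implementation: the schedule endpoints and step count in `make_schedule` are hypothetical, $\sigma_{n}$ is set to the usual DDPM posterior choice (the review does not say which one is used), no noise is added at the last step by convention, and the true noise stands in for the trained network $\epsilon_{\theta}$. The code uses 0-based indices, so `n = 0` corresponds to step 1 in the equations.

```python
import numpy as np


def make_schedule(num_steps=50, beta_start=1e-4, beta_end=0.05):
    """Linear beta schedule with alpha_bar as the cumulative product (Eq. 4).
    The endpoint values and step count are hypothetical."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    return betas, alphas, alpha_bars


def add_noise(y0, alpha_bar, eps):
    """Forward noising of a clean waveform (Eq. 3)."""
    return np.sqrt(alpha_bar) * y0 + np.sqrt(1.0 - alpha_bar) * eps


def reverse_step(y_n, eps_hat, n, betas, alphas, alpha_bars, rng):
    """One refinement step (Eq. 5); n is a 0-based index into the schedule."""
    mean = (y_n - betas[n] / np.sqrt(1.0 - alpha_bars[n]) * eps_hat) / np.sqrt(alphas[n])
    if n == 0:
        return mean  # assumption: no noise is added at the final step
    # Assumption: DDPM posterior standard deviation for sigma_n.
    sigma = np.sqrt((1.0 - alpha_bars[n - 1]) / (1.0 - alpha_bars[n]) * betas[n])
    return mean + sigma * rng.standard_normal(y_n.shape)


def weighted_loss(eps_hat, eps, omega):
    """Single-sample version of Eq. 6; omega is the predicted variance (> 0)."""
    return np.linalg.norm(eps_hat - eps) / omega + np.log(omega)


rng = np.random.default_rng(0)
betas, alphas, alpha_bars = make_schedule()
y0 = np.sin(np.linspace(0.0, 2.0 * np.pi, 256))  # stand-in "waveform"
eps = rng.standard_normal(y0.shape)

# Noise to the first level, then undo it with the true noise as an oracle.
y1 = add_noise(y0, alpha_bars[0], eps)
y0_rec = reverse_step(y1, eps, 0, betas, alphas, alpha_bars, rng)
print(np.max(np.abs(y0_rec - y0)))
```

With a perfect noise estimate the first term of `weighted_loss` vanishes and only $\ln\omega$ remains, which shows how the predicted variance trades off against the $\epsilon$ error at each noise level.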

- U-Net

The paper uses a 1D U-Net whose downsampling/upsampling blocks are connected by residuals
- In particular, following autoregressive TTS approaches, cross-attention that extracts information from the BERT output is applied in the top downsampling/upsampling blocks
  1. In the lower downsampling/upsampling blocks, an adaptive softmax CNN kernel determined by the timestep and speaker is applied
  2. In the remaining layers, the speaker and timestep embeddings are joined through FiLM
    - FiLM has a combined layer that predicts the channel-wise scaling and bias
- The downsampler refines the noisy information into a sequence whose length is similar to that of the encoded BERT output, and the upsampler predicts noise with the same length as the input waveform
- During training, the waveform length is fixed to 10.92s and the end of the waveform is padded with zeros
  - Each padding frame is given $\frac{1}{10}$ of the weight of a non-padding frame
- During inference, the output waveform length is fixed; to identify the padded part, the average magnitude is computed per 1024 samples and the parts with magnitude $\leq 0.02$ are cut off
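
The padding and trimming rules above can be sketched as below. The 1024-sample window, the 0.02 threshold, and the 1/10 padding weight come from the text; everything else is an assumption for illustration: the helper names `pad_to_length` and `trim_padding`, applying the weight per sample rather than per frame, trimming only trailing quiet windows (since padding sits at the end), and the short toy length used in place of the real 10.92s.

```python
import numpy as np

FRAME = 1024       # samples per magnitude window (from the text)
THRESHOLD = 0.02   # average-magnitude cutoff (from the text)


def pad_to_length(wave, target_len, pad_weight=0.1):
    """Zero-pad the end of a waveform and build per-sample loss weights
    (1.0 for real samples, 1/10 for padding)."""
    n = min(len(wave), target_len)
    padded = np.zeros(target_len, dtype=np.float32)
    padded[:n] = wave[:n]
    weights = np.full(target_len, pad_weight, dtype=np.float32)
    weights[:n] = 1.0
    return padded, weights


def trim_padding(wave, frame=FRAME, threshold=THRESHOLD):
    """Drop trailing windows whose mean absolute amplitude is <= threshold.
    Any remainder shorter than one window is dropped as well."""
    n_frames = len(wave) // frame
    if n_frames == 0:
        return wave
    mags = np.abs(wave[: n_frames * frame]).reshape(n_frames, frame).mean(axis=1)
    loud = np.nonzero(mags > threshold)[0]
    if loud.size == 0:
        return wave[:0]  # everything is below the threshold
    return wave[: (loud[-1] + 1) * frame]


# Toy example: 3 windows of a sine "utterance" padded to 8 windows.
t = np.arange(3 * FRAME)
speech = (0.5 * np.sin(2.0 * np.pi * t / 64.0)).astype(np.float32)
padded, weights = pad_to_length(speech, 8 * FRAME)
trimmed = trim_padding(padded)
print(len(padded), len(trimmed))
```

During training the `weights` array would multiply the per-sample loss, so the model still learns to output silence in the padded region without letting it dominate the objective.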

3. Experiments

- Settings

Dataset : English Dataset (internal)
Comparisons : Tacotron, WaveTacotron

- Results

Overall, E3-TTS performs best among the compared models

Waveform Prompt-based TTS & Text-based Speech Editing
- E3-TTS achieves strong performance on both prompt-based TTS and speech editing

Speaker Similarity
- Sampling with more timesteps yields better speaker similarity

Sample Diversity
- Sample diversity is compared using the Frechet Speaker Distance (FSD), computed as:
  (Eq. 7) $\text{FSD}_{A,B}=\lVert\mu_{A}-\mu_{B}\rVert^{2}+\text{Tr}\left(C_{A}+C_{B}-2\left(C_{A}C_{B}\right)^{1/2}\right)$
  - $\mu$ : mean of the output speaker embeddings, $C$ : covariance
- E3-TTS significantly improves sample diversity over the existing baselines
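
As a rough illustration of Eq. 7, the distance can be computed from two sets of speaker embeddings as below. The embeddings here are random stand-ins rather than outputs of a speaker encoder, and the function names are hypothetical. The matrix square root uses the identity $\text{Tr}\left((C_{A}C_{B})^{1/2}\right)=\text{Tr}\left((C_{A}^{1/2}C_{B}C_{A}^{1/2})^{1/2}\right)$ so that only symmetric positive semi-definite matrices need to be decomposed.

```python
import numpy as np


def _sqrtm_psd(mat):
    """Square root of a symmetric PSD matrix via eigen-decomposition."""
    mat = (mat + mat.T) / 2.0              # remove tiny numerical asymmetry
    vals, vecs = np.linalg.eigh(mat)
    vals = np.clip(vals, 0.0, None)        # guard against small negative eigenvalues
    return (vecs * np.sqrt(vals)) @ vecs.T


def frechet_speaker_distance(emb_a, emb_b):
    """Eq. 7 for two embedding sets of shape (num_samples, dim)."""
    mu_a, mu_b = emb_a.mean(axis=0), emb_b.mean(axis=0)
    c_a = np.cov(emb_a, rowvar=False)
    c_b = np.cov(emb_b, rowvar=False)
    root_a = _sqrtm_psd(c_a)
    cross = _sqrtm_psd(root_a @ c_b @ root_a)
    mean_term = np.sum((mu_a - mu_b) ** 2)
    cov_term = np.trace(c_a) + np.trace(c_b) - 2.0 * np.trace(cross)
    return float(mean_term + cov_term)


rng = np.random.default_rng(0)
emb = rng.standard_normal((200, 8))        # stand-in speaker embeddings
print(frechet_speaker_distance(emb, emb))        # identical sets
print(frechet_speaker_distance(emb, emb + 3.0))  # same covariance, shifted mean
```

Shifting every embedding by a constant leaves the covariance term unchanged, so only the squared mean difference contributes in the second call.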
