[Paper Review] Diff-TTS: A Denoising Diffusion Model for Text-to-Speech

feVeRin 2023. 12. 19. 11:05

Diff-TTS: A Denoising Diffusion Model for Text-to-Speech

Neural text-to-speech models still fall short in synthesis naturalness and architectural efficiency
Diff-TTS
- Uses denoising diffusion to transform a noise signal into a mel-spectrogram for the given text
- Likelihood-based optimization to learn the mel-spectrogram distribution conditioned on text
- Introduces accelerated sampling to speed up inference
Paper (INTERSPEECH 2021) : Paper Link

1. Introduction

Most neural text-to-speech (TTS) systems consist of a TTS model that converts text into a mel-spectrogram and a vocoder that generates a waveform from the acoustic features
TTS models can be classified into autoregressive (AR) and non-AR models
- AR models : generate high-quality samples by factorizing the output distribution into a product of sequential conditional distributions
  - Tacotron2 and Transformer-TTS produce natural speech, but inference time grows linearly with the mel-spectrogram length
  - Accumulated prediction errors cause problems such as word skipping and repeating
- Non-AR models : more stable synthesis and faster inference than AR models
  - FastSpeech2 and SpeedySpeech produce less diverse speech
  - Glow-TTS and Flow-TTS are parameter-inefficient because normalizing flows constrain the architecture
- Denoising diffusion : has recently shown strong performance in image and audio synthesis
  - Denoising diffusion models can be optimized by maximum likelihood and leave the choice of architecture free

-> Diff-TTS is proposed as a TTS model that exploits these advantages of denoising diffusion

Diff-TTS
- A robust, controllable, high-quality non-AR TTS model
- Log-likelihood-based optimization that trains without any auxiliary loss
- An accelerated sampling method that speeds up inference under the Markov-chain constraint

< Overview of Diff-TTS >

First to apply the denoising diffusion probabilistic model (DDPM) to non-AR TTS
Synthesizes higher-fidelity audio than Tacotron2 and Glow-TTS while using fewer parameters
Achieves fast inference through accelerated sampling, which also controls the trade-off between sample quality and inference speed
Controls prosody and pitch variability through the latent space and additive noise

2. Diff-TTS

- Denoising Diffusion Model for TTS

Diff-TTS transforms a noise distribution into the mel-spectrogram distribution corresponding to the given text
Diffusion Process
- The process in which a mel-spectrogram is gradually corrupted by Gaussian noise and converted into a latent variable
- Let $x_{0}, x_{1}, ..., x_{T}$ be a sequence of variables of the same dimension, indexed by the diffusion time step $t=0, 1, ..., T$; then
  - The diffusion process transforms the mel-spectrogram $x_{0}$ into Gaussian noise $x_{T}$ through a chain of Markov transitions
  - Each transition step is predefined by the variance schedule $\beta_{1}, \beta_{2}, ..., \beta_{T}$
- Each transition follows the Markov transition probability $q(x_{t} | x_{t-1}, c)$, which is assumed to be independent of the text $c$:
  $q(x_{t} | x_{t-1}, c) = \mathcal{N}(x_{t} ; \sqrt{1-\beta_{t}} x_{t-1}, \beta_{t}I)$
- The whole diffusion process $q(x_{1:T} | x_{0}, c)$ is a Markov process and factorizes as:
  $q(x_{1}, ..., x_{T} | x_{0}, c) = \prod^{T}_{t=1} q(x_{t} | x_{t-1})$
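
The forward chain can be sketched in a few lines of plain Python. This is only an illustration: the linear schedule and its endpoints (`beta_start`, `beta_end`), the helper names, and the toy vector standing in for a mel-spectrogram are all assumptions and are not taken from the paper.

```python
import math
import random

def linear_beta_schedule(T, beta_start=1e-4, beta_end=0.05):
    """Hypothetical linear variance schedule beta_1..beta_T (endpoints are assumptions)."""
    if T == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (T - 1)
    return [beta_start + step * i for i in range(T)]

def forward_step(x_prev, beta_t, rng):
    """One Markov transition: x_t ~ N(sqrt(1 - beta_t) * x_{t-1}, beta_t * I)."""
    scale = math.sqrt(1.0 - beta_t)
    std = math.sqrt(beta_t)
    return [scale * v + std * rng.gauss(0.0, 1.0) for v in x_prev]

def diffuse(x0, betas, rng):
    """Apply all T transitions in order, turning x_0 into x_T."""
    x = list(x0)
    for beta_t in betas:
        x = forward_step(x, beta_t, rng)
    return x

rng = random.Random(0)
betas = linear_beta_schedule(T=400)
x0 = [1.0] * 2000                      # toy stand-in for flattened mel-spectrogram values
xT = diffuse(x0, betas, rng)

# After enough steps the signal is gone and x_T looks like standard Gaussian noise.
mean = sum(xT) / len(xT)
var = sum((v - mean) ** 2 for v in xT) / len(xT)
```

Note that the text $c$ never enters `forward_step`, which matches the assumption that the forward transitions are independent of the condition.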

Reverse Process
- The inverse of the diffusion process: recovers a mel-spectrogram from Gaussian noise
- The reverse process is defined as the conditional distribution $p_{\theta} (x_{0:T-1} | x_{T}, c)$ and, by the Markov chain property, factorizes into multiple transitions:
  $p_{\theta}(x_{0}, ..., x_{T-1} | x_{T}, c) = \prod^{T}_{t=1} p_{\theta} (x_{t-1} | x_{t}, c)$
- Through the reverse transitions $p_{\theta} (x_{t-1} | x_{t}, c)$, the latent variable is gradually restored to the mel-spectrogram corresponding to the text condition and the diffusion time step
  - Diff-TTS can be seen as learning the model distribution $p_{\theta}(x_{0}|c)$ obtained through the reverse process
Training Objective
- For Diff-TTS to approximate $q(x_{0}|c)$ well, the reverse process must maximize the log-likelihood of the mel-spectrogram:
  $\mathbb{E}_{q(x_{0}|c)} [ \log p_{\theta} (x_{0}|c)] $
  - $q(x_{0}|c)$ : the mel-spectrogram distribution
  -> Because $p_{\theta} (x_{0}|c)$ is intractable, the parameterization trick is used to compute a closed-form variational lower bound on the log-likelihood
- The training objective of Diff-TTS is:
  $\min_{\theta} L(\theta) = \mathbb{E}_{x_{0}, \epsilon, t} || \epsilon - \epsilon_{\theta} ( \sqrt{\bar{\alpha}_{t}} x_{0} + \sqrt{1-\bar{\alpha}_{t}} \epsilon, t, c) ||_{1}$
  - $\alpha_{t} = 1-\beta_{t}$, $\bar{\alpha}_{t} = \prod^{t}_{t'=1} \alpha_{t'}$
  - $t$ : drawn uniformly from the full range of diffusion time steps
  -> Diff-TTS needs no auxiliary loss beyond the L1 loss between the output of the model $\epsilon_{\theta}(\cdot)$ and the Gaussian noise $\epsilon \sim \mathcal{N}(0,I)$
- During inference, Diff-TTS uses $\epsilon_{\theta}(x_{t}, t,c)$ to iteratively predict the diffusion noise added at each forward transition
  - It then removes the corrupted part to recover the mel-spectrogram from the latent variable:
    $x_{t-1} = \frac{1}{\sqrt{\alpha_{t}}} (x_{t} - \frac{1-\alpha_{t}} { \sqrt{1-\bar{\alpha}_{t}} } \epsilon_{\theta} (x_{t}, t,c)) + \sigma_{t}z_{t}$
  - $z_{t} \sim \mathcal{N}(0, I)$, $\sigma_{t} = \eta \sqrt{ \frac{ 1-\bar{\alpha}_{t-1} }{ 1-\bar{\alpha}_{t} } \beta_{t} }$
  - $\eta$ : temperature term that scales the variance
- The diffusion time step $t$ is also fed to Diff-TTS as an input, so parameters are shared across all diffusion time steps
  - As a result, the final mel-spectrogram distribution $p(x_{0}|c)$ is obtained by iterative sampling over all time steps
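
A minimal sketch of the L1 noise-prediction loss and the reverse sampling loop, written directly from the two equations above. The noise model here is not a trained network: `oracle_eps` is a hypothetical stand-in that "cheats" by receiving the true $x_{0}$ as its condition, which makes the arithmetic easy to check. The schedule values, `T = 50`, and all function names are assumptions.

```python
import math
import random

def alpha_bars(betas):
    """Return [abar_0, ..., abar_T], with abar_0 = 1 (empty product)."""
    out = [1.0]
    for b in betas:
        out.append(out[-1] * (1.0 - b))
    return out

def l1_noise_loss(eps_model, x0, c, abar, rng):
    """Single-sample estimate of || eps - eps_theta(x_t, t, c) ||_1."""
    t = rng.randint(1, len(abar) - 1)                  # t ~ Uniform{1, ..., T}
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    x_t = [math.sqrt(abar[t]) * v + math.sqrt(1.0 - abar[t]) * e
           for v, e in zip(x0, eps)]
    pred = eps_model(x_t, t, c)
    return sum(abs(e - p) for e, p in zip(eps, pred))

def reverse_step(eps_model, x_t, t, c, betas, abar, eta, rng):
    """One reverse transition x_t -> x_{t-1} with temperature eta."""
    beta_t = betas[t - 1]                              # betas[0] holds beta_1
    alpha_t = 1.0 - beta_t
    pred = eps_model(x_t, t, c)
    coef = (1.0 - alpha_t) / math.sqrt(1.0 - abar[t])
    sigma_t = eta * math.sqrt((1.0 - abar[t - 1]) / (1.0 - abar[t]) * beta_t)
    return [(v - coef * p) / math.sqrt(alpha_t) + sigma_t * rng.gauss(0.0, 1.0)
            for v, p in zip(x_t, pred)]

def sample(eps_model, c, dim, betas, eta, rng):
    """Start from x_T ~ N(0, I) and run every reverse transition t = T, ..., 1."""
    abar = alpha_bars(betas)
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    for t in range(len(betas), 0, -1):
        x = reverse_step(eps_model, x, t, c, betas, abar, eta, rng)
    return x

# Toy check with an assumed schedule and a hypothetical oracle noise model.
T = 50
betas = [1e-4 + (0.05 - 1e-4) * i / (T - 1) for i in range(T)]
abar = alpha_bars(betas)
target = [0.5, -1.0, 2.0]

def oracle_eps(x_t, t, c):
    # Stand-in for the trained decoder: recovers the exact noise because c is x_0 here.
    return [(v - math.sqrt(abar[t]) * x0) / math.sqrt(1.0 - abar[t])
            for v, x0 in zip(x_t, c)]

rng = random.Random(0)
loss = l1_noise_loss(oracle_eps, target, target, abar, rng)
recovered = sample(oracle_eps, target, len(target), betas, eta=0.6, rng=rng)
```

With a perfect noise predictor the loss is zero up to rounding, and the last step ($t=1$) is noise-free for any $\eta$ because $\bar{\alpha}_{0}=1$ gives $\sigma_{1}=0$, so `recovered` matches `target`. A real model would replace `oracle_eps` with the decoder conditioned on the text.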

- Accelerated Sampling

Denoising diffusion is slow at inference when the number of diffusion steps is large
- To address this, the Denoising Diffusion Implicit Model (DDIM) introduced accelerated sampling
- Accelerated sampling generates samples over a subsequence of the full inference trajectory
  - Sample quality does not degrade much even when the number of reverse transitions is reduced
  -> Diff-TTS adopts accelerated sampling to speed up synthesis while preserving sample quality
In Diff-TTS, reverse transitions are skipped according to a decimation factor $\gamma$
- The new reverse path consists of transitions selected periodically from the original reverse path
- Let $\tau = [ \tau_{1}, \tau_{2}, ..., \tau_{M} ] \, (M < T)$ be the new reverse path sampled from the time steps $1, 2, ..., T$; then
  1. For $i>1$, the accelerated sampling equation is:
    $x_{\tau_{i-1}} = \sqrt{\bar{\alpha}_{\tau_{i-1}}} \left( \frac{ x_{\tau_{i}} - \sqrt{ 1-\bar{\alpha}_{\tau_{i}} } \epsilon_{\theta} (x_{\tau_{i}}, \tau_{i}, c) } { \sqrt{\bar{\alpha}_{\tau_{i}}} } \right) + \sqrt{1-\bar{\alpha}_{\tau_{i-1}} - \sigma^{2}_{\tau_{i}} } \epsilon_{\theta} (x_{\tau_{i}}, \tau_{i}, c) + \sigma_{\tau_{i}} z_{\tau_{i}} $
    - where $\sigma_{\tau_{i}} = \eta \sqrt{ \frac{1-\bar{\alpha}_{\tau_{i-1}}} {1-\bar{\alpha}_{\tau_{i}}} \beta_{\tau_{i}}}$
  2. For $i=1$, the sampling equation is:
    $x_{0} = \frac { x_{\tau_{1}} - \sqrt{ 1-\bar{\alpha}_{\tau_{1}}} \epsilon_{\theta} (x_{\tau_{1}}, \tau_{1}, c) } { \sqrt{\bar{\alpha}_{\tau_{1}}} } $
- With accelerated sampling, Diff-TTS generates high-fidelity mel-spectrograms even when sampling only over the subsequence $\tau$
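
The two sampling equations above can be sketched as a single loop over $\tau$. How $\tau$ is picked is an assumption here: `make_tau` keeps every $\gamma$-th step starting from $t=1$, which is one plausible reading of "periodically selected". The schedule values and the `oracle_eps` stand-in (which is handed the true $x_{0}$ as its condition) are likewise hypothetical.

```python
import math
import random

def alpha_bars(betas):
    """Return [abar_0, ..., abar_T], with abar_0 = 1."""
    out = [1.0]
    for b in betas:
        out.append(out[-1] * (1.0 - b))
    return out

def make_tau(T, gamma):
    """Assumed periodic selection: keep every gamma-th step, starting at t = 1."""
    return list(range(1, T + 1, gamma))

def accelerated_sample(eps_model, c, dim, betas, gamma, eta, rng):
    abar = alpha_bars(betas)
    tau = make_tau(len(betas), gamma)
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]      # latent at the last kept step tau_M
    for i in range(len(tau) - 1, -1, -1):
        t = tau[i]
        eps = eps_model(x, t, c)
        # Estimate of x_0 implied by the current latent and the predicted noise.
        x0_pred = [(v - math.sqrt(1.0 - abar[t]) * e) / math.sqrt(abar[t])
                   for v, e in zip(x, eps)]
        if i == 0:                                     # the i = 1 case in the text
            return x0_pred
        s = tau[i - 1]
        sigma = eta * math.sqrt((1.0 - abar[s]) / (1.0 - abar[t]) * betas[t - 1])
        dir_coef = math.sqrt(max(0.0, 1.0 - abar[s] - sigma ** 2))
        x = [math.sqrt(abar[s]) * p + dir_coef * e + sigma * rng.gauss(0.0, 1.0)
             for p, e in zip(x0_pred, eps)]
    return x

# Toy check with an assumed schedule and a hypothetical oracle noise model.
T = 400
betas = [1e-4 + (0.05 - 1e-4) * i / (T - 1) for i in range(T)]
abar = alpha_bars(betas)
target = [0.5, -1.0, 2.0]

def oracle_eps(x_t, t, c):
    # Stand-in for the trained decoder: exact noise, because c is x_0 here.
    return [(v - math.sqrt(abar[t]) * x0) / math.sqrt(1.0 - abar[t])
            for v, x0 in zip(x_t, c)]

tau = make_tau(T, 57)
x0_hat = accelerated_sample(oracle_eps, target, len(target), betas,
                            gamma=57, eta=0.0, rng=random.Random(0))
```

Under this selection rule, $T=400$ with $\gamma=57$ leaves 8 reverse transitions instead of 400, which is where the speed-up comes from; with a trained model the quality of `x0_hat` would depend on how well $\epsilon_{\theta}$ generalizes across the skipped steps.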

- Model Architecture

Diff-TTS consists of a text encoder, step encoder, duration predictor, and decoder
Encoder
- Extracts contextual information from the phoneme sequence and passes it to the duration predictor and decoder
- The encoder pre-net consists of an embedding, a fully-connected (FC) layer, and ReLU activation
  - The text encoder takes phoneme embeddings as input
- The encoder module consists of 10 residual blocks with dilated convolution, followed by an LSTM
Duration Predictor and Length Regulator
- Diff-TTS uses a length regulator to match the lengths of the phoneme and mel-spectrogram sequences
- The length regulator needs alignment information to expand the phoneme sequence and control the speech rate
  - The Montreal Forced Aligner (MFA) is used
- The duration predictor predicts durations in the logarithmic domain, using the durations extracted by MFA as targets
  - The duration predictor is optimized by minimizing an L1 loss
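
As a rough illustration of the length regulator and the log-domain duration loss, the sketch below expands per-phoneme states by their frame counts. The `speed` parameter, the `+1` offset inside the logarithm, and the function names are assumptions borrowed from common FastSpeech-style implementations, not details confirmed by the paper.

```python
import math

def length_regulate(phoneme_states, durations, speed=1.0):
    """Repeat each phoneme state for its number of frames; speed > 1 shortens the output."""
    expanded = []
    for state, d in zip(phoneme_states, durations):
        n_frames = max(0, int(round(d / speed)))
        expanded.extend([state] * n_frames)
    return expanded

def log_duration_l1(pred_log_durations, aligner_durations):
    """Mean L1 loss between predicted log-durations and log of the aligner durations."""
    # The +1 offset is an assumption that avoids log(0) for zero-length phonemes.
    targets = [math.log(d + 1.0) for d in aligner_durations]
    return sum(abs(p - t) for p, t in zip(pred_log_durations, targets)) / len(targets)

# Toy example: three phonemes with durations of 2, 3, and 1 frames.
phonemes = ["HH", "AH", "L"]
durations = [2, 3, 1]
frames = length_regulate(phonemes, durations)

# A perfect prediction gives zero loss.
perfect = [math.log(d + 1.0) for d in durations]
loss = log_duration_l1(perfect, durations)
```

At training time the aligner durations drive the expansion; at inference the predicted durations (mapped back from the log domain) would take their place.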
Decoder and Step Encoder
- The decoder predicts Gaussian noise from the $t$-th step latent variable, conditioned on the phoneme embedding and the diffusion step embedding
  - The decoder receives a step embedding from the step encoder to tell it the diffusion time step, so each diffusion time step has a different $\epsilon_{\theta} (\cdot, t,c)$
  - The diffusion step embedding is a sinusoidal embedding that uses a 128-dimensional encoding vector for each $t$
- The step encoder consists of 2 FC layers and Swish activation
- The decoder network is a stack of residual blocks, each consisting of Conv1D, $1 \times 1$ convolution, tanh, and sigmoid
  - Finally, the decoder outputs the Gaussian noise corresponding to the phoneme sequence and diffusion time step
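
The building blocks named above can be sketched without a deep-learning framework. The frequency spacing in `sinusoidal_step_embedding` (a Transformer-style `max_period` of 10000) and the way tanh and sigmoid are combined into a gate are assumptions; the notes only state that the embedding is sinusoidal with 128 dimensions and that the residual blocks contain tanh and sigmoid.

```python
import math

def sinusoidal_step_embedding(t, dim=128, max_period=10000.0):
    """128-dimensional sinusoidal encoding of the diffusion step t (frequencies assumed)."""
    half = dim // 2
    freqs = [math.exp(-math.log(max_period) * j / half) for j in range(half)]
    return [math.sin(t * f) for f in freqs] + [math.cos(t * f) for f in freqs]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def swish(x):
    """Swish activation used in the step encoder: x * sigmoid(x)."""
    return x * sigmoid(x)

def gated_activation(filter_part, gate_part):
    """WaveNet-style gate, tanh(filter) * sigmoid(gate), applied element-wise (assumed form)."""
    return [math.tanh(f) * sigmoid(g) for f, g in zip(filter_part, gate_part)]

emb0 = sinusoidal_step_embedding(0)
emb = sinusoidal_step_embedding(37)
gated = gated_activation([0.0, 1.0], [0.0, 0.0])
```

In a full model the embedding would pass through the two FC layers with Swish and then be added inside every residual block, which is how a single shared network can represent a different $\epsilon_{\theta} (\cdot, t,c)$ for each step.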

3. Experiments

- Settings

Dataset : LJSpeech
Comparison : Tacotron2, Glow-TTS

- Audio Quality and Model Size

Audio Quality
- Diff-TTS achieves its best audio quality at $T=400, \gamma=1$
- Accelerated sampling enables fast synthesis without significantly degrading audio quality

Model Size
- Diff-TTS has 53% fewer parameters than Tacotron2 and Glow-TTS
- Diff-TTS is memory-efficient because it uses the denoising diffusion framework

- Inference Speed

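In terms of real-time factor (RTF), Diff-TTS reaches 0.035 RTF at $\gamma=57$, which is faster than Tacotron2
- Diff-TTS has a higher (slower) RTF than Glow-TTS, but it has the advantage that the balance between synthesis quality and inference speed can be adjusted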
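
For readers unfamiliar with the metric, RTF is the time spent synthesizing divided by the duration of the audio produced, so values below 1 mean faster than real time. The timings below are made-up numbers chosen only to reproduce an RTF of 0.035; they are not measurements from the paper.

```python
def real_time_factor(synthesis_seconds, audio_seconds):
    """RTF = time spent synthesizing / duration of the audio produced."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds

# Hypothetical timing: 0.35 s of compute for a 10 s utterance.
rtf = real_time_factor(0.35, 10.0)
faster_than_real_time = rtf < 1.0
```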

- Variability and Controllability

The prosody of the generated speech depends on the latent representation $x_{T}$ and the Gaussian noise $\sigma_{t} z_{t}$ added during inference
- Diff-TTS can control prosody by multiplying the latent space and the additive noise by a temperature term
- Comparing F0 trajectories for temperature terms $\eta \in \{ 0.2, 0.6 \}$, a larger temperature yields more diverse speech while maintaining synthesis quality
