[Paper Review] ProDiff: Progressive Fast Diffusion Model for High-Quality Text-to-Speech

feVeRin 2024. 5. 4. 12:14

ProDiff: Progressive Fast Diffusion Model for High-Quality Text-to-Speech

Diffusion models perform well in text-to-speech, but their iterative sampling process limits how far inference can be accelerated
In particular, gradient-based models need thousands of iterations to guarantee high quality
ProDiff
- A progressive fast diffusion model for high-quality text-to-speech
- To prevent the quality degradation caused by sampling acceleration, the denoising model is parameterized by directly predicting clean data
- To resolve the model convergence problem that arises when diffusion iterations are reduced, knowledge distillation is used
  - The mel-spectrogram generated by an $N$-step DDIM teacher serves as the training target and is distilled into a new $N/2$-step model
Paper (MM 2022) : Paper Link

1. Introduction

Text-to-Speech (TTS) aims to synthesize speech that corresponds to a given text
- Conventional systems generate a mel-spectrogram autoregressively and then convert it to a waveform with a separate vocoder
  - Non-autoregressive methods synthesize faster than autoregressive ones, but are limited in sample quality and diversity
- In short, TTS models face the following challenges:
  1. High-Quality : to improve the naturalness of synthesized speech, the model must capture details such as the frequency bins between adjacent harmonics, unvoiced frames, and the high-frequency part
  2. Fast : the model needs a high generation speed for real-time synthesis
  3. Diverse : to avoid dull, tedious output on long utterances, the model must avoid mode collapse and unimodal prediction
- Diffusion models such as the Denoising Diffusion Probabilistic Model (DDPM) have drawn attention for their strong generative performance, but they have two limitations:
  1. DDPM is a gradient-based model trained with a score matching objective, so it needs thousands of denoising steps or more to guarantee high sample quality
    - This limits real-world use
  2. Reducing the refinement iterations speeds up generation, but the complex data distribution leads to blurry, over-smooth mel-spectrograms
    - That is, fewer sampling steps introduce perceivable background noise and degrade synthesis quality
- A generator-based method, which parameterizes the denoising model by directly predicting clean data with a neural network, can accelerate diffusion sampling

-> ProDiff, a progressive fast diffusion model that delivers high-quality TTS with faster sampling, is proposed

ProDiff
- To prevent the loss of perceptual quality when reverse iterations are reduced, it directly predicts clean data $x$ instead of estimating the gradient for score matching
- To resolve the model convergence problem caused by fewer diffusion iterations, it uses knowledge distillation
  - Specifically, the denoising model takes the mel-spectrogram generated by an $N$-step DDIM teacher as its training target and distills that behavior into a new $N/2$-step model

< Overview of ProDiff >

Compares diffusion parameterizations on TTS and shows that directly predicting clean data is more effective for sampling acceleration than the conventional gradient-based DDPM with score matching
Applies knowledge distillation to address the convergence problem that arises in diffusion acceleration
As a result, obtains high-quality speech while synthesizing faster than existing models

2. Background on Diffusion Models

The diffusion/reverse processes are given by a diffusion probabilistic model and can be used to train a denoising neural network $\theta$ that learns the data distribution
- With a pre-defined fixed noise schedule $\beta$ and diffusion step $t$, the constants of the diffusion/reverse processes are computed as:
  (Eq. 1) $\alpha_{t}=\prod_{i=1}^{t}\sqrt{1-\beta_{i}},\,\,\, \sigma_{t}=\sqrt{1-\alpha_{t}^{2}}$
- Diffusion Process
  1. With the data distribution defined as $q(x_{0})$, the diffusion process is a fixed Markov chain from data $x_{0}$ to the latent variable $x_{T}$:
    (Eq. 2) $q(x_{1},...,x_{T}|x_{0})=\prod_{t=1}^{T}q(x_{t}|x_{t-1})$
    - For a small positive constant $\beta_{t}$, each transition $q(x_{t}|x_{t-1})$ adds a small amount of Gaussian noise to $x_{t-1}$ to produce $x_{t}$
  2. The whole diffusion process gradually converts data $x_{0}$ into the whitened latent $x_{T}$ according to the fixed noise schedule $\beta_{1},...,\beta_{T}$:
    (Eq. 3) $q(x_{t}|x_{t-1}) := \mathcal{N}(x_{t};\sqrt{1-\beta_{t}}x_{t-1},\beta_{t}I)$
- Reverse Process
  1. The reverse process is a Markov chain from $x_{T}$ to $x_{0}$ parameterized by a shared $\theta$, which aims to recover samples from Gaussian noise:
    (Eq. 4) $p_{\theta}(x_{0},...,x_{T-1}|x_{T})=\prod_{t=1}^{T}p_{\theta}(x_{t-1}|x_{t})$
  2. Each iteration removes the Gaussian noise added in the diffusion process:
    (Eq. 5) $p_{\theta}(x_{t-1}|x_{t}):=\mathcal{N}(x_{t-1};\mu_{\theta}(x_{t},t),\sigma_{t}^{2}I)$
- Such diffusion models can learn diverse data distributions, but reverse sampling needs thousands of iterative steps to reconstruct the target distribution
  - ProDiff therefore aims to build a progressive fast conditional diffusion model that reduces reverse iterations and improves computational efficiency
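
As a minimal sketch of (Eq. 1) and (Eq. 3), the snippet below computes $\alpha_{t}$ and $\sigma_{t}$ from a noise schedule and draws $x_{t}$ in one shot via $x_{t}=\alpha_{t}x_{0}+\sigma_{t}\epsilon$, the closed form obtained by chaining the Gaussian transitions (it is the same expression that appears inside (Eq. 8)). The linear schedule values (a commonly used DDPM setting) and the names `make_constants` and `diffuse` are illustrative assumptions, not taken from the paper.

```python
import math
import random

def make_constants(betas):
    """Return (alphas, sigmas) indexed by t = 0..T following Eq. 1, with alpha_0 = 1."""
    alphas = [1.0]
    for beta in betas:
        alphas.append(alphas[-1] * math.sqrt(1.0 - beta))
    sigmas = [math.sqrt(max(0.0, 1.0 - a * a)) for a in alphas]
    return alphas, sigmas

def diffuse(x0, t, alphas, sigmas, rng):
    """Sample x_t ~ N(alpha_t * x0, sigma_t^2 I) for x0 given as a list of floats."""
    return [alphas[t] * v + sigmas[t] * rng.gauss(0.0, 1.0) for v in x0]

# Hypothetical linear schedule with T = 1000 steps (assumed values, for illustration only).
T = 1000
betas = [1e-4 + (0.02 - 1e-4) * i / (T - 1) for i in range(T)]
alphas, sigmas = make_constants(betas)

rng = random.Random(0)
x0 = [0.5, -1.0, 2.0]                      # stand-in for a few mel-spectrogram bins
x_mid = diffuse(x0, T // 2, alphas, sigmas, rng)
x_T = diffuse(x0, T, alphas, sigmas, rng)  # almost pure noise, since alpha_T is close to 0
print("alpha_T =", alphas[T], "sigma_T =", sigmas[T])
```

With this schedule $\alpha_{T}$ is close to 0 and $\sigma_{T}$ close to 1, which is what "whitened latent" means in practice.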

3. Diffusion Model Parameterization

To parameterize the reverse denoising model $\theta$ in a way that yields an implied prediction, two approaches are considered
1. Gradient-based Method : the denoising model learns the gradient of the data log-density and predicts samples in $\epsilon$ space
2. Generator-based Method : the denoising model directly predicts clean data $x_{0}$ and optimizes the sample reconstruction error

- Gradient-based Method

The Stein score function is the gradient of the data log-density $\log p(x)$ with respect to data $x$
- Given the Stein score function $s(\cdot) =\nabla_{x}\log p(x)$, Langevin dynamics draws samples $\tilde{x}\sim p(x)$ from that density, which can be viewed as stochastic gradient ascent in data space:
  (Eq. 6) $\tilde{x}_{t+1}=\tilde{x}_{t}+\frac{\eta}{2}s(\tilde{x}_{t})+\sqrt{\eta}z_{t}$
  - $\eta>0$ : step size, $z_{t}\sim\mathcal{N}(0,I)$
- A score matching network learns the Stein score function $s(\cdot)$ and uses Langevin dynamics for inference; for every step $t$, the denoising score matching objective is:
  (Eq. 7) $\mathbb{E}_{x\sim p(x)}\mathbb{E}_{\tilde{x}\sim q(\tilde{x}|x)}\left[ ||s_{\theta}(\tilde{x})-\nabla_{\tilde{x}}\log q(\tilde{x}|x)||_{2}^{2}\right]$
  - $\nabla_{\tilde{x}}\log q(\tilde{x}|x)=-\frac{\epsilon}{\sigma_{t}}$ : proportional to the Gaussian noise $\epsilon$
- The Denoising Diffusion Probabilistic Model (DDPM), in turn, parameterizes the denoising model $\theta$ by predicting $\epsilon$ directly with a neural network
  - DDPM and score matching networks are closely connected, and together they are referred to as gradient-based methods
- In gradient-based methods, the training objective is the mean squared error in $\epsilon$ space, and training optimizes a random term of $t$ with stochastic gradient descent:
  (Eq. 8) $\mathcal{L}_{\theta}^{Grad}=\left|\left| \epsilon_{\theta}\left(\alpha_{t}x_{0}+\sqrt{1-\alpha^{2}_{t}}\epsilon\right)-\epsilon\right|\right|_{2}^{2},\,\,\, \epsilon\sim\mathcal{N}(0,I)$
  - Gradient-based diffusion models need thousands of denoising steps or more to guarantee high quality, so they incur substantial computational cost
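
The Langevin update in (Eq. 6) can be tried on a toy problem. In the sketch below, the analytic score of a 1-D Gaussian stands in for a learned $s_{\theta}$; the target distribution $\mathcal{N}(3, 1)$, the step size, and the function names are assumptions made only for illustration.

```python
import math
import random

def score_gaussian(x, mean=3.0, std=1.0):
    """Stein score of N(mean, std^2): d/dx log p(x) = -(x - mean) / std^2."""
    return -(x - mean) / (std * std)

def langevin_sample(score, rng, n_steps=200, eta=0.1):
    """Iterate Eq. 6 from x = 0 for n_steps and return the final state."""
    x = 0.0
    for _ in range(n_steps):
        z = rng.gauss(0.0, 1.0)
        x = x + 0.5 * eta * score(x) + math.sqrt(eta) * z
    return x

rng = random.Random(0)
samples = [langevin_sample(score_gaussian, rng) for _ in range(2000)]
mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
print(f"empirical mean = {mean:.2f}, variance = {var:.2f}")
```

The empirical mean and variance should land near 3 and 1. Each sample here already costs 200 score evaluations on a trivial density, which hints at why gradient-based samplers become expensive on mel-spectrograms.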

- Generator-based Method

The generator-based method interprets the diffusion model as parameterizing the denoising model by directly predicting clean data
- Because $x_{t}$ carries varying levels of perturbation, it is hard for a single gradient-based parameterization network to directly predict $x_{t-1}$ at different $t$
- Instead of estimating the gradient of the data density, the generator-based method predicts the unperturbed $x_{0}$ and then adds perturbation back through the posterior distribution $q(x_{t-1}|x_{t},x_{0})$
  1. Specifically, $p_{\theta}(x_{0}|x_{t})$ is an implicit distribution imposed by a neural network $f_{\theta}(x_{t},t)$ that outputs $x_{0}$ given $x_{t}$
  2. $x_{t-1}$ is then sampled from the posterior distribution $q(x_{t-1}|x_{t},x_{0})$ given $x_{t}$ and the predicted $x_{0}$
- The training loss is the mean squared error in data $x$ space, and training optimizes a random term of $t$ with stochastic gradient descent:
  (Eq. 9) $\mathcal{L}_{\theta}^{Gen}=\left|\left| x_{\theta}\left(\alpha_{t}x_{0}+\sqrt{1-\alpha^{2}_{t}}\epsilon\right)-x_{0}\right|\right|_{2}^{2},\,\,\, \epsilon\sim\mathcal{N}(0,I)$
  - Recent work parameterizes the denoising model by directly predicting $x_{0}$ with a neural network $f_{\theta}$
  - Such generator-based methods are effective at accelerating sampling from complex distributions
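
To make the relationship between (Eq. 8) and (Eq. 9) concrete, the sketch below converts between a noise prediction and the clean-data prediction it implies, using only the forward relation $x_{t}=\alpha_{t}x_{0}+\sigma_{t}\epsilon$. The scalar values, the 0.05 prediction error, and the helper names are hypothetical; this is plain algebra offered as intuition, not the paper's own argument.

```python
import math
import random

def eps_to_x0(x_t, eps_hat, alpha_t, sigma_t):
    """Clean data implied by a noise prediction: invert x_t = alpha_t*x0 + sigma_t*eps."""
    return (x_t - sigma_t * eps_hat) / alpha_t

def x0_to_eps(x_t, x0_hat, alpha_t, sigma_t):
    """Noise implied by a clean-data prediction."""
    return (x_t - alpha_t * x0_hat) / sigma_t

rng = random.Random(0)
x0, eps = 0.8, rng.gauss(0.0, 1.0)
alpha_t = 0.3                              # hypothetical high-noise step
sigma_t = math.sqrt(1.0 - alpha_t ** 2)
x_t = alpha_t * x0 + sigma_t * eps

eps_hat = eps + 0.05                       # pretend the epsilon-network is off by 0.05
x0_from_eps = eps_to_x0(x_t, eps_hat, alpha_t, sigma_t)

loss_grad = (eps_hat - eps) ** 2           # Eq. 8 for this single sample
loss_gen = (x0_from_eps - x0) ** 2         # Eq. 9 evaluated on the implied x0
print("eps-space error:", loss_grad, "x0-space error:", loss_gen)
```

The same $\epsilon$ error shows up in $x_{0}$ space scaled by $\sigma_{t}/\alpha_{t}$, which grows at noisy steps. A network trained directly in $x_{0}$ space is penalized on the quantity that the posterior $q(x_{t-1}|x_{t},x_{0})$ actually consumes.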

4. ProDiff

- Motivation

Diffusion models perform well in audio synthesis
- BUT, several difficulties remain for real-world deployment:
  1. Existing diffusion TTS models generally estimate the gradient of the data density with a score matching objective
    - As a result, high-quality synthesis requires a large number of iterations
  2. When refinement iterations are reduced, the complex data distribution degrades the convergence of the diffusion model
    - The denoising model then fails to produce deterministic values, so the predicted mel-spectrogram becomes blurry and over-smooth
- ProDiff uses two techniques to address this:
  1. Generator-based parameterization to speed up sampling
  2. Knowledge distillation to reduce data variance on the target side
- Together, these let ProDiff cut the number of sampling steps substantially while retaining high-quality synthesis

- Select a Teacher

The teacher model must support fast, high-quality, diverse TTS synthesis, and the distilled student should inherit those capabilities
- Based on the preceding analysis of diffusion parameterizations, the paper finds that a 4-step generator-based model gives the best trade-off between quality and speed
- ProDiff therefore uses a generator-based diffusion model $\theta$ with 4 diffusion steps as its teacher

- Distill from Teacher

The Denoising Diffusion Implicit Model (DDIM) uses the same training procedure as the Denoising Diffusion Probabilistic Model (DDPM) while accelerating inference through a non-Markovian process
- The DDIM sampler is used to directly predict a coarse-grained mel-spectrogram with reduced variance in the diffusion process
- ProDiff is first initialized as a copy of the teacher model, with the same parameters and model definition
  1. Data is sampled from the training set and noise is added to the original
  2. Instead of the original data $x_{0}$, the target value $\hat{x}_{0}$ for the denoising model is obtained by running 2 DDIM sampling steps with the teacher
  3. A single DDIM step of the student is trained to match 2 DDIM steps of the teacher, which halves the number of required steps
- Training the student model through knowledge distillation thus proceeds as in [Algorithm 1]
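
The target construction described above can be sketched on scalars. This assumes a progressive-distillation-style recipe: take two deterministic DDIM steps with the teacher, then solve for the $\hat{x}_{0}$ that would make one student DDIM step land on the same point. The exact form of [Algorithm 1] is not reproduced in this review, so the continuous-time schedule $\alpha(t)=\cos(0.5\pi t)$, $\sigma(t)=\sin(0.5\pi t)$, the toy teacher, and all function names below are assumptions.

```python
import math

def alpha(t):
    return math.cos(0.5 * math.pi * t)

def sigma(t):
    return math.sin(0.5 * math.pi * t)

def ddim_step(x_t, x0_hat, t, s):
    """Deterministic DDIM move from time t to s < t, given a clean-data prediction."""
    eps_hat = (x_t - alpha(t) * x0_hat) / sigma(t)
    return alpha(s) * x0_hat + sigma(s) * eps_hat

def toy_teacher(x_t, t):
    """Hypothetical teacher: the optimal x0-predictor when the data is N(0, 1)."""
    return alpha(t) * x_t

def distill_target(x_t, t, n_student_steps, teacher):
    """Two teacher DDIM steps, then the x0 target that one student step must predict."""
    t_mid = t - 0.5 / n_student_steps
    t_end = t - 1.0 / n_student_steps
    x_mid = ddim_step(x_t, teacher(x_t, t), t, t_mid)
    x_end = ddim_step(x_mid, teacher(x_mid, t_mid), t_mid, t_end)
    ratio = sigma(t_end) / sigma(t)
    target = (x_end - ratio * x_t) / (alpha(t_end) - ratio * alpha(t))
    return target, x_end, t_end

# 4-step teacher distilled into a 2-step student, starting from t = 1.
x_t, t = 1.3, 1.0
target, x_end, t_end = distill_target(x_t, t, n_student_steps=2, teacher=toy_teacher)
student_x_end = ddim_step(x_t, target, t, t_end)
print(x_end, student_x_end)
```

A student that predicts `target` reaches in one DDIM step the point the teacher reaches in two, which is how the step count is halved.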

- Architecture

The ProDiff architecture builds on FastSpeech2, a representative non-autoregressive TTS model
- As shown in the architecture figure, ProDiff consists of a phoneme encoder, a variance adaptor, and a spectrogram denoiser
  - The phoneme encoder converts the phoneme embedding sequence into a hidden sequence
  - The variance adaptor predicts the duration of each phoneme to regulate the hidden sequence length to the length of the speech frames, and incorporates variance such as pitch and energy
  - The spectrogram denoiser iteratively refines the length-regulated hidden sequence into a mel-spectrogram
- Encoder and Variance Adaptor
  1. The phoneme encoder consists of feed-forward transformer (FFT) blocks based on the transformer architecture
  2. The encoder comprises a pre-net, transformer blocks with multi-head self-attention, and a final linear projection layer
  3. In the variance adaptor, the duration, pitch, and energy predictors are based on a 2-layer 1D convolution network with ReLU activation
    - Each network is followed by layer normalization and a dropout layer, plus a linear layer that projects the hidden state to the output sequence
- Spectrogram Denoiser
  1. Following DiffSinger, a non-causal WaveNet architecture is adopted as the denoiser
  2. The decoder consists of a $1\times 1$ convolution layer and $N$ convolution blocks with residual connections, which project the input hidden sequence to 256 channels
    - The cosine schedule $\beta_{t}=\cos(0.5 \pi t)$ is used for every step $t$
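
The length regulation performed by the variance adaptor, described above, amounts to repeating each phoneme-level hidden vector for as many frames as its predicted duration. A minimal sketch follows, with plain lists in place of tensors; the function name and the toy values are assumptions.

```python
def length_regulate(hidden, durations):
    """Repeat each phoneme-level hidden vector by its duration (in frames)."""
    if len(hidden) != len(durations):
        raise ValueError("one duration per phoneme is required")
    frames = []
    for vec, dur in zip(hidden, durations):
        frames.extend([vec] * max(0, int(dur)))   # negative predictions are clipped to 0
    return frames

hidden = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]     # 3 phonemes, hidden size 2
durations = [2, 0, 3]                             # predicted frames per phoneme
frames = length_regulate(hidden, durations)
print(len(frames))  # 5
```

The resulting frame-level sequence is what the spectrogram denoiser is conditioned on.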

- Training Loss

ProDiff is optimized with the following objectives
- Sample Reconstruction Loss
  - Instead of the original clean data $x_{0}$, the reduced-variance target value $\hat{x}_{0}$ is obtained by running 2 DDIM sampling steps with the teacher:
  (Eq. 10) $\mathcal{L}_{\theta}=\left|\left| x_{\theta}\left(\alpha_{t}x_{0}+\sqrt{1-\alpha_{t}^{2}}\epsilon\right)-\hat{x}_{0}\right|\right|_{2}^{2},\,\,\,\epsilon\sim\mathcal{N}(0,I)$
- Structural Similarity Index (SSIM) Loss
  - SSIM is a perceptual metric that measures image quality by capturing structural information and texture
  - SSIM takes values between 0 and 1, where 1 indicates perfect perceptual quality
  - The SSIM loss used in ProDiff training is:
  (Eq. 11) $\mathcal{L}_{SSIM}=1-\mathrm{SSIM}\left(x_{\theta}\left(\alpha_{t}x_{0}+\sqrt{1-\alpha_{t}^{2}}\epsilon\right),\hat{x}_{0}\right)$
- Variance Reconstruction Loss
  - Acoustic variance information on pitch, duration, and energy is provided to improve the naturalness and expressiveness of speech
  - A variance reconstruction loss is added to train the acoustic generator:
  (Eq. 12) $\mathcal{L}_{p}=||p-\hat{p}||^{2}_{2}, \,\, \mathcal{L}_{e}=||e-\hat{e}||_{2}^{2},\,\, \mathcal{L}_{dur}=||d-\hat{d}||_{2}^{2}$
  - $d, e, p$ : target duration, energy, and pitch, respectively; $\hat{d},\hat{e},\hat{p}$ : predicted duration, energy, and pitch, respectively
  - All loss weights are set to 0.1
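
A rough sketch of how (Eq. 10) to (Eq. 12) might be combined is given below. Several simplifications are assumed: inputs are flat lists scaled to $[0, 1]$, SSIM is computed once over the whole input rather than over sliding windows, losses are averaged instead of summed, and the 0.1 weight is applied to the variance terms (one reading of the weight note above). The function names are hypothetical.

```python
def mse(pred, target):
    """Mean squared error (a rescaled version of the squared L2 norm)."""
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(pred)

def global_ssim(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    """Single-window SSIM over the whole input; c1, c2 are the usual constants for range 1."""
    n = len(x)
    mu_x, mu_y = sum(x) / n, sum(y) / n
    var_x = sum((v - mu_x) ** 2 for v in x) / n
    var_y = sum((v - mu_y) ** 2 for v in y) / n
    cov = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n
    return ((2 * mu_x * mu_y + c1) * (2 * cov + c2)) / (
        (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2))

def prodiff_loss(mel_pred, mel_target, var_pred, var_target, w_var=0.1):
    """Eq. 10 + Eq. 11 + weighted Eq. 12; var_* map 'p', 'e', 'dur' to lists."""
    l_rec = mse(mel_pred, mel_target)                     # mel_target plays the role of x0_hat
    l_ssim = 1.0 - global_ssim(mel_pred, mel_target)
    l_var = sum(mse(var_pred[k], var_target[k]) for k in ("p", "e", "dur"))
    return l_rec + l_ssim + w_var * l_var

mel = [0.1, 0.5, 0.9, 0.3]
targets = {"p": [1.0, 2.0], "e": [0.5, 0.5], "dur": [3.0, 4.0]}
print(prodiff_loss(mel, mel, targets, targets))           # identical inputs give ~0
```

In a real training loop `mel_target` would be the teacher-generated $\hat{x}_{0}$ rather than the ground-truth mel-spectrogram.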

- Training and Inference Procedures

Training and sampling of ProDiff follow [Algorithm 1] and [Algorithm 2], respectively
- Training : the final loss for ProDiff training consists of the following terms
  1. Sample reconstruction loss $\mathcal{L}_{\theta}$ : MSE between the predicted mel-spectrogram and the target $\hat{x}_{0}$, following (Eq. 10)
  2. Structural Similarity Index (SSIM) loss $\mathcal{L}_{SSIM}$ : one minus the SSIM index between the predicted mel-spectrogram and the target $\hat{x}_{0}$, following (Eq. 11)
  3. Variance reconstruction loss $\mathcal{L}_{dur}, \mathcal{L}_{p},\mathcal{L}_{e}$ : MSE between the predicted and ground-truth phoneme-level duration, pitch spectrogram, and energy, following (Eq. 12)
- Inference : ProDiff iteratively predicts the unperturbed $x_{0}$ and adds perturbation through the posterior distribution, producing a high-fidelity mel-spectrogram
  1. The denoising model $f_{\theta}(x_{t}|t,c)$ first predicts $\hat{x}_{0}$
  2. Given $x_{t}$ and the predicted $\hat{x}_{0}$, $x_{t-1}$ is sampled from the posterior distribution $q(x_{t-1}|x_{t},x_{0})$
  3. The final spectrogram $x_{0}$ is converted to a waveform by a pre-trained vocoder
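
The inference loop above can be sketched on scalars. The posterior used here is the standard DDPM Gaussian posterior rewritten in the notation of (Eq. 1): mean $\frac{\alpha_{t-1}\beta_{t}}{\sigma_{t}^{2}}\hat{x}_{0}+\frac{\sqrt{1-\beta_{t}}\,\sigma_{t-1}^{2}}{\sigma_{t}^{2}}x_{t}$ and variance $\frac{\sigma_{t-1}^{2}}{\sigma_{t}^{2}}\beta_{t}$. The 4-step schedule values, the toy denoiser, and the omission of the text condition $c$ and of the vocoder are assumptions for illustration; this is not a reproduction of [Algorithm 2].

```python
import math
import random

def make_constants(betas):
    """alpha_t and sigma_t of Eq. 1 for t = 0..T, with alpha_0 = 1 and sigma_0 = 0."""
    alphas = [1.0]
    for beta in betas:
        alphas.append(alphas[-1] * math.sqrt(1.0 - beta))
    sigmas = [math.sqrt(max(0.0, 1.0 - a * a)) for a in alphas]
    return alphas, sigmas

def posterior(x_t, x0_hat, t, betas, alphas, sigmas):
    """Mean and variance of q(x_{t-1} | x_t, x0_hat)."""
    beta_t = betas[t - 1]
    var_t, var_prev = sigmas[t] ** 2, sigmas[t - 1] ** 2
    mean = ((alphas[t - 1] * beta_t / var_t) * x0_hat
            + (math.sqrt(1.0 - beta_t) * var_prev / var_t) * x_t)
    return mean, beta_t * var_prev / var_t

def sample(denoiser, betas, rng):
    """Predict x0, then re-perturb through the posterior, for t = T..1."""
    alphas, sigmas = make_constants(betas)
    x = rng.gauss(0.0, 1.0)                        # x_T ~ N(0, 1)
    for t in range(len(betas), 0, -1):
        x0_hat = denoiser(x, t)                    # step 1: predict clean data
        mean, var = posterior(x, x0_hat, t, betas, alphas, sigmas)
        x = mean + math.sqrt(var) * rng.gauss(0.0, 1.0)   # step 2: sample x_{t-1}
    return x                                       # x_0; a vocoder would follow

betas = [0.1, 0.3, 0.6, 0.9]                       # hypothetical 4-step schedule
alphas, _ = make_constants(betas)

def toy_denoiser(x_t, t):
    """Stand-in for f_theta: the optimal x0-predictor when the data is N(0, 1)."""
    return alphas[t] * x_t

x0 = sample(toy_denoiser, betas, random.Random(0))
print(x0)
```

At $t=1$ the posterior variance is zero and its mean equals $\hat{x}_{0}$, so the final output is exactly the last clean-data prediction.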

5. Experiments

- Settings

Dataset : LJSpeech
Comparisons : Tacotron2, FastSpeech2, GANSpeech, Glow-TTS, Grad-TTS, DiffSinger

- Results

Preliminary Analyses on Diffusion Parameterization
- Comparing the gradient-based and generator-based diffusion parameterizations
- Across a wide range of noise schedules, the generator-based diffusion model synthesizes high-quality samples
- With fewer iterative steps $(T\leq 16)$, the gradient-based model degrades because of perceivable background noise
- In contrast, the generator-based model maintains sample quality as iterations decrease
  - The generator-based model does not estimate the gradient of the data density, so it only needs to predict the unperturbed $x_{0}$ and then apply perturbation through the posterior distribution $q(x_{t-1}|x_{t}, x_{0})$
  - That is, the generator-based model avoids quality degradation when sampling from complex distributions is accelerated

Comparison of Gradient-based and Generator-based Diffusion Models

Performances
- ProDiff performs well across audio quality, sampling speed, and diversity

Visualizations
- Inspecting the synthesized mel-spectrograms
- Non-probabilistic models such as Tacotron2 and FastSpeech2 produce blurry, over-smooth mel-spectrograms
- GAN-based models suffer from mode collapse
- Diffusion probabilistic models produce mel-spectrograms with rich frequency detail
  - ProDiff in particular achieves better results through knowledge distillation

Progressive Diffusion
- Both the ProDiff teacher and ProDiff predict the unperturbed $x_{0}$ and then add perturbation using the posterior distribution
- This yields better results with fewer reverse iterations

(a) Latency versus mel-spectrogram sequence length (b) MCD versus number of reverse denoising iterations

Ablation Study
- Under a limited number of diffusion iterations, replacing ProDiff's generator-based parameterization with a gradient-based one degrades performance
- Removing distillation and using clean data as the training target produces over-smooth predictions
- Distilling from a 4-step teacher gives the best trade-off between computation cost and quality
