[Paper Review] FastDiff: A Fast Conditional Diffusion Model for High-Quality Speech Synthesis

feVeRin 2024. 4. 27. 10:41

FastDiff: A Fast Conditional Diffusion Model for High-Quality Speech Synthesis

Denoising Diffusion Probabilistic Models achieve excellent synthesis quality, but their iterative sampling process limits inference speed
FastDiff
- A fast conditional diffusion model for high-quality speech synthesis
- Uses a stack of time-aware location-variable convolutions with diverse receptive field patterns to model long-term dependencies with adaptive conditions
- Introduces a noise schedule predictor to reduce the number of sampling steps while preserving quality
Paper (IJCAI 2022) : Paper Link

1. Introduction

Various approaches that use generative models as vocoders for speech synthesis have been proposed
- A speech synthesis model should satisfy both of the following
  1. High-Quality : it should synthesize high-quality speech even at high sampling rates
    - That is, it should reconstruct details at different time scales for waveforms with highly variable patterns
  2. Fast : it should run fast enough for real-time use
- BUT, autoregressive models such as WaveNet achieve high-quality synthesis only at a considerable computational cost
  - Non-autoregressive alternatives such as flow-based and GAN-based models are limited in sample diversity
- Meanwhile, the recent Denoising Diffusion Probabilistic Model (DDPM) delivers top synthesis quality, but applying it to audio synthesis faces the following difficulties
  1. Unlike conventional generative models, a diffusion model is not trained to directly minimize the difference between the generated audio and the reference
    - Instead, it aims to denoise a noisy sample given the optimal gradient
    - As a result, natural voice characteristics such as breathiness and vocal fold closure may be overly denoised
  2. Guaranteeing DDPM synthesis quality requires hundreds of denoising steps or more
    - Reducing the sampling steps can degrade quality through perceivable background noise

-> Therefore, FastDiff is proposed: a conditional diffusion model that supports high-quality speech synthesis while running fast

FastDiff
- To improve audio quality, adopts Time-Aware Location-Variable Convolutions with diverse receptive field patterns to model long-term time dependencies with adaptive conditions
- To accelerate inference, introduces a noise schedule predictor that greatly reduces the number of denoising steps
- Additionally, extends FastDiff to FastDiff-TTS, an end-to-end model for the text-to-speech (TTS) task

< Overview of FastDiff >

A stack of Time-Aware Location-Variable Convolutions models long-term dependencies with adaptive conditions
A noise schedule predictor reduces the sampling steps while preserving quality
As a result, FastDiff achieves faster inference than existing models while maintaining higher quality

2. Background: Denoising Diffusion Probabilistic Model

The Denoising Diffusion Probabilistic Model (DDPM) is a likelihood-based generative model with proven high-quality synthesis performance
- The basic idea of DDPM is to train a gradient neural network to reverse a diffusion process
- That is, given $i.i.d.$ samples $\{x_{0} \in \mathbb{R}^{D}\}$ from an unknown data distribution $p_{data}(x_{0})$, DDPM aims to approximate $p_{data}(x_{0})$ with the marginal distribution $p_{\theta}(x_{0}) = \int \cdots \int p(x_{T}) \prod_{t=1}^{T} p_{\theta}(x_{t-1} | x_{t}) \, dx_{1} \cdots dx_{T}$
  1. Here, the diffusion process on a data distribution $q(x_{0})$ is defined as a fixed Markov chain from the data $x_{0}$ to the latent variable $x_{T}$
  2. For a small positive constant $\beta_{t}$, a small Gaussian noise is added to $x_{t-1}$ to obtain $x_{t}$, following $q(x_{t} | x_{t-1})$
  3. As a result, the whole diffusion process gradually converts the data $x_{0}$ into the whitened latent $x_{T}$ according to the fixed noise schedule $\beta_{1}, ..., \beta_{T}$
- On the other hand, the reverse process is a Markov chain from $x_{T}$ to $x_{0}$ parameterized by shared $\theta$, which recovers samples by removing, at each iteration, the Gaussian noise added by the diffusion process
  - This lets DDPM synthesize high-quality samples, but reverse sampling requires thousands of iterative steps to reconstruct the target distribution
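
For reference, the forward (diffusion) process summarized above can be written out explicitly. The block below is the standard Gaussian DDPM parameterization, not a formula quoted from this post; it assumes the BDDM-style convention in which $\alpha_{t}$ denotes the cumulative signal level, which is the same $\alpha_{t}$ that appears in the noise predictor loss later on.

```latex
\begin{align}
q(x_{t} \mid x_{t-1}) &= \mathcal{N}\left(x_{t};\ \sqrt{1-\beta_{t}}\, x_{t-1},\ \beta_{t} I\right) \\
\alpha_{t} &= \prod_{i=1}^{t} \sqrt{1-\beta_{i}} \\
q(x_{t} \mid x_{0}) &= \mathcal{N}\left(x_{t};\ \alpha_{t} x_{0},\ (1-\alpha_{t}^{2}) I\right) \\
x_{t} &= \alpha_{t} x_{0} + \sqrt{1-\alpha_{t}^{2}}\, \epsilon_{t}, \qquad \epsilon_{t} \sim \mathcal{N}(0, I)
\end{align}
```

The last line is why a noisy sample at any step can be drawn in one shot during training, and why $\alpha_{t}$ shrinks toward 0 (a whitened latent) as more $\beta_{i}$ are accumulated.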

3. FastDiff

- Motivation

DDPM can generate high-quality samples
- BUT, several difficulties remain for audio generation
  1. The model captures the dynamic dependencies of noisy audio rather than clean audio, so it has to account for more variation information beyond spectrogram fluctuation
  2. With a limited receptive field, reducing the reverse iterations is likely to degrade performance
    - As a result, accelerating inference by cutting iteration steps is difficult
- FastDiff therefore introduces the following two components to address these problems
  1. Time-Aware Location-Variable Convolution
    - Captures the details of noisy samples under dynamic dependencies
    - The convolution is conditioned on the dynamic variations of speech, including the diffusion step and spectrogram fluctuation, which gives the model diverse receptive field patterns and makes it robust to reverse acceleration
  2. Noise Schedule Predictor
    - Reduces the reverse iterations to accelerate inference
- Time-Aware Location-Variable Convolution

Compared with conventional convolution networks, location-variable convolution models the long-term dependency of audio more efficiently
- Building on this, FastDiff introduces the Time-Aware Location-Variable Convolution, which is sensitive to the time step of the diffusion model
  1. At time step $t$, the step index is embedded into a 128-dimensional positional embedding (PE) vector $e_{t}$:
    $e_{t} = \left[\sin\left(10^{\frac{0 \times 4}{63}} t\right), ..., \sin\left(10^{\frac{63 \times 4}{63}} t\right), \cos\left(10^{\frac{0 \times 4}{63}} t\right), ..., \cos\left(10^{\frac{63 \times 4}{63}} t\right)\right]$
  2. In the Time-Aware Location-Variable Convolution, FastDiff needs multiple predicted variation-sensitive kernels to perform convolution on the associated intervals of the input sequence
    - These kernels should be time-aware and sensitive to the variations of noisy audio, including the diffusion step and acoustic features
- Accordingly, as in (b) and (c) of the architecture figure, a kernel predictor is combined to build the Time-Aware Location-Variable Convolution (LVC) module
  1. For the $q$-th time-aware LVC layer, the input $x_{t} \in \mathbb{R}^{D}$ is split with an $M$-length window with $3^{q}$ dilation, producing $K$ segments $x_{t}^{k} \in \mathbb{R}^{M}$:
    (Eq. 1) $\{x_{t}^{1}, ..., x_{t}^{K}\} = \text{split}(x_{t}; M, q)$
  2. Next, convolution is performed on the associated intervals of the input sequence using the kernels generated by the kernel predictor $\alpha$:
    (Eq. 2) $\{F_{t}, G_{t}\} = \alpha(t, c)$
    (Eq. 3) $z_{t}^{k} = \tanh(F_{t} * x_{t}^{k}) \odot \sigma(G_{t} * x_{t}^{k})$
    (Eq. 4) $z_{t} = \text{concat}(\{z_{t}^{1}, ..., z_{t}^{K}\})$
    - $F_{t}, G_{t}$ : the filter kernel and gate kernel for $x_{t}^{k}$, $*$ : 1D convolution
    - $\odot$ : element-wise product, $\text{concat}(\cdot)$ : concatenation between vectors
- Because these time-aware kernels adapt to the noise level and depend on the acoustic features, FastDiff can estimate the denoising gradient accurately and quickly even from noisy input
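
The step embedding and the gated convolution in Eq. 1-4 can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the authors' implementation: `toy_kernel_predictor` is a hypothetical stand-in for the trained kernel predictor, the condition is assumed to be one scalar per segment, and the $3^{q}$ dilation of Eq. 1 is omitted for brevity.

```python
import math

def step_embedding(t, half_dim=64, max_exponent=4.0):
    """128-dim sinusoidal embedding e_t: 64 sines followed by 64 cosines."""
    freqs = [10.0 ** (i * max_exponent / (half_dim - 1)) for i in range(half_dim)]
    return [math.sin(f * t) for f in freqs] + [math.cos(f * t) for f in freqs]

def split(x, window):
    """Eq. 1 without dilation: cut x into consecutive windows of length M."""
    return [x[i:i + window] for i in range(0, len(x), window)]

def conv1d_same(x, kernel):
    """Zero-padded 1D convolution (cross-correlation form) that keeps len(x)."""
    pad = len(kernel) // 2
    padded = [0.0] * pad + list(x) + [0.0] * pad
    return [sum(k * padded[i + j] for j, k in enumerate(kernel))
            for i in range(len(x))]

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def toy_kernel_predictor(e_t, c_k, size=3):
    """Hypothetical stand-in for the kernel predictor of Eq. 2.

    A real predictor is a trained network; this one only makes the kernels
    depend on both the step embedding and the condition value.
    """
    s = sum(e_t) / len(e_t) + c_k
    filter_kernel = [math.tanh(s + j) for j in range(size)]
    gate_kernel = [math.tanh(s - j) for j in range(size)]
    return filter_kernel, gate_kernel

def time_aware_lvc(x_t, t, cond, window):
    """Gated location-variable convolution over the segments of x_t."""
    e_t = step_embedding(t)
    z_t = []
    for x_k, c_k in zip(split(x_t, window), cond):
        f_k, g_k = toy_kernel_predictor(e_t, c_k)
        filt = conv1d_same(x_k, f_k)
        gate = conv1d_same(x_k, g_k)
        # Eq. 3 (gated activation), appended in order to realize Eq. 4 (concat)
        z_t.extend(math.tanh(a) * sigmoid(b) for a, b in zip(filt, gate))
    return z_t

x_t = [0.1 * i for i in range(8)]
z_t = time_aware_lvc(x_t, t=3, cond=[0.5, -0.5], window=4)
print(len(step_embedding(3)), len(z_t))
```

Because the kernels are regenerated per segment from the step and the condition, the same layer applies different filters at different positions and noise levels, which is the property the section attributes to the time-aware LVC.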

- Accelerated Sampling

Noise Predictor
- To avoid hundreds to thousands of sampling steps, FastDiff adopts the noise scheduling algorithm from the Bilateral Denoising Diffusion Model (BDDM) to predict a sampling schedule that is shorter than the training noise schedule
  - This scheduling method outperforms the grid search in WaveGrad and the fast sampling in DiffWave
- Here, the noise predictor iteratively derives a continuous noise schedule $\hat{\beta} \in \mathbb{R}^{T_{m}}$
  1. Specifically, BDDM introduces a tighter Evidence Lower BOund (ELBO) for noise schedule prediction
    - That is, given a learned diffusion network $\theta$, the scheduling network $\phi$ aims to reduce the gap between the surrogate objectives
    - An efficient $N$-step noise schedule such as $\hat{\beta}$ can then be derived from a well-learned noise scheduling network $\phi$
  2. Consequently, the loss function for training the noise schedule predictor $\phi$ is built from the KL divergence between the forward and reverse distributions:
    (Eq. 5) $\mathcal{L}_{\phi} = \frac{1}{2(1 - \beta_{t} - \alpha_{t}^{2})} \left\| \sqrt{1 - \alpha_{t}^{2}} \, \epsilon_{t} - \frac{\beta_{t}}{\sqrt{1 - \alpha_{t}^{2}}} \epsilon_{\theta}(x_{t}, \alpha_{t}) \right\|_{2}^{2} + C_{t}$
    - $C_{t} = \frac{1}{4} \log \frac{1 - \alpha_{t}^{2}}{\beta_{t}} + \frac{D}{2} \left( \frac{\beta_{t}}{1 - \alpha_{t}^{2}} - 1 \right)$ : a constant that is ignored during training
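
As a worked check of Eq. 5, the sketch below evaluates the weighted squared-error term and $C_{t}$ for a single step in plain Python. The function name `noise_predictor_loss` and the input values are hypothetical; in practice $\beta_{t}$ would come from the scheduling network $\phi$ and $\epsilon_{\theta}$ from the trained diffusion network.

```python
import math

def noise_predictor_loss(beta_t, alpha_t, eps_t, eps_theta):
    """Return (weighted_term, c_t) of Eq. 5 for one diffusion step.

    eps_t is the true noise and eps_theta the network estimate, both given
    as plain lists of length D.
    """
    one_minus_a2 = 1.0 - alpha_t ** 2
    gap = one_minus_a2 - beta_t  # 1 - beta_t - alpha_t^2, must stay positive
    if beta_t <= 0.0 or gap <= 0.0:
        raise ValueError("need 0 < beta_t < 1 - alpha_t^2")
    root = math.sqrt(one_minus_a2)
    sq_norm = sum((root * e - (beta_t / root) * p) ** 2
                  for e, p in zip(eps_t, eps_theta))
    weighted_term = sq_norm / (2.0 * gap)
    dim = len(eps_t)
    c_t = (0.25 * math.log(one_minus_a2 / beta_t)
           + 0.5 * dim * (beta_t / one_minus_a2 - 1.0))
    return weighted_term, c_t

# Hypothetical values: alpha_t^2 = 0.5, beta_t = 0.2, perfect noise estimate.
term, c_t = noise_predictor_loss(0.2, math.sqrt(0.5), [1.0, 0.0], [1.0, 0.0])
print(round(term, 6), round(c_t, 6))
```

With these inputs the squared norm is $(0.5 - 0.2)^{2} / 0.5 = 0.18$ and the prefactor is $1 / 0.6$, so the weighted term comes out at 0.3. The `ValueError` guard reflects that the prefactor is only defined while $\beta_{t} < 1 - \alpha_{t}^{2}$.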

Schedule Alignment
- FastDiff uses $T = 1000$ discrete time steps during training
- When sampling has to be conditioned on $t$ with $N \ll T$ steps, the $T_{m}$ discrete time indices are approximated by aligning the $T_{m}$-step sampling noise schedule $\hat{\beta}$ to the $T$-step training noise schedule $\beta$
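
One way to picture schedule alignment is nearest-noise-level matching: compute the cumulative noise level reached after each sampling step and look up the training step with the closest level. The sketch below assumes that interpretation; the linear training schedule from 1e-4 to 0.02 and the 4-step sampling schedule are hypothetical values chosen for illustration, not numbers taken from the post.

```python
import math

def linear_betas(num_steps=1000, start=1e-4, end=0.02):
    """Hypothetical T-step linear training noise schedule."""
    step = (end - start) / (num_steps - 1)
    return [start + i * step for i in range(num_steps)]

def noise_levels(betas):
    """Cumulative signal level alpha_t = prod_i sqrt(1 - beta_i)."""
    levels, prod = [], 1.0
    for beta in betas:
        prod *= math.sqrt(1.0 - beta)
        levels.append(prod)
    return levels

def align_schedule(sampling_betas, training_betas):
    """Map each sampling step to the training index with the closest level."""
    train_levels = noise_levels(training_betas)
    aligned = []
    for level in noise_levels(sampling_betas):
        nearest = min(range(len(train_levels)),
                      key=lambda t: abs(train_levels[t] - level))
        aligned.append(nearest)
    return aligned

sampling_betas = [1e-4, 1e-2, 0.3, 0.9]  # hypothetical 4-step schedule
indices = align_schedule(sampling_betas, linear_betas())
print(indices)
```

The returned indices (0-based here) are what a network trained on discrete steps would receive as its time condition during the shortened sampling run. An implementation could also interpolate between the two neighboring training steps instead of rounding to the nearest one.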

- Training, Noise Scheduling and Sampling

As in [Algorithm 1], FastDiff
- parameterizes the following two modules separately:
  1. Iterative refinement model $\theta$ : minimizes the variational bound of the score function
  2. Noise predictor $\phi$ : optimizes the noise schedule for a tighter evidence lower bound
- At inference, as in [Algorithm 3],
  1. First, a one-shot noise scheduling procedure yields a tighter, efficient noise schedule $\hat{\beta}$
    - This accelerates sampling, and the searched noise schedule is robust enough to maintain high-quality generation
  2. Then, schedule alignment maps the continuous noise schedule to the discrete time indices $T_{m}$
  3. Finally, Gaussian noise is iteratively refined to generate a high-quality sample
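
The final refinement step can be sketched as a short reverse loop over the predicted schedule. This is a minimal sketch using the standard DDPM ancestral sampling update, not the listing from the paper: the acoustic condition is omitted, the 4-step schedule is hypothetical, and `make_oracle` is a hypothetical noise estimator that is exact when all data equals one known target, used only so that the loop can be checked.

```python
import math
import random

def noise_levels(betas):
    """Cumulative signal level alpha_n = prod_i sqrt(1 - beta_i)."""
    levels, prod = [], 1.0
    for beta in betas:
        prod *= math.sqrt(1.0 - beta)
        levels.append(prod)
    return levels

def reverse_sample(betas, eps_model, dim, seed=0):
    """Refine Gaussian noise into a sample over len(betas) reverse steps."""
    rng = random.Random(seed)
    alphas = noise_levels(betas)
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]  # start from x_T ~ N(0, I)
    for n in reversed(range(len(betas))):
        beta, alpha = betas[n], alphas[n]
        eps = eps_model(x, alpha)
        scale = beta / math.sqrt(1.0 - alpha ** 2)
        mean = [(xi - scale * ei) / math.sqrt(1.0 - beta)
                for xi, ei in zip(x, eps)]
        if n > 0:
            # Posterior standard deviation of the standard DDPM sampler
            alpha_prev = alphas[n - 1]
            sigma = math.sqrt((1.0 - alpha_prev ** 2) / (1.0 - alpha ** 2) * beta)
            x = [m + sigma * rng.gauss(0.0, 1.0) for m in mean]
        else:
            x = mean  # no noise is added at the last step
    return x

def make_oracle(target):
    """Hypothetical noise estimator, exact when the data is always `target`."""
    def eps_model(x, alpha):
        root = math.sqrt(1.0 - alpha ** 2)
        return [(xi - alpha * ti) / root for xi, ti in zip(x, target)]
    return eps_model

target = [0.3, -0.2, 0.5]
sample = reverse_sample([1e-4, 1e-2, 0.3, 0.9], make_oracle(target), dim=3)
print([round(v, 6) for v in sample])
```

With the oracle estimator the last step removes the remaining noise exactly, so the loop returns the target. A trained network would take the place of `eps_model` and would additionally receive the aligned step index and the acoustic condition.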

- FastDiff-TTS

Conventional text-to-speech (TTS) models use a two-stage pipeline of an acoustic module and a vocoder
- To simplify this TTS pipeline, FastDiff is extended into FastDiff-TTS, a fully end-to-end model that uses no intermediate features
- That is, FastDiff-TTS aims to generate the waveform directly from context such as phonemes, without explicitly generating a mel-spectrogram
- Architecture
  1. The FastDiff-TTS architecture uses FastSpeech2, a non-autoregressive TTS model, as its backbone
  2. First, the encoder converts the phoneme embedding sequence into a phoneme hidden sequence, and the duration predictor expands the encoder output to match the length of the desired waveform output
  3. Given the aligned sequence, the variance adaptor adds pitch information to the hidden sequence
  4. Finally, FastDiff is used as the vocoder to convert the adapted hidden sequence into a speech waveform
- Training Loss
  1. Unlike existing TTS models, FastDiff-TTS needs no additional loss or adversarial training to improve sample quality
    - This greatly simplifies the TTS pipeline
  2. The final training loss of FastDiff-TTS consists of the following terms:
    - Duration prediction loss $\mathcal{L}_{dur}$ : mean squared error between the predicted word-level duration and the ground truth in log scale
    - Diffusion loss $\mathcal{L}_{diff}$ : mean squared error between the estimated noise and the Gaussian noise
    - Pitch reconstruction loss $\mathcal{L}_{pitch}$ : mean squared error between the predicted pitch sequence and the ground truth
  3. The pitch reconstruction loss $\mathcal{L}_{pitch}$ helps handle the one-to-many mapping problem in TTS
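
The three loss terms can be combined as in the following minimal sketch. Two details are assumptions rather than statements from the post: the terms are summed with equal weights, and the log-scale duration target is computed as log(d + 1), a convention common in FastSpeech-style implementations. The function names are hypothetical.

```python
import math

def mse(pred, target):
    """Mean squared error between two equal-length sequences."""
    if len(pred) != len(target):
        raise ValueError("sequences must have the same length")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def fastdiff_tts_loss(pred_log_dur, true_dur, pred_noise, true_noise,
                      pred_pitch, true_pitch):
    """Sum of duration, diffusion, and pitch MSE terms (equal weights assumed)."""
    # Assumption: durations are compared in log scale as log(d + 1)
    log_dur_target = [math.log(d + 1.0) for d in true_dur]
    l_dur = mse(pred_log_dur, log_dur_target)
    l_diff = mse(pred_noise, true_noise)
    l_pitch = mse(pred_pitch, true_pitch)
    return {"dur": l_dur, "diff": l_diff, "pitch": l_pitch,
            "total": l_dur + l_diff + l_pitch}

losses = fastdiff_tts_loss(
    pred_log_dur=[math.log(4.0), math.log(2.0)], true_dur=[3, 1],
    pred_noise=[1.0, 2.0], true_noise=[1.0, 4.0],
    pred_pitch=[100.0, 120.0], true_pitch=[100.0, 120.0],
)
print(losses)
```

In this toy call only the diffusion term is nonzero, which shows how each component contributes independently to the total that would be backpropagated through the whole end-to-end model.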

4. Experiments

- Settings

Dataset : LJSpeech, VCTK
Comparisons : WaveNet, WaveGlow, HiFi-GAN, UnivNet, DiffWave, WaveGrad

- Results

Comparison with Other Models
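- In terms of MOS, FastDiff achieves the highest quality with a gap of only 0.24 from the ground truth, and also shows considerable improvements in PESQ and STOI
- In terms of inference speed, unlike other diffusion architectures, FastDiff generates high-quality speech with only 4 reverse steps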

Ablation Study
- Replacing the time-aware location-variable convolution in FastDiff with a regular convolution substantially degrades both sampling speed and quality
- Using grid search instead of the noise predictor also degrades audio quality
- On the other hand, FastDiff synthesizes higher-quality samples when discrete time steps are used as the condition

Generalization to Unseen Speakers
- Mel-spectrogram inversion is evaluated on unseen speakers
- FastDiff shows the best out-of-domain generalization among the compared models

End-to-End Text-to-Speech
- FastDiff-TTS, the extension of FastDiff to the TTS task, is compared with existing TTS models
- FastDiff-TTS achieves a better MOS than existing TTS models such as FastSpeech2
