[Paper Review] CM-TTS: Enhancing Real Time Text-to-Speech Synthesis Efficiency through Weighted Samplers and Consistency Models

feVeRin 2024. 5. 25. 12:51

CM-TTS: Enhancing Real Time Text-to-Speech Synthesis Efficiency through Weighted Samplers and Consistency Models

Diffusion models can synthesize high-fidelity speech in Text-to-Speech, but their multi-step sampling limits real-time synthesis
Combining a GAN with a diffusion model to approximate the denoising distribution speeds up inference, but adversarial training makes the model hard to converge
CM-TTS
- Built on Consistency Models (CM), it supports high-quality speech synthesis from a continuous-time diffusion model in fewer steps, without adversarial training or a dependency on pre-trained models
- It additionally introduces a weighted sampler that uses dynamic probabilities to expose the model to diverse sampling positions, ensuring unbiased learning during training
Paper (NAACL 2024) : Paper Link

1. Introduction

Text-to-Speech (TTS) models broadly follow either an autoregressive (AR) or a non-autoregressive (NAR) approach
- AR frameworks generate the spectrogram sequentially using an attention-based RNN model
  - They offer stable synthesis, but prediction errors accumulate and inference is slow
- NAR frameworks, in contrast, use a parallel feed-forward network to reduce complexity and support real-time synthesis
  - GANs and flow-based models have been used successfully in TTS
  - Recent diffusion models in particular achieve high-quality synthesis through a forward diffusion process that adds noise and a parameterized reverse process that denoises iteratively
- BUT, despite their high-quality TTS, diffusion models are fundamentally limited in efficiency by the multi-step iterative sampling that their Markov chain requires
  1. CoMoSpeech addresses this problem by introducing a consistency model
    - It synthesizes speech in a single diffusion step using a model distilled from a well-designed diffusion-based teacher model under a consistency constraint
    - BUT, because it relies on distillation from a teacher model, the training pipeline is complicated and the constraint is hard to extend to multi-speaker settings
  2. DiffGAN-TTS, on the other hand, reduces the number of sampling steps by integrating a GAN into the diffusion model
    - BUT, the additional training of a discriminator makes the model hard to converge
  3. Inference can also be accelerated with a shallow diffusion mechanism, as in DiffSinger
    - BUT, adding a pre-trained model complicates the architecture

-> The paper therefore proposes CM-TTS, a consistency model that speeds up diffusion TTS inference without relying on distillation from a teacher model

CM-TTS
- Building on continuous-time diffusion and consistency models, it frames speech synthesis as a generative consistency procedure and achieves strong synthesis quality even with a single step
  - This removes the dependency on adversarial training and auxiliary pre-trained models found in existing diffusion TTS models
- It additionally introduces a weighted sampler to improve training efficiency and mitigate sampling bias

< Overview of CM-TTS >

Efficient real-time, few-step iterative generation based on a consistency model
Single-step generation is supported, and the dependency on additional adversarial training and pre-trained models is removed
A weighted sampler that adjusts the weights of different sampling points improves the training process
As a result, it achieves strong quality in zero-shot and few-shot TTS while synthesizing quickly

2. Background: Consistency Models

Diffusion models work by sequentially adding Gaussian noise to the target dataset and then running a reverse denoising process
- This iterative method is designed to generate samples from an initial noise state so that the intrinsic structure of the data is captured effectively
- First, for a time constant $T$, consider a noisy data sequence $\{x_{t}\}_{t \in [0, T]}$
  1. The diffusion process can then be expressed as the following stochastic differential equation (SDE):
    (Eq. 1) $d\mathbf{x}_{t} = \mu(\mathbf{x}_{t}, t) dt + \sigma(t) d\mathbf{w}_{t}$
  2. Here $p_{0}(\mathbf{x}) \equiv p_{data}(\mathbf{x})$, and $p_{T}(\mathbf{x})$ is approximated by a Gaussian distribution
    - $t \in [0, T]$ : index of the forward diffusion time step
    - $\mu(\cdot, \cdot), \sigma(\cdot)$ : drift and diffusion coefficients, respectively
    - $\{\mathbf{w}_{t}\}_{t \in [0, T]}$ : standard Brownian motion
- The SDE has a well-defined reverse process in the form of a probability flow ODE, whose trajectory sampled at time $t$ follows the distribution $p_{t}(\mathbf{x}_{t})$:
  (Eq. 2) $d\mathbf{x}_{t} = \left[ \mu(\mathbf{x}_{t}, t) - \frac{1}{2} \sigma(t)^{2} \nabla \log p_{t}(\mathbf{x}_{t}) \right] dt$
  - $\nabla \log p_{t}(\mathbf{x}_{t})$ : score function
- In a diffusion model, the forward step shifts samples away from the data distribution according to the noise level, and the backward step guides samples back toward the expected data distribution
  1. In particular, the probability flow ODE of (Eq. 2) used for sample generation relies on the score function $\nabla \log p_{t}(\mathbf{x}_{t})$
  2. To obtain the score function, the denoising error $\| f(x_{t}, t) - x \|^{2}$ must be minimized:
    (Eq. 3) $\nabla \log p_{t}(\mathbf{x}_{t}) = \frac{f(x_{t}, t) - x_{t}}{\sigma(t)^{2}}$
    - $f(x_{t}, t)$ : denoiser function that refines the sample $x_{t}$ at step $t$
  3. Sampling from the probability flow ODE follows a 2-step approach
    - First, a sample is drawn from the noise distribution
    - Then a numerical ODE solver such as Euler or Heun carries out the denoising process
- BUT, sampling with such an ODE solver requires many iterations, so inference is slow
  1. To accelerate diffusion sampling, the following consistency property is introduced for every time step $t$ and every $t'$ on the same solution trajectory:
    (Eq. 4) $f(x_{t}, t) = f(x_{t'}, t'), \quad f(x_{0}, 0) = x_{0}$
    - Under this condition, every point on the ODE sampling trajectory is tied directly to its origin in $p_{0}(x)$, so one-step sampling $f(x_{T}, T)$ becomes possible
  2. A consistency model can be built either by consistency training or by distillation from a pre-trained diffusion-based teacher model
    - Distillation ties the model to a teacher, which complicates the TTS pipeline
    - CM-TTS therefore designs its consistency model around consistency training
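
The 2-step sampling procedure and its cost can be sketched as follows. This is a minimal illustration, not the paper's implementation. It assumes zero drift and an EDM-style noise level equal to $t$, under which the probability flow ODE reduces to $dx/dt = (x - f(x, t)) / t$. The denoiser `ideal_denoiser`, the toy sample `X0`, and the step count are hypothetical, and the values 80 and 0.002 are the $T_{max}$ and $\epsilon$ quoted later in the post.

```python
import numpy as np

# Toy "clean" sample; in TTS this would be a mel-spectrogram (illustrative assumption).
X0 = np.array([0.5, -1.0, 2.0])

def ideal_denoiser(x_t, t):
    """Hypothetical denoiser f(x_t, t) for a data distribution that is the single point X0."""
    return X0

def euler_pf_ode_sample(denoiser, x_T, t_max=80.0, eps=0.002, num_steps=50):
    """Integrate dx/dt = (x - f(x, t)) / t from t_max down to eps with Euler steps."""
    ts = np.linspace(t_max, eps, num_steps + 1)
    x = x_T
    for t_cur, t_next in zip(ts[:-1], ts[1:]):
        dx_dt = (x - denoiser(x, t_cur)) / t_cur  # one denoiser call per solver step
        x = x + (t_next - t_cur) * dx_dt
    return x

rng = np.random.default_rng(0)
# Step 1: draw a sample from the noise distribution.
x_T = X0 + 80.0 * rng.standard_normal(X0.shape)
# Step 2: denoise with the ODE solver (num_steps denoiser calls).
x_multi = euler_pf_ode_sample(ideal_denoiser, x_T)
# A consistency function maps the same x_T to the trajectory origin in one call.
x_single = ideal_denoiser(x_T, 80.0)
```

In this toy setting both routes land near `X0`, but the solver needs one network evaluation per step while the consistency function needs only one in total, which is the gap that (Eq. 4) is meant to close.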

3. CM-TTS

Diffusion models achieve excellent synthesis quality in TTS, but slow sampling makes real-time operation difficult
- CM-TTS aims to overcome this by introducing a consistency model

- Model Overview

CM-TTS consists of 4 main components
1. Phoneme encoder : processes the input text
2. Variance adaptor : predicts pitch, duration, and energy features
3. CM-decoder : generates the mel-spectrogram
4. Vocoder : converts the mel-spectrogram into a time-domain waveform, based on HiFi-GAN

(a) CM-TTS architecture (b) Decoder Training Scheme (c) ODE trajectory

- Phoneme Encoder and Variance Adaptor

The phoneme encoder adapts a feed-forward network built from transformer blocks to capture local dependencies within the phoneme sequence
- The variance adaptor follows FastSpeech2 and consists of pitch, energy, and duration predictors built from convolution blocks
  - For training, Mean Squared Error (MSE) losses against the ground truth, $L_{duration}, L_{pitch}, L_{energy}$, are used
- During training, the ground-truth duration is used to expand the phoneme encoder's hidden sequence into a frame-level hidden sequence, and the ground-truth pitch information is then integrated into it
  - At inference, the predicted duration and pitch values are used instead
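
The duration-based expansion from phoneme level to frame level can be sketched as below. This is a minimal sketch assuming a FastSpeech2-style length regulator that simply repeats each hidden vector. The function name `length_regulate` and the toy shapes are hypothetical, and pitch integration is omitted.

```python
import numpy as np

def length_regulate(hidden, durations):
    """Repeat the i-th phoneme-level hidden vector durations[i] times along the time axis."""
    return np.repeat(hidden, durations, axis=0)

# 3 phonemes with hidden dimension 2 (toy values).
phoneme_hidden = np.array([[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]])
# Frames per phoneme: ground truth during training, predictor output at inference.
durations = np.array([2, 1, 3])

frame_hidden = length_regulate(phoneme_hidden, durations)  # 2 + 1 + 3 = 6 frames
```

The only difference between training and inference in this sketch is where `durations` comes from.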

- Consistency Models

To discretize the time horizon $[\epsilon, T_{max}]$, the interval is divided into $N - 1$ sub-intervals with boundaries $t_{1} = \epsilon < t_{2} < \dots < t_{N} = T_{max}$
- Following EDM, $\epsilon$ is set to a small positive value, $\epsilon = 0.002$, to mitigate numerical instability, and $T_{max} = 80$
- Let $x$ denote a mel-spectrogram and $x_{0}$ the initial mel-spectrogram with no noise added
  1. To formulate the consistency model $f_{\theta}$, a consistency function must be learned from data by enforcing the self-consistency property defined in (Eq. 4)
  2. To guarantee $f_{\theta}(x_{0}, \epsilon) = x_{0}$, the consistency model $f_{\theta}$ is parameterized as:
    (Eq. 5) $f_{\theta}(x, t) = c_{skip}(t) x + c_{out}(t) F_{\theta}(x, t)$
    - $c_{skip}, c_{out}$ : differentiable functions satisfying $c_{skip}(\epsilon) = 1, c_{out}(\epsilon) = 0$
    - $F_{\theta}(x, t)$ : neural network
- To enforce the self-consistency property, a target model $\theta^{-}$ is maintained alongside the online network $\theta$
  1. The weights of the target network $\theta^{-}$ are updated as an Exponential Moving Average (EMA) of the parameters $\theta$ being trained:
    (Eq. 6) $\theta^{-} \leftarrow \text{stopgrad}(\mu \theta^{-} + (1 - \mu) \theta)$
    - $\mu$ : EMA decay rate here, not the drift coefficient of (Eq. 1)
  2. The resulting consistency loss $L_{CT}^{N}(\theta, \theta^{-})$ is:
    (Eq. 7) $\sum_{n \geq 1} \mathbb{E} [\lambda(t_{n}) d(f_{\theta}(x_{t + 1}), f_{\theta^{-}}(x_{t}))]$
    - $d(\cdot, \cdot)$ : a function measuring the distance between two samples, such as the squared $l_{2}$ distance $d(x, y) = \| x - y \|_{2}^{2}$
- $x_{t + 1}$ and $x_{t}$ are obtained by starting from a mel-spectrogram of the training data, $x_{0} \sim D$ (the dataset), applying the forward diffusion process, and sampling two points along the probability flow ODE trajectory:
  (Eq. 8) $x_{t + 1} = x_{0} + t_{n + 1} z, \quad x_{t} = x_{0} + t_{n} z$
  - $z \sim N(0, I)$
  - Here the step $t_{n}$ is:
  (Eq. 9) $t_{n} = \left[ T_{max}^{\frac{1}{p}} + \frac{n - 1}{N - 1} \left( \epsilon^{\frac{1}{p}} - T_{max}^{\frac{1}{p}} \right) \right]^{p}$
  - $N$ : number of discretization points, giving the $N - 1$ sub-intervals above; $n$ : index sampled from $[1, N - 1]$ by the weighted sampling strategy; the exponent is set to $p = 7$
- Like DiffGAN-TTS, CM-TTS uses a non-causal WaveNet structure for the $F_{\theta}(x, t)$ architecture
  1. It differs in how $t$ is handled during sampling, and CM-TTS uses two decoders with the same architecture, $f_{\theta}$ and $f_{\theta^{-}}$, as the online and target networks respectively
  2. In addition, the diffusion process of CM-TTS is characterized by (Eq. 8), whereas DiffGAN-TTS uses a parameter-free $T$-step Markov chain
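
The parameterization (Eq. 5), the noisy pair (Eq. 8), and the schedule (Eq. 9) can be sketched together as below. This is a sketch under stated assumptions, not the paper's code. The post only fixes the boundary values $c_{skip}(\epsilon) = 1$ and $c_{out}(\epsilon) = 0$, so the concrete forms of `c_skip` and `c_out` and the value `SIGMA_DATA = 0.5` are borrowed from common practice in the consistency-model literature, and `F_toy` is a hypothetical stand-in for the WaveNet decoder. Evaluated as written, (Eq. 9) gives $T_{max}$ at $n = 1$ and $\epsilon$ at $n = N$.

```python
import numpy as np

EPS, T_MAX, P = 0.002, 80.0, 7.0  # values quoted in the text
SIGMA_DATA = 0.5                   # assumption: not given in the text

def t_schedule(n, N):
    """(Eq. 9), implemented exactly as written in the text."""
    frac = (n - 1) / (N - 1)
    return (T_MAX ** (1 / P) + frac * (EPS ** (1 / P) - T_MAX ** (1 / P))) ** P

def c_skip(t):
    """Assumed form; equals 1 at t = EPS."""
    return SIGMA_DATA**2 / ((t - EPS) ** 2 + SIGMA_DATA**2)

def c_out(t):
    """Assumed form; equals 0 at t = EPS."""
    return SIGMA_DATA * (t - EPS) / np.sqrt(SIGMA_DATA**2 + t**2)

def consistency_fn(F_theta, x, t):
    """(Eq. 5): skip connection plus scaled network output."""
    return c_skip(t) * x + c_out(t) * F_theta(x, t)

def noisy_pair(x0, t_np1, t_n, rng):
    """(Eq. 8): two points that share the same noise z, at times t_{n+1} and t_n."""
    z = rng.standard_normal(x0.shape)
    return x0 + t_np1 * z, x0 + t_n * z

rng = np.random.default_rng(0)
x0 = rng.standard_normal((4, 80))   # toy "mel-spectrogram": 4 frames, 80 bins
F_toy = lambda x, t: np.tanh(x)     # hypothetical stand-in for F_theta
N = 50
t_n, t_np1 = t_schedule(10, N), t_schedule(11, N)
x_tp1, x_t = noisy_pair(x0, t_np1, t_n, rng)
boundary_output = consistency_fn(F_toy, x0, EPS)  # reproduces x0 by construction
```

Because the skip coefficient is 1 and the output coefficient is 0 at $t = \epsilon$, the boundary condition holds for any network, which is the point of the parameterization.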

- Training and Loss

The paper trains with two decoders, the online $f_{\theta}$ and the target $f_{\theta^{-}}$
- First, the states $x_{t + 1}$ and $x_{t}$ are passed through the online and target networks to obtain the mel predictions $f_{\theta}(x_{0} + t_{n + 1} z)$ and $f_{\theta^{-}}(x_{0} + t_{n} z)$, respectively
  - The online network is updated by gradients of the MSE loss between this pair of predictions, while the weights of the target network are updated via EMA
- During training, the online and target networks interact iteratively, which contributes to mutual learning and model stability
  1. The mel reconstruction loss $L_{mel}$ is computed as the Mean Absolute Error (MAE) between the ground-truth and generated mel-spectrograms
  2. $L_{recon}$ is then:
    (Eq. 10) $L_{recon} = L_{mel}(x_{0}, \hat{x}_{0}) + \lambda_{d} L_{duration}(d, \hat{d}) + \lambda_{p} L_{pitch}(p, \hat{p}) + \lambda_{e} L_{energy}(e, \hat{e})$
    - $d, p, e$ : ground-truth duration, pitch, energy
    - $\hat{d}, \hat{p}, \hat{e}$ : predicted values
    - $\lambda_{d}, \lambda_{p}, \lambda_{e}$ : weights of each loss component, set to 0.1 in the paper
  3. The optimization objective for training CM-TTS is therefore:
    (Eq. 11) $L_{CM-TTS} = L_{CT}^{N}(\theta, \theta^{-}) + L_{recon}$
- At inference, a single forward pass through $f_{\theta}$ gives single-step generation
  - Alternatively, multi-step generation that alternates denoising and noise injection steps, as in the figure below, can further improve quality
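
One term of the consistency loss (Eq. 7), the EMA update (Eq. 6), and multi-step generation can be sketched as follows. This is an illustrative sketch, not the paper's training loop. It assumes $\lambda(t_{n}) = 1$, an MSE distance, a hypothetical one-weight network $F_{\theta}(x, t) = w x$, an EMA rate of 0.999, and the `c_skip`/`c_out` forms common in the consistency-model literature. The noise-injection rule in `multistep_sample` follows the usual multistep consistency sampling recipe, which the post does not spell out. Gradient computation is left out.

```python
import numpy as np

EPS, T_MAX = 0.002, 80.0
SIGMA_DATA = 0.5  # assumption: not given in the text

def f(params, x, t):
    """(Eq. 5) with a toy network F(x, t) = w * x (hypothetical)."""
    c_skip = SIGMA_DATA**2 / ((t - EPS) ** 2 + SIGMA_DATA**2)
    c_out = SIGMA_DATA * (t - EPS) / np.sqrt(SIGMA_DATA**2 + t**2)
    return c_skip * x + c_out * params["w"] * x

def consistency_loss(online, target, x0, t_n, t_np1, z):
    """One term of (Eq. 7), assuming lambda(t_n) = 1 and an MSE distance."""
    pred_online = f(online, x0 + t_np1 * z, t_np1)
    pred_target = f(target, x0 + t_n * z, t_n)  # constant w.r.t. the online weights
    return np.mean((pred_online - pred_target) ** 2)

def ema_update(target, online, mu=0.999):
    """(Eq. 6): the target weights track the online weights."""
    return {k: mu * target[k] + (1.0 - mu) * online[k] for k in target}

def multistep_sample(params, shape, taus, rng):
    """Denoise once from pure noise, then alternate noise injection and denoising."""
    x = f(params, T_MAX * rng.standard_normal(shape), T_MAX)
    for tau in taus:  # decreasing intermediate times (assumed to be user-chosen)
        x_tau = x + np.sqrt(tau**2 - EPS**2) * rng.standard_normal(shape)
        x = f(params, x_tau, tau)
    return x

rng = np.random.default_rng(0)
x0 = rng.standard_normal((4, 80))
online, target = {"w": 0.3}, {"w": 0.1}
z = rng.standard_normal(x0.shape)

loss = consistency_loss(online, target, x0, t_n=1.0, t_np1=1.5, z=z)
target = ema_update(target, online)
one_step = multistep_sample(online, x0.shape, taus=[], rng=rng)
four_step = multistep_sample(online, x0.shape, taus=[10.0, 1.0, 0.1], rng=rng)
```

With an empty `taus` list the sampler reduces to the single forward pass described above, and each extra entry adds one more network evaluation.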

- Weighted Sampler

The training procedure depends on how the time step $t_{n}$ defined in (Eq. 9) is sampled
- To study the effect of sampling different positions ($t_{n}$) along the ODE trajectory, the paper compares 3 sampling strategies
  - Each strategy controls the probability of selecting each step $t_{n}$ throughout training
- In the forward diffusion process during training, the variable $n \in [1, N - 1]$ is the index of the sampling point and is used in (Eq. 9) to compute $t_{n}$
  - Let $c_{n}$ be the weight the sampler assigns to index $n$; the probability of selecting index $n$ is then $s_{n} = \frac{c_{n}}{\sum_{i = 1}^{N - 1} c_{i}}$
- The 3 sampler designs compared in the paper are:
  1. Uniform Sampler
    - Baseline sampler in which every point is chosen with equal probability, $c_{n} = 1$
  2. Linear Sampler
    - The sampling weight changes linearly with the sampling point
    - Defined as $c_{n} = \alpha \cdot n$, with $\alpha = 1$ in the paper
  3. Importance Sampler (IS)
    - Assigns a weight to each sampling point, defined as $c_{n} = (1 - \phi) \frac{\sum_{j = 1}^{H} L(n, j)}{\sum_{i = 1}^{N - 1} \sum_{j = 1}^{H} L(i, j)} + \phi$
    - $L \in \mathbb{R}^{(N - 1) \times H}$ : matrix recording the historical losses of all sampling points, where $H$ is the number of historical losses stored per point (set to 10 in the paper)
    - $\phi$ : balancing factor that adjusts $c_{n}$
    - IS adjusts the current sampling probability based on historical losses, prioritizing the points that matter most for model training
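
The three samplers and the selection probability $s_{n}$ can be sketched as below. This is a minimal sketch, not the authors' implementation. The function names, the value `phi=0.1`, and the first-in-first-out update in `record_loss` are assumptions made for illustration, since the post only says that $H$ historical losses are stored per point.

```python
import numpy as np

def uniform_weights(N):
    """Uniform sampler: c_n = 1 for n = 1..N-1."""
    return np.ones(N - 1)

def linear_weights(N, alpha=1.0):
    """Linear sampler: c_n = alpha * n."""
    return alpha * np.arange(1, N)

def importance_weights(loss_history, phi):
    """Importance sampler: loss share of each point, blended with the constant phi."""
    per_point = loss_history.sum(axis=1)  # sum over the H stored losses
    return (1.0 - phi) * per_point / loss_history.sum() + phi

def selection_probs(c):
    """s_n = c_n / sum_i c_i."""
    return c / c.sum()

def sample_index(c, rng):
    """Draw an index n in [1, N-1] with probability s_n."""
    return int(rng.choice(np.arange(1, len(c) + 1), p=selection_probs(c)))

def record_loss(loss_history, n, value):
    """Assumed FIFO update: drop the oldest loss of point n and append the newest."""
    row = n - 1
    loss_history[row] = np.roll(loss_history[row], -1)
    loss_history[row, -1] = value

N, H = 5, 10
rng = np.random.default_rng(0)
loss_history = np.ones((N - 1, H))   # start from equal losses to avoid division by zero
loss_history[2] = 3.0                # pretend point n = 3 has been harder so far

p_uniform = selection_probs(uniform_weights(N))
p_linear = selection_probs(linear_weights(N))
c_is = importance_weights(loss_history, phi=0.1)  # phi is a hypothetical value
p_is = selection_probs(c_is)
n = sample_index(c_is, rng)
record_loss(loss_history, n, value=0.5)
```

In this toy setup the importance sampler puts the most probability on the point with the largest accumulated loss, while the additive $\phi$ keeps every point reachable.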

4. Experiments

- Settings

Dataset : VCTK, LJSpeech
Comparisons : FastSpeech2, VITS, DiffSpeech, DiffGAN-TTS

- Results

Comparison with Baselines
- CM-TTS achieves the best performance among the compared models
- Notably, it performs well even with single-step generation ($T = 1$)

Few-step Speech Generation
- With a single step, CM-TTS consistently outperforms DiffGAN-TTS
- It also maintains high performance when extended to multi-step synthesis with $T = 4$

Ablation Study
- In the ablation study, removing CT or IS individually causes a large degradation in WER
- That is, each component contributes to model performance

Length Robustness During Training
- When training on variable-length sequences, padding can be included in the loss computation
  - This lets the model capture meaningful representations from both the input data and the padded segments
- As the table below shows, including padding in the loss computation improves model performance
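
The difference between including and excluding padding in the loss can be sketched as below. The post does not say how padding enters the loss, so this sketch assumes it simply means not masking padded frames out of the mel MAE. The function `mel_mae` and the toy tensors are hypothetical.

```python
import numpy as np

def mel_mae(pred, target, lengths, include_padding):
    """MAE over (batch, frames, mel bins), optionally ignoring padded frames."""
    abs_err = np.abs(pred - target)
    if include_padding:
        return abs_err.mean()  # padded frames count like real frames
    frames = np.arange(pred.shape[1])[None, :]
    mask = (frames < lengths[:, None])[:, :, None]  # (batch, frames, 1)
    n_valid = mask.sum() * pred.shape[2]
    return (abs_err * mask).sum() / n_valid

# Batch of 2 utterances padded to 4 frames with 3 mel bins; the second has 2 real frames.
target = np.zeros((2, 4, 3))
pred = np.zeros((2, 4, 3))
pred[1, 2:, :] = 1.0  # the only errors sit in the padded region
lengths = np.array([4, 2])

loss_with_padding = mel_mae(pred, target, lengths, include_padding=True)
loss_without_padding = mel_mae(pred, target, lengths, include_padding=False)
```

Here the masked loss ignores whatever the model outputs in the padded region, whereas the unmasked loss penalizes it, so the padded frames also produce a training signal.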

The Impact of Weighted Sampler
- Comparing the effect of different sampling methods on CM-TTS, the IS sampler gives the best performance
- Using the other samplers does not noticeably affect the convergence of CM-TTS

As for the generalization of the IS sampler, applying IS to DiffGAN-TTS likewise improves performance

Generalization to Unseen Speakers
- On zero-shot performance for unseen speakers, CM-TTS performs best
- CM-TTS also outperforms DiffGAN-TTS in the multi-speaker scenario
