[Paper Review] IST-TTS: Interpretable Style Transfer for Text-to-Speech with ControlVAE and Diffusion Bridge

feVeRin 2024. 5. 5. 09:21

Style transfer is becoming increasingly important in Text-to-Speech
IST-TTS
- Combines a variational autoencoder (VAE) with a diffusion refiner to obtain a refined mel-spectrogram
  - A two-stage system and a one-stage system are each designed to improve audio quality and style transfer performance
- Learns complex discrete style representations through a diffusion bridge on the Quantized VAE, improving transfer performance
- Introduces ControlVAE for better transfer performance, improving reconstruction quality while securing interpretability
Paper (INTERSPEECH 2023) : Paper Link

1. Introduction

Recent Text-to-Speech (TTS) research aims to improve controllability and expressiveness
- TTS models incorporate controllable style attributes in various ways
  - FastSpeech trains its duration predictor with an autoregressive teacher model and controls duration information through a length regulator
  - FastSpeech2 uses the Montreal Forced Aligner (MFA) and trains pitch and energy predictors in a supervised manner for style control
- A TTS pipeline typically consists of an acoustic model that generates an intermediate representation and a vocoder that synthesizes the raw waveform
  - Models such as VITS instead connect the two through latent variables produced by a VAE and operate end-to-end
- Style transfer, in turn, requires a strong style encoder
  - Meta-StyleSpeech uses style-adaptive layer normalization with meta-learning, and STYLER reflects style factors through speech decomposition with an information bottleneck
  - GenerSpeech uses a multi-level style adaptor and a generalizable content adaptor
- BUT, these methods offer limited style interpretability

-> The paper therefore proposes IST-TTS, which provides better style representations and an interpretable, disentangled style latent space for TTS

IST-TTS
- Adopts a VAE-based style encoder to access an interpretable latent space, and combines it with a diffusion probabilistic model (DPM) to overcome the over-smoothing problem
- Introduces a diffusion bridge on the Quantized VAE to increase the diversity of the generated style representations
- Additionally uses ControlVAE for better reconstruction quality and interpretability

< Overall of IST-TTS >

A TTS model that combines a VAE and a DPM to obtain refined mel-spectrograms
Introduces a diffusion bridge on the Quantized VAE that models the diversity of style representations in the latent space, for better style transfer
Uses ControlVAE to achieve better reconstruction ability than a standard VAE, which results in higher quality and style interpretability

2. Background

- Diffusion Probabilistic Models

Denoising Diffusion Probabilistic Models (DDPM) show strong performance in speech synthesis
- The diffusion process and the reverse process together define a diffusion probabilistic model, and a denoising network with parameters $\theta$ is used to learn the data distribution
- Let $q(x_0)$ be the data distribution and $x_1, \ldots, x_T$ a sequence of variables with the same dimension
  1. The diffusion process is then defined as a fixed Markov chain from the data $x_0$ to the latent variable $x_T$:
    (Eq. 1) $q(x_t | x_{t-1}) = \mathcal{N}(x_t; \sqrt{1-\beta_t}\, x_{t-1}, \beta_t I)$
    (Eq. 2) $q(x_1, \ldots, x_T | x_0) = \prod_{t=1}^{T} q(x_t | x_{t-1})$
  2. The reverse process is a Markov chain from $x_T$ to $x_0$ parameterized by a shared $\theta$, and aims to recover a sample from Gaussian noise:
    (Eq. 3) $p_\theta(x_{t-1} | x_t) = \mathcal{N}(x_{t-1}; \mu_\theta(x_t, t), \sigma_t^2 I)$
    (Eq. 4) $p_\theta(x_0, \ldots, x_{T-1} | x_T) = \prod_{t=1}^{T} p_\theta(x_{t-1} | x_t)$
    - $\alpha_t = 1 - \beta_t, \, \bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$
    - $\mu_\theta, \sigma_t^2$ : mean and variance of the Gaussian, respectively
- The resulting training objective is:
  (Eq. 5) $L_{DDPM} = \mathbb{E}_{t, x_0, \epsilon}\left[ || \epsilon - \epsilon_\theta(\sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\, \epsilon, t) ||_2^2 \right]$
  - $\epsilon$ : Gaussian noise, $\epsilon_\theta(\cdot)$ : model output
- Sampling uses the following formulation:
  (Eq. 6) $x_{t-1} = \frac{1}{\sqrt{\alpha_t}}\left( x_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\, \epsilon_\theta(x_t, t) \right) + \sigma_t z$
  - $\epsilon \sim \mathcal{N}(0, I), \, p_z = \mathcal{N}(z; 0, I)$
  - $\sigma_t = \sqrt{\frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t} \beta_t}$
- Iterating this sampling step over all time steps finally yields a sample from the data distribution $p(x_0)$
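
The forward noising and the reverse sampling step summarized above can be sketched in plain Python. This is a minimal illustration, not the paper's implementation: the linear schedule and its values, the function names, and the use of the true noise in place of a trained $\epsilon_\theta$ are all assumptions made for the example.

```python
import math
import random

def linear_beta_schedule(T, beta_start=1e-4, beta_end=0.02):
    """Hypothetical linear schedule beta_1..beta_T (the values are an assumption)."""
    if T == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (T - 1)
    return [beta_start + i * step for i in range(T)]

def cumulative_alphas(betas):
    """alpha_bar_t = prod_{s<=t} (1 - beta_s) for t = 1..T."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars

def q_sample(x0, t, alpha_bars, rng):
    """Draw x_t directly from x_0 (the argument of eps_theta in Eq. 5); t is 1-indexed."""
    ab = alpha_bars[t - 1]
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [math.sqrt(ab) * x + math.sqrt(1.0 - ab) * e for x, e in zip(x0, eps)]
    return xt, eps

def sigma(t, betas, alpha_bars):
    """sigma_t as defined in the text; alpha_bar_0 is taken to be 1, so sigma_1 = 0."""
    ab_prev = alpha_bars[t - 2] if t > 1 else 1.0
    return math.sqrt((1.0 - ab_prev) / (1.0 - alpha_bars[t - 1]) * betas[t - 1])

def p_sample_step(xt, t, eps_pred, betas, alpha_bars, rng):
    """One reverse step (Eq. 6), given eps_pred standing in for eps_theta(x_t, t)."""
    beta, ab = betas[t - 1], alpha_bars[t - 1]
    coef = beta / math.sqrt(1.0 - ab)
    s = sigma(t, betas, alpha_bars)
    return [
        (x - coef * e) / math.sqrt(1.0 - beta) + s * rng.gauss(0.0, 1.0)
        for x, e in zip(xt, eps_pred)
    ]

rng = random.Random(0)
betas = linear_beta_schedule(50)
alpha_bars = cumulative_alphas(betas)
x0 = [0.5, -1.0, 2.0]
x1, eps = q_sample(x0, 1, alpha_bars, rng)
# With the true noise as the prediction and sigma_1 = 0, the step from x_1 recovers x_0.
x0_rec = p_sample_step(x1, 1, eps, betas, alpha_bars, rng)
```

In a real model `eps_pred` would come from the trained denoising network, and `p_sample_step` would be applied from `t = T` down to `t = 1`, starting from Gaussian noise.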

- Variational AutoEncoder

In a Variational AutoEncoder (VAE), the observed data distribution $p(x)$ is assumed to be generated by a random process involving a random latent variable $z$
- The true posterior distribution $p_\theta(z | x)$ is intractable because of the undifferentiable marginal likelihood $p_\theta(x)$
- To address this, $q_\phi(z | x)$ is introduced as an approximation to the true posterior distribution $p_\theta(z | x)$, which gives the following bound on $\log p_\theta(x)$:
  (Eq. 7) $\log p_\theta(x) \geq \mathbb{E}_{q_\phi(z|x)}\left[ \log \frac{p_\theta(x, z)}{q_\phi(z | x)} \right] = \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x | z)] - D_{KL}(q_\phi(z | x) \,||\, p_\theta(z))$
- The VAE encoder models a multivariate Gaussian with diagonal covariance, and the prior is a standard multivariate Gaussian:
  (Eq. 8) $q_\phi(z | x) = \mathcal{N}(z; \mu_\phi(x), \sigma_\phi^2(x) I)$
  (Eq. 9) $p_z = \mathcal{N}(z; 0, I)$
  - $\mu_\phi(x), \sigma_\phi^2(x)$ of $q_\phi(z | x)$ are learned by a neural network, and the reparameterization trick is introduced into the VAE because sampling $z$ directly is not differentiable
- As a result, each $z$ is computed as a deterministic function of the input $x$ and an auxiliary noise variable $\epsilon$:
  (Eq. 10) $z = \mu_\phi(x) + \sigma_\phi(x) \odot \epsilon$
  - $\odot$ : element-wise product
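
As a small illustration of Eq. 10 and of the KL term in Eq. 7, the sketch below samples $z$ with the reparameterization trick and evaluates the closed-form KL divergence between a diagonal Gaussian and the standard normal prior. The function names and the example values of `mu` and `sigma` are hypothetical; in a real VAE they would be encoder outputs.

```python
import math
import random

def reparameterize(mu, sigma, rng):
    """Eq. 10: z = mu + sigma * eps (element-wise), with eps ~ N(0, I)."""
    eps = [rng.gauss(0.0, 1.0) for _ in mu]
    return [m + s * e for m, s, e in zip(mu, sigma, eps)]

def kl_to_standard_normal(mu, sigma):
    """Closed-form KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over dimensions."""
    return 0.5 * sum(m * m + s * s - 1.0 - 2.0 * math.log(s) for m, s in zip(mu, sigma))

rng = random.Random(0)
mu, sigma = [0.3, -1.2], [0.5, 2.0]   # stand-ins for encoder outputs
z = reparameterize(mu, sigma, rng)
kl = kl_to_standard_normal(mu, sigma)
# The KL term vanishes when the posterior equals the prior.
kl_zero = kl_to_standard_normal([0.0, 0.0], [1.0, 1.0])
```

Because the randomness is isolated in `eps`, `z` is a differentiable function of `mu` and `sigma`, which is what lets gradients reach the encoder.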

- Quantized VAE

Quantized VAE was introduced to improve the representation ability of the standard VAE encoder
- Quantized VAE extends the VAE by adding a discrete codebook component on top of the VAE output $z$
  - $z$ is compared with every vector in the codebook, and the closest codebook vector is passed to the VAE decoder
- The vector quantization loss, composed of a codebook loss and a commitment loss, is:
  (Eq. 11) $L_Q = || sg[z] - q ||_2^2 + \gamma || z - sg[q] ||_2^2$
  - $z$ : VAE output, $q$ : codebook vector, $\gamma$ : commitment loss weight, $sg[\cdot]$ : stop gradient operation
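
The codebook lookup and the value of Eq. 11 can be illustrated with a toy codebook. This is only a sketch: the codebook entries, the input `z`, and `gamma=0.25` are assumptions chosen for the example. The stop-gradient operator changes which parameters receive gradients but not the numerical value, so both terms evaluate to the same squared distance here.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def quantize(z, codebook):
    """Return (index, vector) of the codebook entry closest to z."""
    index = min(range(len(codebook)), key=lambda k: squared_distance(z, codebook[k]))
    return index, codebook[index]

def vq_loss_value(z, q, gamma=0.25):
    """Numerical value of Eq. 11: codebook term plus gamma times commitment term."""
    d = squared_distance(z, q)
    return d + gamma * d

codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]]  # toy codebook (assumption)
z = [0.9, 1.2]                                    # stand-in for a VAE output
index, q = quantize(z, codebook)
loss = vq_loss_value(z, q)
```

In an autodiff framework the first term would update only the codebook vector and the second only the encoder, which is the role of $sg[\cdot]$.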

3. Method

IST-TTS consists of a diffusion refiner, a diffusion bridge, and ControlVAE

- Model Architecture

The reference mel-spectrogram is first fed to a reference encoder to extract style information, which then passes through ControlVAE to obtain the interpretable latent space $Z$
- The quantized embedding is obtained through the diffusion bridge and then passed to the acoustic model
  - IST-TTS uses the DiffWave architecture for the diffusion bridge in order to learn diverse style representations
- The acoustic model is based on the FastSpeech architecture, and the diffusion refiner is designed by combining the VAE with a DPM
  - Additionally, the speaker embedding is extracted with x-vectors
  - MFA is used to train the duration predictor, replacing the distillation procedure

- Diffusion Refiner

The VAE and the DPM can be integrated either as a two-stage pipeline or as a one-stage pipeline
1. Two-stage and One-stage Training Pipeline
  - In the two-stage pipeline, the model first generates an intermediate mel-spectrogram, which is fed to a vocoder to obtain the waveform
    - The paper calls this model VAEFS
  - The intermediate mel-spectrogram, processed by a linear layer, can also be given to the diffusion refiner as the condition of the diffusion model
    - The paper calls this model VAEFS+2s
  - In the one-stage pipeline, the acoustic model is instead built following Diff-TTS
    - The paper calls this model VAEFS+1s
2. Conditional Diffusion Model
  - The diffusion refiner of IST-TTS is a conditional diffusion model
    - This is because the diffusion model needs a condition, which is either the external intermediate mel-spectrogram or the decoder input
  - The training objective is:
    (Eq. 12) $L_R = \mathbb{E}_{t, x_0, \epsilon, c}\left[ || \epsilon - \epsilon_\theta(\sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\, \epsilon, t, c) ||_2^2 \right]$
    - $c$ : condition
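
To make the role of the condition $c$ in Eq. 12 concrete, the sketch below evaluates a single-sample estimate of the loss for two hand-written stand-ins for $\epsilon_\theta$: one that ignores $c$ and one that uses it. Everything here is hypothetical (the function names, passing $\bar{\alpha}_t$ instead of $t$, and the idealized case where $c$ equals the target $x_0$); it only shows that an informative condition makes the noise recoverable.

```python
import math
import random

def refiner_loss(x0, cond, alpha_bar_t, eps_model, rng):
    """Single-sample estimate of Eq. 12 for a fixed alpha_bar_t (squared L2 norm)."""
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [math.sqrt(alpha_bar_t) * x + math.sqrt(1.0 - alpha_bar_t) * e
          for x, e in zip(x0, eps)]
    pred = eps_model(xt, alpha_bar_t, cond)
    return sum((e - p) ** 2 for e, p in zip(eps, pred))

def blind_model(xt, alpha_bar_t, cond):
    """Stand-in predictor that ignores the condition and always outputs zeros."""
    return [0.0 for _ in xt]

def oracle_model(xt, alpha_bar_t, cond):
    """Stand-in predictor that treats cond as x_0 and solves the forward equation for eps."""
    return [(x - math.sqrt(alpha_bar_t) * c) / math.sqrt(1.0 - alpha_bar_t)
            for x, c in zip(xt, cond)]

x0 = [0.2, -0.7, 1.5]   # toy target mel frame (assumption)
cond = list(x0)         # idealized condition equal to the target
loss_blind = refiner_loss(x0, cond, 0.5, blind_model, random.Random(0))
loss_oracle = refiner_loss(x0, cond, 0.5, oracle_model, random.Random(0))
```

In practice the condition is only an approximation of the target, such as the intermediate mel-spectrogram, and a trained network takes the place of `oracle_model`.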

- Diffusion Bridge

Quantized VAE discretizes latent features through vector quantization to generate more expressive samples
- IST-TTS proposes a new diffusion bridge to further improve the expressiveness of the Quantized VAE
- Specifically, to learn the complex discrete distribution, a diffusion model is applied in the continuous space, i.e., the latent space of the VAE output $z$
  - The sampling process of the diffusion bridge is used only at inference time
- The training loss of the diffusion bridge is:
  (Eq. 13) $L_B = \mathbb{E}_{t, z_0, \epsilon}\left[ || \epsilon - \epsilon_\theta(\sqrt{\bar{\alpha}_t}\, z_0 + \sqrt{1-\bar{\alpha}_t}\, \epsilon, t) ||_2^2 \right]$

- ControlVAE

The original VAE can suffer from KL vanishing and low reconstruction quality, so IST-TTS introduces ControlVAE, which combines a controller with a basic VAE
- In particular, a non-linear proportional-integral (PI) controller uses the output KL-divergence as feedback during training to automatically tune the weight applied to the KL term of the VAE objective
- The weight $\beta(t)$ from the PI controller is:
  (Eq. 14) $\beta(t) = \frac{K_p}{1 + \exp(e(t))} - K_i \sum_{j=0}^{t} e(j) + \beta_{\min}$
  - $K_p, K_i$ : coefficients of the proportional term and the integral term, respectively
  - $e(t)$ : error between the actual KL value and the predicted KL value, $\beta_{\min}$ : constant
- The ControlVAE loss is then:
  (Eq. 15) $L_C = \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x | z)] - \beta(t) D_{KL}(q_\phi(z | x) \,||\, p(z))$
  - In the one-stage pipeline, the reconstruction loss is computed through an auxiliary feed-forward transformer decoder, as in DiffSinger
- Finally, the total training loss of IST-TTS is:
  (Eq. 16) $L_{All} = L_C + L_R + L_Q + L_B$
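
The PI update of Eq. 14 can be written as a small stateful controller. This is a literal sketch of the formula, not the paper's training code: the class name, the gains `k_p` and `k_i`, and the KL values are hypothetical, and the error is assumed to be the desired KL minus the observed KL, following the ControlVAE convention. No clamping or anti-windup is applied, so $\beta(t)$ can leave the range a practical implementation would enforce.

```python
import math

class PIController:
    """Literal implementation of Eq. 14 (no clamping or anti-windup)."""

    def __init__(self, k_p, k_i, beta_min=0.0):
        self.k_p, self.k_i, self.beta_min = k_p, k_i, beta_min
        self.error_sum = 0.0  # running sum of e(j) for j = 0..t

    def step(self, desired_kl, observed_kl):
        """Return beta(t) for the current training step."""
        error = desired_kl - observed_kl  # assumption: set point minus output KL
        self.error_sum += error
        proportional = self.k_p / (1.0 + math.exp(error))
        return proportional - self.k_i * self.error_sum + self.beta_min

# KL exactly on target: only the proportional term contributes, giving K_p / 2.
beta_on_target = PIController(k_p=0.01, k_i=0.001).step(3.0, 3.0)
# KL above target: beta increases, so the KL term is penalized more strongly.
beta_kl_high = PIController(k_p=0.01, k_i=0.001).step(3.0, 5.0)
# KL below target: beta decreases, which counteracts KL vanishing.
beta_kl_low = PIController(k_p=0.01, k_i=0.001).step(3.0, 1.0)
```

The returned value would multiply the KL term of Eq. 15 at each training step, with `observed_kl` taken from the current batch.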

4. Experiments

- Settings

Dataset : LibriTTS
Comparisons : GenerSpeech

- Results

Parallel Style Transfer
- IST-TTS achieves the best results on quantitative metrics such as FD and MCD
- IST-TTS also achieves the best MOS and SMOS

Non-Parallel Style Transfer
- Non-parallel style transfer is the case where the text differs from that of the reference utterance
- The proposed IST-TTS again shows the best performance in non-parallel style transfer

Ablation Study
- Removing any component of IST-TTS degrades performance
- In addition, the synthesized mel-spectrograms indicate that the diffusion refiner resolves the over-smoothing problem of VAEFS

Style Interpretability
- Examining disentanglement in the ControlVAE latent space $z$ reveals various speaking styles such as energy and pitch, as shown in the figure
