[Paper Review] Grad-StyleSpeech: Any-Speaker Adaptive Text-to-Speech Synthesis with Diffusion Models

feVeRin 2024. 2. 9. 12:43

Grad-StyleSpeech: Any-Speaker Adaptive Text-to-Speech Synthesis with Diffusion Models

Any-speaker adaptive Text-to-Speech still falls short of satisfactorily mimicking the target speaker's style
Grad-StyleSpeech
- An any-speaker adaptive Text-to-Speech model based on a diffusion model
- Aims to generate speech similar to the target speaker given a few seconds of reference speech
Paper (ICASSP 2023) : Paper Link

1. Introduction

Text-to-Speech (TTS) is expanding from single-speaker to multi-speaker settings
- In particular, the focus is on any-speaker adaptive TTS, which can synthesize the voice of any speaker given reference speech
- Any-speaker adaptive TTS aims to synthesize speech similar to the target speaker using only a few samples from that speaker
- Looking at previous work on any-speaker adaptive TTS:
  1. Most approaches fine-tune the TTS model on transcribed (supervised) samples
    -> BUT supervised samples are hard to obtain, and updating the parameters is costly
  2. In contrast, zero-shot approaches need no fine-tuning step for unseen speakers
    -> BUT, limited by their generative modeling, they show low similarity for unseen speakers

-> Hence Grad-StyleSpeech is proposed, which uses a score-based diffusion model for zero-shot any-speaker TTS

Grad-StyleSpeech
- Introduces a style-based generative model to account for the target speaker's style
- Uses a hierarchical transformer encoder to generate a representative prior noise that the reverse diffusion process can use
  - This lets the target speaker's style be reflected when embedding the input phonemes

< Overview of Grad-StyleSpeech >

Performs any-speaker TTS in a zero-shot manner
Uses a score-based diffusion model and introduces a hierarchical transformer encoder suited to the any-speaker adaptive setting
As a result, it outperforms other existing any-speaker TTS models

2. Method

Speaker adaptive TTS generates speech from a text transcription and the target speaker's reference speech
- To do so, it must synthesize audio features such as a mel-spectrogram
- Let the text $\mathbf{x} = [\mathbf{x}_1, \ldots, \mathbf{x}_n]$ consist of phonemes, and let the reference speech be $\mathbf{Y} = [\mathbf{y}_1, \ldots, \mathbf{y}_m] \in \mathbb{R}^{m \times 80}$
  - The goal of the TTS model is then to generate the ground-truth speech $\tilde{\mathbf{Y}}$
- Grad-StyleSpeech consists of 3 parts
  - Mel-Style Encoder, which embeds the reference speech into a style vector
  - Hierarchical Transformer Encoder, which produces a representation conditioned on the text and the style vector
  - Diffusion Model, which generates the mel-spectrogram through denoising steps

- Mel-Style Encoder

The mel-style encoder embeds the reference speech into a latent style vector
- Formally, this is written as $\mathbf{s} = h_{\psi}(\mathbf{Y})$
  - $\mathbf{s} \in \mathbb{R}^{d'}$ : style vector
  - $h_{\psi}$ : mel-style encoder parameterized by $\psi$
- Structurally, the mel-style encoder
  - consists of a spectral/temporal processor, a transformer layer, and temporal average pooling
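The last stage of the mel-style encoder, temporal average pooling, is what turns a variable-length reference clip into a fixed-size style vector. A minimal sketch of that pooling step alone follows. The function name `temporal_average_pool` and the toy frame values are assumptions for illustration, and the spectral/temporal processor and transformer layer are omitted.

```python
def temporal_average_pool(frames):
    """Average a sequence of per-frame feature vectors over the time axis."""
    if not frames:
        raise ValueError("need at least one frame")
    dim = len(frames[0])
    return [sum(f[k] for f in frames) / len(frames) for k in range(dim)]

# Hypothetical per-frame features for two clips of different lengths.
short_clip = [[1.0, 2.0], [3.0, 4.0]]
long_clip = [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0]]

# Both clips map to a vector whose size depends only on the feature dimension.
s_short = temporal_average_pool(short_clip)
s_long = temporal_average_pool(long_clip)
print(s_short, s_long)
```

Because the time axis is averaged out, the length m of the reference speech does not affect the size of the style vector.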

- Score-based Diffusion Model

A diffusion model generates samples by gradually denoising noise sampled from a prior noise distribution, the unit Gaussian $\mathcal{N}(0, I)$
- Grad-StyleSpeech uses the SDE-based denoising process adopted by Grad-TTS instead of a Markov chain

Forward Diffusion Process
- The forward diffusion process gradually adds noise drawn from the noise distribution $\mathcal{N}(0, I)$ to a sample drawn from the sample distribution $\mathbf{Y}_0 \sim p_0$
- Define the differential equation of the forward diffusion process as:
  $d\mathbf{Y}_t = -\frac{1}{2}\beta(t)\mathbf{Y}_t dt + \sqrt{\beta(t)} d\mathbf{W}_t$
  - $t \in [0, T]$ : continuous time step, $\beta(t)$ : noise scheduling function, $\mathbf{W}_t$ : standard Wiener process
- Grad-TTS instead recommends denoising from a data-driven prior noise distribution $\mathcal{N}(\mu, I)$:
  (Eq. 1) $d\mathbf{Y}_t = -\frac{1}{2}\beta(t)(\mathbf{Y}_t - \mu) dt + \sqrt{\beta(t)} d\mathbf{W}_t$
  - $\mu$ : text- and style-conditioned representation from a neural network
- The transition kernel $p_{0t}(\mathbf{Y}_t | \mathbf{Y}_0)$ is then Gaussian and therefore tractable:
  (Eq. 2) $p_{0t}(\mathbf{Y}_t | \mathbf{Y}_0) = \mathcal{N}(\mathbf{Y}_t; \gamma_t, \sigma_t^2), \quad \sigma_t^2 = I - e^{-\int_0^t \beta(s) ds}$
  $\gamma_t = \left(I - e^{-\frac{1}{2}\int_0^t \beta(s) ds}\right)\mu + e^{-\frac{1}{2}\int_0^t \beta(s) ds}\mathbf{Y}_0$
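Because the transition kernel is Gaussian, a noisy $\mathbf{Y}_t$ can be drawn in one shot without simulating the SDE. The sketch below takes the mean as the interpolation between $\mu$ and $\mathbf{Y}_0$ and the variance as $1 - e^{-\int_0^t \beta(s) ds}$. The linear schedule, its endpoints `BETA_0` and `BETA_1`, the horizon $T = 1$, and the function names are all assumptions made for illustration, not values stated in this review.

```python
import math
import random

BETA_0, BETA_1 = 0.05, 20.0  # assumed endpoints of a linear schedule on t in [0, 1]

def beta_integral(t):
    """Closed form of the integral of beta(s) = BETA_0 + (BETA_1 - BETA_0) * s over [0, t]."""
    return BETA_0 * t + 0.5 * (BETA_1 - BETA_0) * t * t

def kernel_params(y0, mu, t):
    """Mean gamma_t (per element) and scalar variance sigma_t^2 of the kernel."""
    decay = math.exp(-0.5 * beta_integral(t))
    gamma = [(1.0 - decay) * m + decay * y for y, m in zip(y0, mu)]
    var = 1.0 - math.exp(-beta_integral(t))
    return gamma, var

def sample_yt(y0, mu, t, rng):
    """Draw Y_t = gamma_t + sigma_t * xi with xi ~ N(0, I)."""
    gamma, var = kernel_params(y0, mu, t)
    std = math.sqrt(var)
    return [g + std * rng.gauss(0.0, 1.0) for g in gamma]

y0 = [1.0, -2.0]   # hypothetical clean mel values
mu = [0.5, 0.5]    # hypothetical encoder output
rng = random.Random(0)
for t in (0.0, 0.5, 1.0):
    gamma, var = kernel_params(y0, mu, t)
    print(t, gamma, var, sample_yt(y0, mu, t, rng))
```

At $t = 0$ the kernel collapses onto $\mathbf{Y}_0$, and as $t$ approaches $T$ the mean approaches $\mu$ and the variance approaches 1, which is how the prior $\mathcal{N}(\mu, I)$ arises.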
Reverse Diffusion Process
- The reverse diffusion process gradually inverts noise from $p_T$ into data samples from $p_0$
- The reverse process for (Eq. 1) is therefore given by the reverse-time SDE:
  $d\mathbf{Y}_t = \left[-\frac{1}{2}\beta(t)(\mathbf{Y}_t - \mu) - \beta(t)\nabla_{\mathbf{Y}_t}\log p_t(\mathbf{Y}_t)\right] dt + \sqrt{\beta(t)} d\tilde{\mathbf{W}}_t$
  - $\tilde{\mathbf{W}}_t$ : reverse Wiener process, $\nabla_{\mathbf{Y}_t}\log p_t(\mathbf{Y}_t)$ : score function of the data distribution $p_t(\mathbf{Y}_t)$
- To solve the reverse SDE, a numerical SDE solver can be used to generate a sample $\mathbf{Y}_0$ from the noise $\mathbf{Y}_T$
  - Since the exact score is hard to obtain, a neural network $\epsilon_\theta(\mathbf{Y}_t, t, \mu, \mathbf{s})$ is used to estimate it
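To make the numerical solver concrete, here is a minimal Euler-Maruyama sketch of the reverse-time SDE. It is not the paper's sampler. The trained network $\epsilon_\theta$ is replaced by a closed-form toy score that is exact only when the data distribution is a single point, and the linear schedule, its endpoints, $T = 1$, the step count, and all function names are assumptions for illustration.

```python
import math
import random

BETA_0, BETA_1 = 0.05, 20.0  # assumed linear noise schedule on t in [0, 1]

def beta(t):
    return BETA_0 + (BETA_1 - BETA_0) * t

def beta_integral(t):
    return BETA_0 * t + 0.5 * (BETA_1 - BETA_0) * t * t

def toy_score(y, t, mu, y0):
    """Exact score of p_t when the data distribution is the single point y0.
    Stands in for the learned score network in this sketch."""
    decay = math.exp(-0.5 * beta_integral(t))
    gamma = (1.0 - decay) * mu + decay * y0
    var = 1.0 - math.exp(-beta_integral(t))
    return -(y - gamma) / var

def reverse_sde_sample(mu, score_fn, n_steps, rng):
    """Integrate the reverse-time SDE from t = 1 down to t = 0."""
    h = 1.0 / n_steps
    y = [m + rng.gauss(0.0, 1.0) for m in mu]  # Y_T ~ N(mu, I)
    for i in range(n_steps, 0, -1):
        t = i * h
        b = beta(t)
        for k in range(len(y)):
            drift = -0.5 * b * (y[k] - mu[k]) - b * score_fn(y[k], t, k)
            # Stepping backwards in time flips the sign of the drift term.
            y[k] = y[k] - drift * h + math.sqrt(b * h) * rng.gauss(0.0, 1.0)
    return y

mu = [0.2, -0.4]       # hypothetical encoder output
target = [1.0, -1.5]   # hypothetical data point the toy score points toward
rng = random.Random(0)
sample = reverse_sde_sample(
    mu, lambda y, t, k: toy_score(y, t, mu[k], target[k]), 1000, rng
)
print(sample)
```

Starting from noise centered on `mu`, the trajectory is pulled toward `target` as $t$ decreases, which mirrors how the prior mean $\mu$ gives the sampler a better starting point than zero-mean noise.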

- Hierarchical Transformer Encoder

In multi-speaker TTS with a diffusion model, $\mu$ plays an important role
- The encoder is therefore built as a 3-level hierarchy
  1. First, the text encoder $f_\lambda$ maps the input text to a hidden representation through multiple transformer blocks, giving a contextual representation of the phoneme sequence:
    $\mathbf{H} = f_\lambda(\mathbf{x}) \in \mathbb{R}^{n \times d}$
  2. Next, an unsupervised alignment learning framework computes the alignment between the input text $\mathbf{x}$ and the target speech $\mathbf{Y}$, and regulates the output length of the text encoder to the length of the target speech:
    $\text{Align}(\mathbf{H}, \mathbf{x}, \mathbf{Y}) = \tilde{\mathbf{H}} \in \mathbb{R}^{m \times d}$
    - In addition, a duration predictor is introduced to predict the duration of each phoneme
  3. Finally, the length-regulated embedding sequence is encoded through style-adaptive transformer blocks to obtain a speaker-adaptive hidden representation:
    $\mu = g_\phi(\tilde{\mathbf{H}}, \mathbf{s})$
    - $\mathbf{s}$ : style vector
- Structurally, Style-Adaptive Layer Normalization (SALN) injects style information into the transformer blocks of the style-adaptive encoder
  - In this way, the hierarchical transformer encoder obtains a hidden representation $\mu$ that reflects both the linguistic content of the input text $\mathbf{x}$ and the style information of the style vector $\mathbf{s}$
  - The resulting $\mu$ is used to build the style-conditioned prior noise distribution of the denoising diffusion model
- As in Grad-TTS, a prior loss $\mathcal{L}_{prior} = \|\mu - \mathbf{Y}\|_2^2$ is applied
  - It is optimized by minimizing the $L_2$ distance between $\mu$ and $\mathbf{Y}$
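The second and third levels can be sketched in a few lines: repeating each phoneme vector by its duration to reach $m$ frames, then applying SALN so that the gain and bias of layer normalization come from the style vector. This is a simplified sketch under assumptions. Integer durations are given by hand rather than learned by the aligner, the gain and bias are plain affine maps of $\mathbf{s}$, and the weights, shapes, and function names are hypothetical.

```python
import math

def length_regulate(hidden, durations):
    """Repeat each phoneme vector by its duration (in frames): n x d -> m x d."""
    out = []
    for vec, d in zip(hidden, durations):
        out.extend([list(vec) for _ in range(d)])
    return out

def layer_norm(vec, eps=1e-5):
    """Normalize one vector to zero mean and unit variance over its channels."""
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]

def affine(weight, bias, s):
    """Affine map from the style vector s to one value per channel."""
    return [sum(w * x for w, x in zip(row, s)) + b for row, b in zip(weight, bias)]

def saln(vec, s, w_gain, b_gain, w_bias, b_bias):
    """Layer norm whose gain and bias are predicted from the style vector."""
    gain = affine(w_gain, b_gain, s)
    bias = affine(w_bias, b_bias, s)
    return [g * v + b for g, v, b in zip(gain, layer_norm(vec), bias)]

# Hypothetical toy inputs: n = 2 phonemes, d = 2 channels, d' = 1 style dimension.
hidden = [[1.0, 3.0], [2.0, 6.0]]
durations = [2, 1]
style = [1.0]
frames = length_regulate(hidden, durations)
w_gain, b_gain = [[0.0], [0.0]], [1.0, 1.0]
w_bias, b_bias = [[0.5], [-0.5]], [0.0, 0.0]
styled = [saln(f, style, w_gain, b_gain, w_bias, b_bias) for f in frames]
print(len(frames), styled[0])
```

Changing `style` changes only the gain and bias, so the same linguistic content is shifted and scaled differently for each speaker.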

- Training

To train the score estimation network $\epsilon_\theta$
- Compute the expectation over the marginalization with respect to the tractable transition kernel $p_{0t}(\mathbf{Y}_t | \mathbf{Y}_0)$:
  (Eq. 3)
  - $\mathbf{s}$ : style vector, $\mathbf{Y}_t$ : value sampled from the Gaussian distribution of (Eq. 2)
- The exact score computation is then:
  (Eq. 4)
  - $\sigma_t = \sqrt{1 - e^{-\int_0^t \beta(s) ds}}$ : see (Eq. 2)
- Combining these with $\mathcal{L}_{align}$ for training the aligner and duration predictor, the final training objective is:
  $\mathcal{L} = \mathcal{L}_{diff} + \mathcal{L}_{prior} + \mathcal{L}_{align}$
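As a worked step, the score of the Gaussian transition kernel can be derived directly, assuming the kernel of (Eq. 2) is an isotropic Gaussian with mean $\gamma_t$ and scalar variance $\sigma_t^2$. This is the generic Gaussian score, written in this review's symbols, and is not necessarily the exact form or notation the paper uses for (Eq. 3) and (Eq. 4).

```latex
% Reparameterized sample from the kernel, with \xi \sim \mathcal{N}(0, I)
\mathbf{Y}_t = \gamma_t + \sigma_t \xi

% Log-density of the kernel, up to a constant C independent of \mathbf{Y}_t
\log p_{0t}(\mathbf{Y}_t | \mathbf{Y}_0) = -\frac{\|\mathbf{Y}_t - \gamma_t\|_2^2}{2\sigma_t^2} + C

% Gradient with respect to \mathbf{Y}_t
\nabla_{\mathbf{Y}_t} \log p_{0t}(\mathbf{Y}_t | \mathbf{Y}_0)
  = -\frac{\mathbf{Y}_t - \gamma_t}{\sigma_t^2}
  = -\frac{\xi}{\sigma_t}
```

Under this assumption the regression target for $\epsilon_\theta(\mathbf{Y}_t, t, \mu, \mathbf{s})$ depends only on the sampled noise $\xi$ and on $\sigma_t$, which is why training needs samples from the kernel rather than a simulation of the full forward SDE.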

3. Experiments

- Settings

Dataset : LibriTTS, VCTK
Comparisons : YourTTS, Grad-TTS, Meta-StyleSpeech, AdaSpeech

- Results

Comparing zero-shot adaptation performance on unseen speakers
- Grad-StyleSpeech achieves the best SECS and CER
- In particular, Grad-StyleSpeech performs far better than Grad-TTS
  - This suggests that any-speaker adaptation performance comes not only from the diffusion model but also from the hierarchical transformer encoder
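For reference, the two objective metrics can be sketched as follows, reading SECS as the cosine similarity between speaker embeddings and CER as the character-level edit distance divided by the reference length. This is a sketch of the metric arithmetic only. The real evaluation needs a speaker encoder and an ASR model, neither of which is specified here, and the vectors and strings below are made-up inputs.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def char_error_rate(reference, hypothesis):
    """Levenshtein distance between the strings divided by the reference length."""
    prev = list(range(len(hypothesis) + 1))
    for i, r in enumerate(reference, start=1):
        cur = [i]
        for j, h in enumerate(hypothesis, start=1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1] / len(reference)

# Hypothetical speaker embeddings of a reference clip and a synthesized clip.
print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))
# Hypothetical transcript versus an ASR output with one wrong character.
print(char_error_rate("speech", "speach"))
```

Higher SECS means the synthesized voice is closer to the target speaker, while lower CER means the synthesized speech is more intelligible.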

Grad-StyleSpeech also scores best in the subjective synthesis quality comparison by MOS

Looking at the mel-spectrograms of the synthesized speech,
- Grad-StyleSpeech models high-frequency components in detail by using the diffusion model
- As a result, it can overcome the over-smoothing problem

Looking at the results of fine-tuning Grad-StyleSpeech on unseen speakers,
- Fine-tuning the diffusion model and the style-adaptive encoder for 100 steps outperforms AdaSpeech
- Such fine-tuning proves effective for improving speaker similarity
