[Paper Review] DiffVoice: Text-to-Speech with Latent Diffusion

feVeRin 2024. 1. 25. 13:41

DiffVoice: Text-to-Speech with Latent Diffusion

Latent diffusion can be used to improve the performance of Text-to-Speech models
DiffVoice
- Encodes the speech signal into a phoneme-rate representation with a variational autoencoder trained adversarially
- Jointly models the latent representation and the durations with a diffusion model
Paper (ICASSP 2023) : Paper Link

1. Introduction

Diffusion models have shown excellent performance on synthesis tasks
- In Text-to-Speech (TTS), they are applied to the acoustic model, which generates a log mel-spectrogram from a text input
- However, directly modelling the data density $p(x_0)$ of $x_0 \in \mathbb{R}^d$ with a diffusion model has several drawbacks
  1. The intermediate latent variable $x_t$ is constrained to have the same shape as $x_0$
    - This is inefficient, because diffusion sampling requires repeated evaluation of the score estimator $s_\theta(x_t, t)$
  2. A diffusion model tries to capture every mode of $p(x_0)$, so it spends much of its modelling capacity on imperceptible details
- Latent diffusion models can mitigate these problems
  - An encoder $f_\phi(\cdot)$ first encodes the data into a latent code $z_0 = f_\phi(x_0)$,
  - then a diffusion model models the latent density $p_\phi(z_0)$, and a decoder generates $g_\psi(z_0)$

-> DiffVoice, a diffusion TTS model based on latent diffusion models, is therefore proposed

DiffVoice
- Uses a VAE-GAN-based autoencoder that supports down-sampling along the time axis
  - This encodes a mel-spectrogram $y \in \mathbb{R}^{N \times D_{\text{mel}}}$ into a latent code $z_0 \in \mathbb{R}^{M \times D_{\text{latent}}}$
  - $N$ : number of frames, $M$ : number of phonemes
- With dynamic-rate down-sampling, a single diffusion model in the latent space jointly models phoneme durations and mel-spectrograms
  - Because phoneme durations are modelled jointly with the other factors, generic inverse-problem-solving algorithms can be combined with DiffVoice

< Overview of DiffVoice >

Encodes the speech signal into a phoneme-rate representation, using adversarial training
Uses a latent diffusion model to jointly model the latent representation and the phoneme durations
As a result, it achieves high TTS quality and can be extended to zero-shot tasks

2. DiffVoice

Let $y \in \mathbb{R}^{N \times D_{\text{mel}}}$ be a log mel-spectrogram.
- $N$ : number of frames, $D_{\text{mel}}$ : mel filter-bank size
- $w \in \Sigma^M$ is the corresponding phoneme sequence, where $\Sigma$ is the set of all phonemes

- Dynamic Down-Sampling of Speech

DiffVoice uses a Variational AutoEncoder (VAE) to encode speech into a compact latent space
- To obtain the alignment between $w$ and $y$, DiffVoice uses a CTC-based ASR model trained on phoneme sequences
  - Minimal-CTC is used to ensure that exactly one sharp spike is produced for each phoneme
- For each $w_i$ in $w = (w_i)_{i=1}^{M}$, the position of its CTC alignment is $a_i$ in $a = (a_i)_{i=1}^{M}$
  - Since $a$ is strictly increasing, we can set $a_0 := 0$ and $d_i := (a_i - a_{i-1})$
  - The positive sequence $d = (d_i)_{i=1}^{M}$ contains the phoneme durations
- The approximate posterior $q_\phi(z_0 | y, a)$ is defined as follows:
  1. $y$ is processed by the encoder Conformer,
  2. The output frame-rate latent representation $e \in \mathbb{R}^{N \times D_{\text{enc}}}$ is down-sampled to $\tilde{e} \in \mathbb{R}^{M \times D_{\text{enc}}}$ by gathering its values at the frames $(a_i)_{i=1}^{M}$
  3. $\tilde{e}$ is then linearly projected and split to produce the mean $\mu \in \mathbb{R}^{M \times D_{\text{latent}}}$ and the log-variance $\log \sigma \in \mathbb{R}^{M \times D_{\text{latent}}}$
  4. Finally, $z_0 \in \mathbb{R}^{M \times D_{\text{latent}}}$ is distributed as:
    $q_\phi(z_0 | y, a) := \mathcal{N}(z_0; \mu, \sigma) = \prod_{i=1}^{M} \prod_{k=1}^{D_{\text{latent}}} \mathcal{N}((z_0)_{i,k}; \mu_{i,k}, \sigma_{i,k})$
- The prior $p(z_0)$ is a standard Normal density:
  $p(z_0) := \prod_{i=1}^{M} \prod_{k=1}^{D_{\text{latent}}} \mathcal{N}((z_0)_{i,k}; 0, 1)$
- The conditional density $p_\psi(y | z_0, a)$ is defined as follows:
  1. $z_0 \in \mathbb{R}^{M \times D_{\text{latent}}}$ is up-sampled to $\tilde{z}_0 \in \mathbb{R}^{N \times D_{\text{latent}}}$ according to the alignment $a$
    - Here $\forall 1 \leq i \leq M : (\tilde{z}_0)_{a_i, \cdot} = (z_0)_{i, \cdot}$, and $\forall j \notin a : (\tilde{z}_0)_{j, \cdot} = 0$
  2. $\tilde{z}_0$ is then passed to the decoder Conformer to obtain $h \in \mathbb{R}^{N \times D_{\text{dec}}}$, which is linearly projected at each frame to $\tilde{y} \in \mathbb{R}^{N \times D_{\text{mel}}}$
  3. Finally, $p_\psi$ is:
    $p_\psi(y | z_0, a) := \prod_{j=1}^{N} \prod_{k=1}^{D_{\text{mel}}} \frac{1}{2b} \exp\left(-\frac{|y_{j,k} - \tilde{y}_{j,k}|}{b}\right)$
    - $b \in (0, \infty)$ : a hyper-parameter that controls training
- The DiffVoice VAE is therefore trained by optimizing the evidence lower bound $\mathcal{L}_{\text{VAE}} = \mathbf{E}_{(y,a)}[\mathcal{L}_{\phi,\psi}(y, a)]$
  - $\mathcal{L}_{\phi,\psi}(y, a) = -D_{\text{KL}}(q_\phi(z_0 | y, a) \,\|\, p(z_0)) + \mathbf{E}_{q_\phi(z_0 | y, a)} \log p_\psi(y | z_0, a)$
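
The alignment bookkeeping described in this section (durations from spike positions, gathering encoder frames at $a_i$, and zero-filled up-sampling) can be sketched in plain Python. This is a minimal illustration rather than the authors' implementation: the function names and toy values are hypothetical, the frame positions are assumed to be 1-indexed (as suggested by $a_0 := 0$), and the Conformer outputs are replaced by hand-written numbers.

```python
def durations_from_alignment(a):
    """d_i = a_i - a_{i-1} with a_0 := 0; a must be strictly increasing."""
    durations, prev = [], 0
    for pos in a:
        if pos <= prev:
            raise ValueError("alignment must be strictly increasing and positive")
        durations.append(pos - prev)
        prev = pos
    return durations


def gather_frames(e, a):
    """Down-sample N frame-rate rows to M phoneme-rate rows by picking row a_i."""
    return [list(e[pos - 1]) for pos in a]  # assumption: positions are 1-indexed


def upsample_with_zeros(z0, a, n_frames):
    """Put row i of z0 at frame a_i and leave every other frame at zero."""
    width = len(z0[0])
    z_tilde = [[0.0] * width for _ in range(n_frames)]
    for row, pos in zip(z0, a):
        z_tilde[pos - 1] = list(row)
    return z_tilde


# Toy example: N = 5 frames, M = 2 phonemes, hypothetical spike positions.
a = [2, 5]
e = [[0.1, 0.2], [1.0, 1.1], [0.3, 0.4], [0.5, 0.6], [2.0, 2.1]]  # stand-in for encoder output
z0 = [[7.0], [9.0]]                                               # stand-in latent code

print(durations_from_alignment(a))     # [2, 3]
print(gather_frames(e, a))             # [[1.0, 1.1], [2.0, 2.1]]
print(upsample_with_zeros(z0, a, 5))   # [[0.0], [7.0], [0.0], [0.0], [9.0]]
```

Note that the durations sum to $a_M$, so the alignment can be recovered from $d$ by a cumulative sum.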
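
The three densities above combine into the per-example objective $\mathcal{L}_{\phi,\psi}(y, a)$. The sketch below evaluates each term on toy numbers, under stated assumptions: $\sigma$ is treated as a variance (following the "log-variance" wording), the expectation over $q_\phi$ is replaced by a single sample, and `y_tilde` is a hand-written stand-in for the decoder output. All function names are hypothetical.

```python
import math
import random


def sample_latent(mu, log_var, rng):
    """Reparameterized draw z0 = mu + sqrt(sigma) * eps, with sigma = exp(log_var) as variance."""
    return [
        [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu_row, lv_row)]
        for mu_row, lv_row in zip(mu, log_var)
    ]


def kl_to_standard_normal(mu, log_var):
    """Closed-form KL(N(mu, sigma) || N(0, 1)), summed over all M x D_latent entries."""
    return sum(
        0.5 * (math.exp(lv) + m * m - 1.0 - lv)
        for mu_row, lv_row in zip(mu, log_var)
        for m, lv in zip(mu_row, lv_row)
    )


def laplace_log_likelihood(y, y_tilde, b):
    """log p(y | z0, a): sum over (j, k) of -log(2b) - |y_jk - y~_jk| / b."""
    return sum(
        -math.log(2.0 * b) - abs(y_val - yt_val) / b
        for y_row, yt_row in zip(y, y_tilde)
        for y_val, yt_val in zip(y_row, yt_row)
    )


rng = random.Random(0)
mu = [[0.0, 1.0], [0.5, -0.5]]        # hypothetical posterior means (M = 2, D_latent = 2)
log_var = [[0.0, 0.0], [-1.0, -1.0]]  # hypothetical log-variances
z0 = sample_latent(mu, log_var, rng)  # would be up-sampled and fed to the decoder

y = [[0.0, 1.0], [2.0, 3.0]]          # toy "spectrogram" (N = 2, D_mel = 2)
y_tilde = [[0.5, 1.0], [2.0, 2.0]]    # stand-in for the decoder output given z0

kl = kl_to_standard_normal(mu, log_var)
log_lik = laplace_log_likelihood(y, y_tilde, b=1.0)
elbo = -kl + log_lik                  # single-sample estimate of L_{phi,psi}(y, a)
print(kl, log_lik, elbo)
```

Up to the constant $-\log(2b)$ terms, the Laplace log-likelihood is a negative L1 reconstruction error scaled by $1/b$, so $b$ sets the balance between the reconstruction term and the KL term.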

- Adversarial Training

With the objective above alone, the VAE generates spectrograms that lack high-frequency detail
- An adversarial loss is therefore added to ensure high-fidelity reconstruction by the DiffVoice VAE
- Adversarial training is applied after the VAE has converged, and is carried out through a spectrogram decoder
  1. The spectrogram decoder produces a spectrogram residual through a randomly initialized stack of 2D convolutions and a linear projection
  2. The 2D convolutions are regularized by spectral norm regularization and interleaved with leaky ReLU activations
  3. The discriminator is a stack of 2D convolutions with spectral norm regularization, interleaved with leaky ReLU activations
- Stochastic map을 $(y, a) \to ˆ y <math xmlns="http://www.w3.org/1998/Math/MathML"><mo stretchy="false">(</mo><mi>y</mi><mo>,</mo><mi>a</mi><mo stretchy="false">)</mo><mo stretchy="false">\to</mo><mrow data-mjx-texclass="ORD"><mover><mi>y</mi><mo stretchy="false">^</mo></mover></mrow></math>$ 로, generator를 $G (y, a) <math xmlns="http://www.w3.org/1998/Math/MathML"><mrow data-mjx-texclass="ORD"><mi data-mjx-variant="-tex-calligraphic" mathvariant="script">G</mi></mrow><mo stretchy="false">(</mo><mi>y</mi><mo>,</mo><mi>a</mi><mo stretchy="false">)</mo></math>$ 로, discriminator를 $D (\cdot) <math xmlns="http://www.w3.org/1998/Math/MathML"><mrow data-mjx-texclass="ORD"><mi data-mjx-variant="-tex-calligraphic" mathvariant="script">D</mi></mrow><mo stretchy="false">(</mo><mo>\cdot</mo><mo stretchy="false">)</mo></math>$ 라 하자
  1. Adversarial training을 위해 least square loss $L G, L D <math xmlns="http://www.w3.org/1998/Math/MathML"><msub><mrow data-mjx-texclass="ORD"><mi data-mjx-variant="-tex-calligraphic" mathvariant="script">L</mi></mrow><mrow data-mjx-texclass="ORD"><mrow data-mjx-texclass="ORD"><mi data-mjx-variant="-tex-calligraphic" mathvariant="script">G</mi></mrow></mrow></msub><mo>,</mo><msub><mrow data-mjx-texclass="ORD"><mi data-mjx-variant="-tex-calligraphic" mathvariant="script">L</mi></mrow><mrow data-mjx-texclass="ORD"><mrow data-mjx-texclass="ORD"><mi data-mjx-variant="-tex-calligraphic" mathvariant="script">D</mi></mrow></mrow></msub></math>$ 와 feature matching loss $L f e a t G <math xmlns="http://www.w3.org/1998/Math/MathML"><msubsup><mrow data-mjx-texclass="ORD"><mi data-mjx-variant="-tex-calligraphic" mathvariant="script">L</mi></mrow><mrow data-mjx-texclass="ORD"><mrow data-mjx-texclass="ORD"><mi data-mjx-variant="-tex-calligraphic" mathvariant="script">G</mi></mrow></mrow><mrow data-mjx-texclass="ORD"><mi>f</mi><mi>e</mi><mi>a</mi><mi>t</mi></mrow></msubsup></math>$ 를 사용하여 $G, D <math xmlns="http://www.w3.org/1998/Math/MathML"><mrow data-mjx-texclass="ORD"><mi data-mjx-variant="-tex-calligraphic" mathvariant="script">G</mi></mrow><mo>,</mo><mrow data-mjx-texclass="ORD"><mi data-mjx-variant="-tex-calligraphic" mathvariant="script">D</mi></mrow></math>$ 를 학습
  2. 결과적으로 adversarial training을 위한 최종 loss $L a d v <math xmlns="http://www.w3.org/1998/Math/MathML"><msub><mrow data-mjx-texclass="ORD"><mi data-mjx-variant="-tex-calligraphic" mathvariant="script">L</mi></mrow><mrow data-mjx-texclass="ORD"><mi>a</mi><mi>d</mi><mi>v</mi></mrow></msub></math>$ 는 $L G, L D, L f e a t G, L V A E <math xmlns="http://www.w3.org/1998/Math/MathML"><msub><mrow data-mjx-texclass="ORD"><mi data-mjx-variant="-tex-calligraphic" mathvariant="script">L</mi></mrow><mrow data-mjx-texclass="ORD"><mrow data-mjx-texclass="ORD"><mi data-mjx-variant="-tex-calligraphic" mathvariant="script">G</mi></mrow></mrow></msub><mo>,</mo><msub><mrow data-mjx-texclass="ORD"><mi data-mjx-variant="-tex-calligraphic" mathvariant="script">L</mi></mrow><mrow data-mjx-texclass="ORD"><mrow data-mjx-texclass="ORD"><mi data-mjx-variant="-tex-calligraphic" mathvariant="script">D</mi></mrow></mrow></msub><mo>,</mo><msubsup><mrow data-mjx-texclass="ORD"><mi data-mjx-variant="-tex-calligraphic" mathvariant="script">L</mi></mrow><mrow data-mjx-texclass="ORD"><mrow data-mjx-texclass="ORD"><mi data-mjx-variant="-tex-calligraphic" mathvariant="script">G</mi></mrow></mrow><mrow data-mjx-texclass="ORD"><mi>f</mi><mi>e</mi><mi>a</mi><mi>t</mi></mrow></msubsup><mo>,</mo><msub><mrow data-mjx-texclass="ORD"><mi data-mjx-variant="-tex-calligraphic" mathvariant="script">L</mi></mrow><mrow data-mjx-texclass="ORD"><mi>V</mi><mi>A</mi><mi>E</mi></mrow></msub></math>$ 의 weighted sum:
    
    $L G := E (y, a) [\sum j, k (D j, k (G (y, a)) - 1) 2] <math xmlns="http://www.w3.org/1998/Math/MathML"><msub><mrow data-mjx-texclass="ORD"><mi data-mjx-variant="-tex-calligraphic" mathvariant="script">L</mi></mrow><mrow data-mjx-texclass="ORD"><mrow data-mjx-texclass="ORD"><mi data-mjx-variant="-tex-calligraphic" mathvariant="script">G</mi></mrow></mrow></msub><mo>:=</mo><msub><mrow data-mjx-texclass="ORD"><mi mathvariant="bold">E</mi></mrow><mrow data-mjx-texclass="ORD"><mo stretchy="false">(</mo><mi>y</mi><mo>,</mo><mi>a</mi><mo stretchy="false">)</mo></mrow></msub><mrow data-mjx-texclass="INNER"><mo data-mjx-texclass="OPEN">[</mo><munder><mo data-mjx-texclass="OP">\sum</mo><mrow data-mjx-texclass="ORD"><mi>j</mi><mo>,</mo><mi>k</mi></mrow></munder><mo stretchy="false">(</mo><msub><mrow data-mjx-texclass="ORD"><mi data-mjx-variant="-tex-calligraphic" mathvariant="script">D</mi></mrow><mrow data-mjx-texclass="ORD"><mi>j</mi><mo>,</mo><mi>k</mi></mrow></msub><mo stretchy="false">(</mo><mrow data-mjx-texclass="ORD"><mi data-mjx-variant="-tex-calligraphic" mathvariant="script">G</mi></mrow><mo stretchy="false">(</mo><mi>y</mi><mo>,</mo><mi>a</mi><mo stretchy="false">)</mo><mo stretchy="false">)</mo><mo>-</mo><mn>1</mn><msup><mo stretchy="false">)</mo><mrow data-mjx-texclass="ORD"><mn>2</mn></mrow></msup><mo data-mjx-texclass="CLOSE">]</mo></mrow></math>$
    $\mathcal{L}^{feat}_{\mathcal{G}} := \mathbf{E}_{(y,a)}\left[\frac{1}{L}\sum_{\ell=1}^{L}\frac{1}{d_{\ell}}\left\|\mathcal{D}^{(\ell)}(y) - \mathcal{D}^{(\ell)}(\mathcal{G}(y,a))\right\|_{1}\right]$
    - The discriminator output is a 2D matrix, and $D_{j,k}$ is its $(j,k)$-th value
    - In the feature matching loss $\mathcal{L}^{feat}_{\mathcal{G}}$, $L$ is the number of layers in $\mathcal{D}$
    - $\mathcal{D}^{(\ell)}$ is the hidden feature map of layer $\ell$, which has $d_{\ell}$ elements
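
The feature matching loss can be sketched in a few lines of NumPy. This is only an illustration for a single $(y, a)$ pair: the function name `feature_matching_loss` and the toy feature maps are assumptions, and the lists stand in for the per-layer discriminator activations $\mathcal{D}^{(\ell)}(y)$ and $\mathcal{D}^{(\ell)}(\mathcal{G}(y,a))$. The expectation over $(y, a)$ would be a mean over a batch.

```python
import numpy as np

def feature_matching_loss(real_feats, fake_feats):
    """Average over layers of the per-element L1 distance between feature maps.

    real_feats / fake_feats: lists with one array per discriminator layer
    (hypothetical stand-ins for D^(l)(y) and D^(l)(G(y, a))).
    """
    assert len(real_feats) == len(fake_feats)
    num_layers = len(real_feats)              # L in the formula
    total = 0.0
    for f_real, f_fake in zip(real_feats, fake_feats):
        d_l = f_real.size                     # number of elements in layer l
        total += np.abs(f_real - f_fake).sum() / d_l
    return total / num_layers

# Toy feature maps for a 2-layer discriminator.
real = [np.ones((2, 3)), np.zeros(4)]
fake = [np.zeros((2, 3)), np.zeros(4)]
loss = feature_matching_loss(real, fake)
```

Dividing by $d_{\ell}$ keeps wide layers from dominating the sum, so each layer contributes its mean absolute difference.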

- Latent Diffusion Model

Once the VAE is fully trained, DiffVoice freezes its weights and uses it to encode speech into latent representations
- To model the integer duration sequence $d$ with a diffusion model:
  1. Sample $u \sim \text{Uniform}[0, 1)^{M}$ to apply uniform dequantization, then define $\tilde{d} = d - u$
    - In addition, with constants $c_{0}, c_{1}$ chosen to normalize the distribution, take $l_{j} := \log(\tilde{d}_{j} + c_{0}) + c_{1}$
  2. Define the concatenation of $l = (l_{j})_{j=1}^{M}$ and $z_{0}$ as $x_{0} := [l; z_{0}] \in \mathbb{R}^{M \times (D_{latent} + 1)}$
  3. The latent diffusion model then aims to sample from the density $p_{0}(x_{0} | w)$
- DiffVoice uses the Variance Preserving SDE for generative modelling
  1. The Itô SDE is:
    $dX_{t} = -\frac{1}{2}\beta(t)X_{t}\,dt + \sqrt{\beta(t)}\,dB_{t}$
    - $X_{t}$ : a random process in $\mathbb{R}^{M \times (D + 1)}$
    - $t \in [0, 1]$, and $B_{t}$ : an $\mathbb{R}^{M \times (D + 1)}$-valued standard Brownian motion
  2. Letting $\bar{\alpha}(t) := \exp\left(-\int_{0}^{t}\beta(s)\,ds\right)$, the transition density of the Itô SDE is:
    $p_{0t}(x_{t} | x_{0}, w) = \mathcal{N}\left(x_{t};\, x_{0}\sqrt{\bar{\alpha}(t)},\, (1 - \bar{\alpha}(t))I\right)$
  3. The text-conditioned score estimator $s_{\theta}(x_{t}, t, w)$ is trained with denoising score matching so that $s_{\theta}(x_{t}, t, w) \approx \nabla_{x_{t}}\log p_{t}(x_{t} | w)$, with a time-dependent weight $\lambda_{t}$ applied
  4. The resulting loss for the SDE process, $\mathcal{L}_{SDE}$, is the $\lambda_{t}$-weighted denoising score matching objective from step 3
- At inference time:
  1. Use the latent diffusion model to sample $x_{0}$ from $p_{0}(x_{0} | w)$
  2. Split $x_{0}$ into $l$ and $z_{0}$
  3. Reconstruct the alignment $a$ from $l$, then decode the log mel-spectrogram $\hat{y}$ from $(z_{0}, a)$ with the spectrogram decoder
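
The steps in this subsection can be illustrated with a small NumPy sketch: encoding durations into $l$, building $x_{0} = [l; z_{0}]$, perturbing it with the transition density, and evaluating a weighted denoising score matching term of the assumed form $\lambda_{t}\|s_{\theta}(x_{t}, t, w) - \nabla_{x_{t}}\log p_{0t}(x_{t} | x_{0}, w)\|_{2}^{2}$. Everything concrete here is an assumption made for illustration: the values of `C0` and `C1`, the linear schedule for $\beta(t)$, the choice $\lambda_{t} = 1 - \bar{\alpha}(t)$, the helper names, and the random `z0` that stands in for the VAE latents. Text conditioning on $w$ is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

C0, C1 = 1.0, 0.0                 # hypothetical normalization constants c0, c1
BETA_MIN, BETA_MAX = 0.1, 20.0    # assumed linear schedule beta(t)
D_LATENT = 8                      # assumed latent dimension

def encode_durations(d, rng):
    """Uniform dequantization followed by the log transform: d -> l."""
    u = rng.uniform(0.0, 1.0, size=d.shape)   # u ~ Uniform[0, 1)^M
    d_tilde = d - u
    return np.log(d_tilde + C0) + C1

def decode_durations(l):
    """Invert the transform; d_tilde lies in (d - 1, d], so ceil recovers d."""
    d_tilde = np.exp(l - C1) - C0
    # The small offset guards against round-off when u is (almost) zero.
    return np.ceil(d_tilde - 1e-9).astype(int)

def alpha_bar(t):
    """exp(-int_0^t beta(s) ds) for the assumed linear beta(s)."""
    return np.exp(-(BETA_MIN * t + 0.5 * (BETA_MAX - BETA_MIN) * t ** 2))

def perturb(x0, t, rng):
    """Sample x_t ~ N(sqrt(ab) x0, (1 - ab) I) and return the DSM target (t > 0)."""
    ab = alpha_bar(t)
    xt = np.sqrt(ab) * x0 + np.sqrt(1.0 - ab) * rng.standard_normal(x0.shape)
    target = -(xt - np.sqrt(ab) * x0) / (1.0 - ab)   # grad of log p_0t(x_t | x0)
    return xt, target

def dsm_loss(score_fn, x0, t, rng):
    """One-sample weighted denoising score matching term at time t."""
    xt, target = perturb(x0, t, rng)
    lam = 1.0 - alpha_bar(t)                  # assumed weight lambda_t
    return lam * np.sum((score_fn(xt, t) - target) ** 2)

# Build x0 = [l; z0] for M = 4 phonemes.
d = np.array([3, 1, 7, 2])
l = encode_durations(d, rng)
z0 = rng.standard_normal((d.size, D_LATENT))  # stand-in for VAE latents
x0 = np.concatenate([l[:, None], z0], axis=1)

# Inference-side split: recover durations from the first column.
d_rec = decode_durations(x0[:, 0])

# Sanity check: if the data were a single point x0, the true score equals the target.
def true_score(xt, t):
    ab = alpha_bar(t)
    return -(xt - np.sqrt(ab) * x0) / (1.0 - ab)

loss = dsm_loss(true_score, x0, 0.5, rng)
```

In a real model `score_fn` would be the text-conditioned network $s_{\theta}$, and the loss would be averaged over random $t$ and training utterances.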

- Solving Inverse Problems with DiffVoice

Assume $o = \mathcal{A}(x_{0}) \in \mathbb{R}^{O}$, where $\mathcal{A}$ is differentiable. Then:
$\nabla_{x_{t}}\log p_{t}(x_{t} | o, w) = \nabla_{x_{t}}\log p_{t}(x_{t} | w) + \nabla_{x_{t}}\log p_{t}(o | x_{t}, w)$
To sample from $p_{0}(x_{0} | o, w)$, an estimator of $\nabla_{x_{t}}\log p_{t}(o | x_{t}, w)$ is added
- Here, $\pi_{\theta}(x_{t}, t, w)$, which approximates $\mathbf{E}[x_{0} | x_{t}, w]$, is:
  $\pi_{\theta}(x_{t}, t, w) := \frac{1}{\sqrt{\bar{\alpha}(t)}}\left(x_{t} + (1 - \bar{\alpha}(t))\,s_{\theta}(x_{t}, t, w)\right)$
- With a weighting function $\xi(t) : [0, 1] \to [0, \infty)$:
  $\nabla_{x_{t}}\log p_{t}(o | x_{t}, w) \approx -\xi(t)\,\nabla_{x_{t}}\left\|\mathcal{A}(\pi_{\theta}(x_{t}, t, w)) - \mathcal{A}(x_{0})\right\|_{2}^{2}$
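
As a rough sketch of the guidance term above, the snippet below computes $-\xi(t)\nabla_{x_{t}}\|\mathcal{A}(\pi_{\theta}(x_{t}, t)) - o\|_{2}^{2}$ on a toy problem. Several things are assumptions for illustration only: the score is the analytic score of a standard normal prior (so $s(x_{t}, t) = -x_{t}$) instead of a trained $s_{\theta}$, $\mathcal{A}$ simply keeps the first two rows of $x_{0}$, the $\beta(t)$ schedule is linear, and the gradient uses central differences where a real implementation would use automatic differentiation. Text conditioning on $w$ is omitted.

```python
import numpy as np

def alpha_bar(t, beta_min=0.1, beta_max=20.0):
    """exp(-int_0^t beta(s) ds) for an assumed linear beta(s)."""
    return np.exp(-(beta_min * t + 0.5 * (beta_max - beta_min) * t ** 2))

def pi_theta(score_fn, xt, t):
    """Estimate of E[x0 | x_t] built from the score."""
    ab = alpha_bar(t)
    return (xt + (1.0 - ab) * score_fn(xt, t)) / np.sqrt(ab)

def guidance(score_fn, A, o, xt, t, xi, h=1e-5):
    """-xi * grad_{x_t} ||A(pi_theta(x_t, t)) - o||_2^2 via central differences."""
    def residual(x):
        return np.sum((A(pi_theta(score_fn, x, t)) - o) ** 2)
    grad = np.zeros_like(xt)
    for idx in np.ndindex(xt.shape):
        step = np.zeros_like(xt)
        step[idx] = h
        grad[idx] = (residual(xt + step) - residual(xt - step)) / (2.0 * h)
    return -xi * grad

# Toy setup: standard normal prior, observation = first two rows of x0.
rng = np.random.default_rng(0)
score_fn = lambda x, t: -x          # analytic score of N(0, I) under the VP SDE
A = lambda x: x[:2]                 # hypothetical observation operator
x0 = rng.standard_normal((3, 2))
o = A(x0)
t, xi = 0.5, 1.0
xt = rng.standard_normal((3, 2))

g = guidance(score_fn, A, o, xt, t, xi)
guided_score = score_fn(xt, t) + g  # estimate of grad log p_t(x_t | o)

# Closed form for this toy case: pi_theta(x_t) = sqrt(alpha_bar) * x_t.
ab = alpha_bar(t)
expected = np.zeros_like(xt)
expected[:2] = -xi * 2.0 * np.sqrt(ab) * (np.sqrt(ab) * xt[:2] - o)
```

Only the observed rows receive a non-zero correction in this toy case, because the assumed prior treats entries as independent; with a learned score, the gradient through $\pi_{\theta}$ would also move the unobserved entries.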

Hyperparameter settings of the DiffVoice Conformer block

3. Experiments

- Settings

Dataset : LJSpeech, LibriTTS
Comparisons : FastSpeech2, VITS, Grad-TTS

- Results

Single Speaker Text-to-Speech
- Single-speaker TTS quality is compared on the LJSpeech dataset
- In terms of MOS, DiffVoice achieves the best synthesis quality

Multi-Speaker Text-to-Speech
- Multi-speaker TTS quality is compared on the LibriTTS dataset
- Here as well, DiffVoice achieves the best synthesis quality

Zero-shot adaptation experiments are run using utterance-level X-vectors
- For this, DiffVoice uses prompt-based zero-shot adaptation
- DiffVoice also shows the best synthesis quality in the zero-shot setting

Text-based Speech Editing
- As in RetrieverTTS, performance on text-based speech inpainting is evaluated
- DiffVoice also performs well on speech inpainting
