[Paper Review] FlashSpeech: Efficient Zero-Shot Speech Synthesis


feVeRin 2024. 11. 24. 11:34

FlashSpeech: Efficient Zero-Shot Speech Synthesis

Recent large-scale zero-shot speech synthesis systems are built on language models and diffusion models, so they are computationally intensive and slow at generation time
FlashSpeech
- Introduces adversarial consistency training on top of a latent consistency model
- Improves prosody diversity through a prosody generator module
Paper (MM 2024) : Paper Link

1. Introduction

In Text-to-Speech (TTS), zero-shot synthesis aims to reproduce the characteristics of an unseen speaker from a reference audio segment without any additional training
- Recent zero-shot TTS models are typically based on a Language Model (LM) or a diffusion model
- BUT, these approaches require many iterations during generation
  1. For example, VALL-E uses an autoregressive LM that predicts a sequence of 75 audio tokens for 1 second of speech
  2. NaturalSpeech, which is based on a non-autoregressive latent diffusion model, still requires 150 sampling steps
  3. As a result, LMs and diffusion models can produce human-like speech, but at a considerable computational cost
    - The root cause is their autoregressive nature or the need for many denoising steps
- Meanwhile, models such as VoiceBox and CLaM-TTS adopt techniques such as flow-matching and a mel-codec to speed up synthesis
  - BUT, they remain hard to use in practical applications and still carry computational overhead

-> FlashSpeech is therefore proposed to accelerate synthesis while maintaining the quality of existing TTS models

FlashSpeech
- Adopts a Latent Consistency Model (LCM) to speed up synthesis
  - Specifically, building on the non-autoregressive TTS system NaturalSpeech, a neural audio codec encoder converts the speech waveform into latent vectors, which are used as the training target
- To train FlashSpeech, applies Adversarial Consistency Training, which uses pre-trained speech language models such as WavLM and HuBERT as the discriminator
- Additionally, a prosody generator reflects diverse expression and prosody while maintaining stability

< Overview of FlashSpeech >

The LCM is trained from scratch with Adversarial Consistency Training, and the prosody generator improves diversity
As a result, FlashSpeech achieves fast synthesis and high quality in the zero-shot scenario

2. Method

- Overview

FlashSpeech aims to improve speech synthesis efficiency, reaching $\mathcal{O}(1)$ computation cost while maintaining performance comparable to existing models that cost $\mathcal{O}(T)$ or $\mathcal{O}(N)$
- Structurally, it consists of a neural codec, a phoneme/prompt encoder, a prosody generator, and an LCM
  - The conditional discriminator is used only during training
- FlashSpeech adopts the in-context learning paradigm of VALL-E and segments the latent vector $z$ extracted by the codec into $z_{target}$ and $z_{prompt}$
  1. The encoder then produces hidden features from the phonemes and $z_{prompt}$, and the prosody generator predicts pitch and duration from those hidden features
  2. The pitch and duration embeddings are combined with the hidden features and passed to the LCM as the conditional feature
    - The LCM is trained from scratch with adversarial consistency training
- As a result, after training, FlashSpeech can synthesize speech in only 1~2 sampling steps

- Latent Consistency Model

A consistency model is a generative model capable of one-step/few-step generation
- Given the data distribution $p_{data}(x)$, a consistency model aims to learn a function that maps every point on a PF-ODE (probability flow ODE) trajectory to the origin of that trajectory:
  (Eq. 1) $f(x_{σ}, σ) = x_{σ_{min}}$
  - $f(\cdot, \cdot)$ : consistency function, $σ_{min}$ : fixed small positive number
  - $x_{σ}$ : data perturbed by zero-mean Gaussian noise with standard deviation $σ$
- Then $x_{σ_{min}}$ can be viewed as an approximate sample from the data distribution $p_{data}(x)$, and to satisfy (Eq. 1) the consistency model is parameterized as:
  (Eq. 2) $f_{θ}(x_{σ}, σ) = c_{skip}(σ) x_{σ} + c_{out}(σ) F_{θ}(x_{σ}, σ)$
  - $f_{θ}$ : estimate of the consistency function $f$ learned from data
  - $F_{θ}$ : deep neural network with parameters $θ$
  - $c_{skip}(σ), c_{out}(σ)$ : differentiable functions that guarantee the boundary condition, $c_{skip}(σ_{min}) = 1, c_{out}(σ_{min}) = 0$
- A valid consistency model must satisfy the self-consistency property:
  (Eq. 3) $f_{θ}(x_{σ}, σ) = f_{θ}(x_{σ'}, σ'), \forall σ, σ' \in [σ_{min}, σ_{max}]$
  - $σ_{max} = 80, σ_{min} = 0.002$
- The model can then generate in one step by evaluating:
  (Eq. 4) $x_{σ_{min}} = f_{θ}(x_{σ_{max}}, σ_{max})$
  - $x_{σ_{max}} \sim \mathcal{N}(0, σ_{max}^{2} I)$
- To apply the consistency model in the latent space of audio, the paper uses the latent feature $z$ extracted before the residual quantization layer of the codec:
  (Eq. 5) $z = \text{CodecEncoder}(y)$
  - $y$ : speech waveform
- In addition, the feature obtained from the prosody generator is added as the conditional feature $c$, giving the following objective:
  (Eq. 6) $f_{θ}(z_{σ}, σ, c) = f_{θ}(z_{σ'}, σ', c), \forall σ, σ' \in [σ_{min}, σ_{max}]$
- At inference, the synthesized waveform $\hat{y}$ is obtained from $\hat{z}$ through the codec decoder
  1. The predicted $\hat{z}$ is obtained with one sampling step:
    (Eq. 7) $\hat{z} = f_{θ}(ϵ * σ_{max}, σ_{max})$
  2. Two sampling steps can also be used:
    (Eq. 8) $\hat{z}_{inter} = f_{θ}(ϵ * σ_{max}, σ_{max})$
    (Eq. 9) $\hat{z} = f_{θ}(\hat{z}_{inter} + ϵ * σ_{inter}, σ_{inter})$
    - $\hat{z}_{inter}$ : intermediate step, $σ_{inter}$ : set to 2
    - $ϵ$ : value sampled from a standard Gaussian distribution
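
The parameterization in (Eq. 2) and the one-step/two-step sampling in (Eq. 7)-(Eq. 9) can be sketched in plain Python. This is only an illustration under stated assumptions: the concrete forms of `c_skip` and `c_out` follow the EDM-style choice commonly used with consistency models and are not given in this review, `SIGMA_DATA = 0.5` is an assumed value, and `toy_network` is a hypothetical stand-in for $F_{θ}$. The conditional feature `c` from (Eq. 6) is passed explicitly.

```python
import math
import random

SIGMA_MIN, SIGMA_MAX = 0.002, 80.0  # values quoted in the text
SIGMA_DATA = 0.5                     # assumption: not stated in the text


def c_skip(sigma):
    # Assumed EDM-style form; equals 1 at sigma = SIGMA_MIN.
    return SIGMA_DATA**2 / ((sigma - SIGMA_MIN) ** 2 + SIGMA_DATA**2)


def c_out(sigma):
    # Assumed EDM-style form; equals 0 at sigma = SIGMA_MIN.
    return SIGMA_DATA * (sigma - SIGMA_MIN) / math.sqrt(SIGMA_DATA**2 + sigma**2)


def toy_network(z, sigma, c):
    # Hypothetical stand-in for F_theta: ignores the noisy input and returns the condition.
    return list(c)


def f_theta(z, sigma, c, network=toy_network):
    # (Eq. 2): skip connection on the noisy input plus a scaled network output.
    out = network(z, sigma, c)
    return [c_skip(sigma) * zi + c_out(sigma) * oi for zi, oi in zip(z, out)]


def sample(c, steps=1, sigma_inter=2.0, rng=random):
    # (Eq. 7): one step from pure noise scaled to sigma_max.
    z = [rng.gauss(0.0, 1.0) * SIGMA_MAX for _ in c]
    z_hat = f_theta(z, SIGMA_MAX, c)
    if steps == 2:
        # (Eq. 8)-(Eq. 9): re-noise the intermediate estimate, then denoise again.
        z = [zi + rng.gauss(0.0, 1.0) * sigma_inter for zi in z_hat]
        z_hat = f_theta(z, sigma_inter, c)
    return z_hat
```

With these assumed coefficients, `f_theta(z, SIGMA_MIN, c)` returns `z` unchanged, which is the boundary condition stated for $c_{skip}$ and $c_{out}$.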

- Adversarial Consistency Training

An LCM normally requires pre-training a diffusion-based teacher model in a first stage and then obtaining the final model through distillation, which makes the training process complicated and limits performance because of the distillation
- FlashSpeech therefore introduces Adversarial Consistency Training, which trains the LCM from scratch and removes the dependency on a teacher model
- Adversarial Consistency Training consists of three main parts
Consistency Training
- First, to satisfy the property in (Eq. 3), the following consistency loss is adopted:
  (Eq. 10) $L_{ct}^{N}(θ, θ^{-}) = E[λ(σ_{i}) d(f_{θ}(z_{i+1}, σ_{i+1}, c), f_{θ^{-}}(z_{i}, σ_{i}, c))]$
  - $σ_{i}$ : noise level at discrete time step $i$, $d(\cdot, \cdot)$ : distance function
  - $f_{θ}(z_{i+1}, σ_{i+1}, c), f_{θ^{-}}(z_{i}, σ_{i}, c)$ : the student at the higher noise level and the teacher at the lower noise level
- The discrete time steps $σ_{min} = σ_{0} < σ_{1} < \dots < σ_{N} = σ_{max}$ divide the interval $[σ_{min}, σ_{max}]$, and the discretization curriculum $N$ grows as the training step increases:
  (Eq. 11) $N(k) = \min(s_{0} 2^{\lfloor k / K' \rfloor}, s_{1}) + 1$
  - $K' = \lfloor \frac{K}{\log_{2} \lfloor s_{1} / s_{0} \rfloor + 1} \rfloor$
  - $k$ : current training step, $K$ : total training steps, $s_{1}, s_{0}$ : hyperparameters that control the size of $N(k)$
- The distance function $d(\cdot, \cdot)$ is the Pseudo-Huber metric:
  (Eq. 12) $d(x, y) = \sqrt{\| x - y \|^{2} + a^{2}} - a$
  - $a$ : adjustable constant
  - Compared with the $ℓ_{2}$ loss, it imposes a smaller penalty on large errors, making training robust to outliers
- The teacher model parameter $θ^{-}$ is:
  (Eq. 13) $θ^{-} \leftarrow \text{stopgrad}(θ)$
  - Identical to the student parameter $θ$
- The weighting function is:
  (Eq. 14) $λ(σ_{i}) = \frac{1}{σ_{i+1} - σ_{i}}$
  - Emphasizes the loss at smaller noise levels
- With this consistency training, the LCM can generate speech of acceptable quality in only a few steps, but it does not reach the quality of existing models
  - The paper therefore introduces adversarial training to further improve sample quality
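
The pieces of the consistency loss, (Eq. 11), (Eq. 12), and (Eq. 14), are simple enough to write out directly. The sketch below is a plain-Python illustration, not the paper's code: vectors are lists of floats, the function names are made up for this example, and the student/teacher outputs are assumed to be computed elsewhere.

```python
import math


def num_steps(k, K, s0, s1):
    # (Eq. 11): the step count doubles every K' training steps and is capped at s1.
    K_prime = math.floor(K / (math.log2(math.floor(s1 / s0)) + 1))
    return min(s0 * 2 ** (k // K_prime), s1) + 1


def pseudo_huber(x, y, a):
    # (Eq. 12): roughly quadratic for small gaps, roughly linear for large gaps.
    sq = sum((xi - yi) ** 2 for xi, yi in zip(x, y))
    return math.sqrt(sq + a * a) - a


def weight(sigma_i, sigma_next):
    # (Eq. 14): narrower noise intervals receive a larger weight.
    return 1.0 / (sigma_next - sigma_i)


def consistency_term(student_out, teacher_out, sigma_i, sigma_next, a):
    # One summand of (Eq. 10); teacher_out is treated as a constant (stopgrad, Eq. 13).
    return weight(sigma_i, sigma_next) * pseudo_huber(student_out, teacher_out, a)
```

For instance, `num_steps(0, 600000, 10, 1280)` evaluates to 11 and `num_steps(600000, 600000, 10, 1280)` to 1281, so the discretization becomes progressively finer over training.
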
Adversarial Training
- For the adversarial objective, the generated sample $\hat{z} \leftarrow f_{θ}(z_{σ}, σ, c)$ and the real sample $z$ are passed to the discriminator $D_{η}$
  - The discriminator aims to distinguish between the two
  - $η$ : trainable parameters
- The adversarial training loss is then:
  (Eq. 15) $L_{adv}(θ, η) = E_{z}[\log D_{η}(z)] + E_{σ} E_{z_{σ}}[\log(1 - D_{η}(f_{θ}(z_{σ}, σ, c)))]$
- Specifically, the discriminator is built from a frozen pre-trained speech language model $SLM$ and a trainable lightweight discriminator head $D_{head}$
  1. Because $SLM$ is trained on speech waveforms, the codec decoder converts $z, \hat{z}$ into the ground-truth waveform and the predicted waveform
  2. To improve the similarity between the prompt audio and the generated audio, the discriminator is conditioned on the prompt audio feature
  3. The prompt feature $F_{prompt}$ is extracted by applying $SLM$ to the prompt audio and then average-pooled along the time axis
- As a result:
  (Eq. 16) $D_{η} = D_{head}(F_{prompt} ⊙ F_{gt}, F_{prompt} ⊙ F_{pred})$
  - $F_{gt}, F_{pred}$ : features of the ground-truth/predicted waveform obtained through $SLM$
  - The discriminator head consists of several 1D convolution layers, and the input features of the discriminator are conditioned on $F_{prompt}$
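
The prompt conditioning in (Eq. 16) and the per-sample value of (Eq. 15) can be illustrated with a small plain-Python sketch. It assumes features are stored as a list of frames, each frame a list of channel values (`[T][C]`), and it leaves out the $SLM$ and the convolutional head entirely; the function names are hypothetical.

```python
import math


def mean_pool(frames):
    # Average over the time axis: [T][C] -> [C].
    t = len(frames)
    return [sum(frame[ch] for frame in frames) / t for ch in range(len(frames[0]))]


def condition(frames, prompt_vec):
    # Element-wise product of every frame with the pooled prompt vector (the ⊙ in Eq. 16).
    return [[p * v for p, v in zip(prompt_vec, frame)] for frame in frames]


def adversarial_loss(d_real, d_fake):
    # (Eq. 15) for one real/generated pair; both inputs are discriminator outputs in (0, 1).
    return math.log(d_real) + math.log(1.0 - d_fake)
```

In this layout the pooled prompt vector acts as a per-channel gate on both the ground-truth and the predicted features before they reach the discriminator head.
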
Combined Together
- There is a loss scale gap between the consistency loss and the adversarial loss, which can cause training instability and failure
- FlashSpeech therefore applies the following adaptive weight:
  (Eq. 17) $λ_{adv} = \frac{\| \nabla_{θ_{L}} L_{ct}^{N}(θ, θ^{-}) \|}{\| \nabla_{θ_{L}} L_{adv}(θ, η) \|}$
  - $θ_{L}$ : last layer of the LCM network
- The final training loss of the LCM is then $L_{ct}^{N}(θ, θ^{-}) + λ_{adv} L_{adv}(θ, η)$
  - The adaptive weighting balances the gradient scale of each term and stabilizes training
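
As a rough illustration of (Eq. 17), the sketch below computes the adaptive weight from two already-computed last-layer gradients, given here as flat lists of floats. How those gradients are obtained depends on the training framework and is not shown. The `eps` term is an added numerical safeguard and an assumption of this sketch, not part of the formula in the text.

```python
import math


def l2_norm(grad):
    # Euclidean norm of a flattened gradient.
    return math.sqrt(sum(g * g for g in grad))


def adaptive_weight(grad_ct, grad_adv, eps=1e-8):
    # (Eq. 17): ratio of last-layer gradient norms; eps is an assumption of this sketch.
    return l2_norm(grad_ct) / (l2_norm(grad_adv) + eps)


def total_loss(loss_ct, loss_adv, grad_ct, grad_adv):
    # Final objective: L_ct + lambda_adv * L_adv.
    return loss_ct + adaptive_weight(grad_ct, grad_adv) * loss_adv
```

When the adversarial gradient is much smaller than the consistency gradient, the weight grows, so the two terms contribute gradients of comparable scale at the last layer.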

- Prosody Generator

Analysis of Prosody Prediction
- The prosody prediction used in FastSpeech2 and similar models fails to capture the expressiveness of human speech prosody because of its unimodal distribution assumption and deterministic mapping
  - This leads to over-smoothed predictions that lack variation
- Meanwhile, diffusion methods such as VoiceBox can provide prosody diversity but suffer from stability problems
  - In addition, iterative inference makes them hard to use in real-time applications
  - Likewise, LM-based methods such as Mega-TTS and VALL-E need a long time for inference
- FlashSpeech therefore improves prosody diversity for one-step consistency model sampling using a prosody regression module and prosody refinement
Prosody Refinement via Consistency Model
- The prosody generator of FlashSpeech consists of prosody regression and prosody refinement
  1. First, the prosody regression module produces a deterministic output
  2. Then the parameters of the prosody regression module are frozen, and the residual between the ground-truth prosody and the deterministic predicted prosody is set as the training target of prosody refinement
    - A consistency model is used as the prosody refinement module
  3. The conditional feature of this consistency model is the feature from the prosody regression module before its final projection layer
  4. The residual from the stochastic sampler thus refines the output of the deterministic prosody regression and can generate plausible prosody from the same transcription and audio prompt
- The final prosody output $p_{final}$ can then be represented as:
  (Eq. 18) $p_{final} = p_{res} + p_{init}$
  - $p_{final}$ : final prosody output, $p_{res}$ : residual output of the prosody refinement module
  - $p_{init}$ : initial deterministic prosody prediction of the prosody regression module
- BUT, the formulation in (Eq. 18) can negatively affect prosody stability
  1. Specifically, higher diversity can reduce stability and produce unnatural prosody
  2. The paper therefore introduces a control factor $α$ that finely tunes the balance between stability and diversity in the prosodic output:
    (Eq. 19) $p_{final} = α p_{res} + p_{init}$
    - $α$ : scalar value in $[0, 1]$
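
The combination in (Eq. 18) and (Eq. 19) amounts to a scaled residual add. A minimal sketch, assuming prosody values (for example per-phoneme pitch) are stored as lists of floats and using a hypothetical function name:

```python
def final_prosody(p_init, p_res, alpha=1.0):
    # (Eq. 19); alpha = 1 recovers (Eq. 18), alpha = 0 keeps only the deterministic prediction.
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha is expected to lie in [0, 1]")
    return [pi + alpha * pr for pi, pr in zip(p_init, p_res)]
```

Lower values of `alpha` pull the output toward the deterministic regression result, trading diversity for stability.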

3. Experiments

- Settings

Dataset : Multilingual LibriSpeech (MLS)
Comparisons : VALL-E, VoiceBox, NaturalSpeech, Mega-TTS, CLaM-TTS
Hyperparameter
- $a$ in (Eq. 12) is set to $0.03$
- In (Eq. 10), $σ_{i} = \left(σ_{min}^{1/ρ} + \frac{i - 1}{N(k) - 1} (σ_{max}^{1/ρ} - σ_{min}^{1/ρ})\right)^{ρ}$
  - $i \in [1, N(k)], ρ = 7, σ_{min} = 0.002, σ_{max} = 80$
- For $N(k)$ in (Eq. 11), $s_{0} = 10, s_{1} = 1280, K = 600\text{k}$
  - In the second stage, $s_{1} = 160, K = 150\text{k}$
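
The noise schedule for $σ_{i}$ above can be evaluated directly with the listed hyperparameters. The sketch below is illustrative only; the function name is made up, and it uses 11 levels, which is what (Eq. 11) gives at $k = 0$ with $s_{0} = 10$.

```python
def noise_levels(n, sigma_min=0.002, sigma_max=80.0, rho=7.0):
    # sigma_i for i = 1..n, interpolated linearly in sigma^(1/rho) space.
    lo, hi = sigma_min ** (1 / rho), sigma_max ** (1 / rho)
    return [(lo + (i - 1) / (n - 1) * (hi - lo)) ** rho for i in range(1, n + 1)]


sigmas = noise_levels(11)  # 11 = N(0) when s0 = 10
gaps = [b - a for a, b in zip(sigmas, sigmas[1:])]  # spacing between adjacent levels
```

The levels run from $σ_{min}$ to $σ_{max}$, and the spacing between neighbours widens as the noise level grows, so the low-noise region is sampled more densely.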

- Results

FlashSpeech has the lowest RTF (real-time factor) while showing excellent synthesis quality (MOS)

FlashSpeech is also preferred in terms of user preference

Ablation Studies of LCM
- Adopting WavLM as the discriminator gives the best UTMOS and Sim-O

Using $2$ sampling steps for the LCM gives the best results

Ablation Studies of Prosody Generator
- The value of $α$ creates a trade-off between prosody diversity and speech intelligibility

For duration as well, $α$ creates a trade-off between diversity and stability

Voice Conversion
- FlashSpeech is compared with DDDM-VC and YourTTS on voice conversion
- FlashSpeech again shows the best performance
