[Paper Review] DualSpeech: Enhancing Speaker-Fidelity and Text-Intelligibility Through Dual Classifier-Free Guidance

feVeRin 2024. 12. 14. 10:20

DualSpeech: Enhancing Speaker-Fidelity and Text-Intelligibility Through Dual Classifier-Free Guidance

Achieving an optimal balance between speaker-fidelity and text-intelligibility under diverse control demands is difficult
DualSpeech
- Introduces phoneme-level latent diffusion and dual classifier-free guidance
- Improves fidelity and intelligibility through sophisticated control
Paper (INTERSPEECH 2024) : Paper Link

1. Introduction

Text-to-Speech (TTS) should synthesize a human-like spectrum by emulating characteristics of human speech such as speaker identity, speech rhythm, tone, and language
- A TTS model must therefore capture timbre, speaking style, and similar attributes effectively for high speaker-fidelity while maintaining strong text-intelligibility
- BUT, balancing speaker-fidelity and text-intelligibility is difficult
  - Focusing on the voice degrades speech clarity (text-intelligibility), while focusing on clarity produces speech that lacks character and so degrades speaker-fidelity
- To address this:
  1. Representation disentanglement within the generative model is needed to separate the various aspects of speech and manage them independently
    - In this regard, NANSY supports self-supervised reconstruction from disentangled features
    - It provides interpretable features such as the linguistic feature, fundamental frequency $f_0$, periodic/aperiodic amplitude, and timbre
  2. In addition, a diffusion-based generative model should support independent conditioning and control by applying Classifier-Free Guidance (CFG)
    - VoiceBox and VoiceLDM, for example, support enhanced condition manipulation through CFG
    - In particular, the CFG mechanism of VoiceLDM can manipulate the environmental and content conditions independently

-> The paper therefore proposes DualSpeech, which uses dual CFG to control speaker-fidelity and text-intelligibility

DualSpeech
- To achieve high controllability with dual CFG, a reference conditioner and a text conditioner are introduced first
  - These networks model a prior latent that is highly dependent on the reference and the text
- At inference, the dual CFG weights are then selected to manipulate the prosody of the generated speech

< Overview of DualSpeech >

A TTS model that introduces dual classifier-free guidance to manipulate speaker-fidelity and text-intelligibility
As a result, it achieves better synthesis performance than existing models

2. Method

DualSpeech consists of NANSY, a Variational AutoEncoder (VAE), and a Latent Diffusion Model (LDM)
- Unlike existing TTS models that generate a mel-spectrogram, DualSpeech uses NANSY features
  1. To extract these NANSY features, the paper adopts a pre-trained NANSY, in line with NANSY-TTS, which generates the linguistic feature, $f_0$, periodic amplitude, and aperiodic amplitude
  2. The aligner is trained to align the NANSY linguistic feature with phonemes through Monotonic Alignment Search (MAS)
- Building on these pre-trained models, DualSpeech is trained in two stages: VAE training and LDM training
  1. The VAE reconstructs the NANSY features from the given speech and phonemes
  2. The LDM generates the VAE latent from the given transcription and reference speech

- Phoneme-Level Variational AutoEncoder

The VAE of DualSpeech reconstructs the NANSY features from an input comprising the IPA sequence converted from the text and the NANSY features of the speech
- The VAE uses a phoneme-level bottleneck similar to DiffVoice
  - The bottleneck is implemented through the cross-attention mechanism of a Transformer encoder that uses the phoneme encoder output as the query and the concatenated NANSY features as the key and value
  - At the phoneme-level bottleneck, the posterior latent is sampled using the estimated mean and variance
- This approach has the following strengths over a frame-level model:
  1. A phoneme-based representation is a symbolic representation of speech sounds, so it can reliably convey semantic information
  2. The computational complexity of a Transformer encoder scales as $\mathcal{O}(L^2)$, so operating on the shorter phoneme-level sequence is computationally more efficient than a frame-level model
    - $L$ : sequence length
- Structurally, the VAE decoder consists of a latent decoder, duration predictor, phoneme prosody decoder, upsampler, and frame decoder
  1. The latent variable is first passed to the latent decoder, whose output is then passed in parallel to the duration predictor, phoneme prosody decoder, and upsampler
    - The duration predictor and the phoneme prosody decoder estimate the duration and $f_0$ at the phoneme level, respectively
  2. In the upsampler, the phoneme-level output of the latent decoder is upsampled to a frame-level sequence following the pre-trained MAS aligner
    - The upsampler architecture is similar to the upsampler of Parallel Tacotron2
  3. The upsampled frame-level feature is then input to the frame decoder, which reconstructs the NANSY features
- In addition, the paper improves the VAE through adversarial training
  - The discriminator is a simple convolutional network trained with a least-square loss and a feature-matching loss
- As a result, the VAE is trained with the NANSY feature reconstruction loss, phoneme-level $f_0$ reconstruction loss, duration loss, KL-divergence on the latent, and adversarial loss
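
As a rough illustration of the phoneme-level bottleneck, the sketch below samples a posterior latent from an estimated mean and variance and computes the KL term of the VAE loss. It assumes a diagonal Gaussian posterior parameterized by log-variance and a standard normal prior, which the review does not state; `sample_posterior` and `kl_to_standard_normal` are hypothetical names, and the latent size is illustrative only.

```python
import math
import random


def sample_posterior(mean, log_var, rng):
    """Draw z = mean + std * noise per latent dimension (reparameterization)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mean, log_var)]


def kl_to_standard_normal(mean, log_var):
    """KL( N(mean, exp(log_var)) || N(0, 1) ), summed over dimensions."""
    return 0.5 * sum(m * m + math.exp(lv) - lv - 1.0
                     for m, lv in zip(mean, log_var))


rng = random.Random(0)
# Estimated statistics for one phoneme's latent (4 dimensions, assumed size).
mean = [0.3, -0.1, 0.0, 1.2]
log_var = [-1.0, -0.5, 0.0, -2.0]

z = sample_posterior(mean, log_var, rng)
kl = kl_to_standard_normal(mean, log_var)
print(len(z), kl >= 0.0)
```

In a full model this would run once per phoneme, so the latent sequence has the length of the phoneme sequence rather than the frame sequence.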

- Phoneme-Level Latent Diffusion Model

The LDM of DualSpeech is trained to estimate the phoneme-level posterior latent produced by the pre-trained VAE above
- Working at the phoneme level reduces the computation needed for iterative denoising, the bottleneck of diffusion models, and so greatly reduces the computational demand compared to a frame-level model
  - To generate the prior latent with the LDM while achieving speaker similarity and naturalness, the paper introduces a conditional diffusion model with conditioners and dual CFG
- The conditioner consists of a reference conditioner and a text conditioner, which inject conditional information on both the reference speaker and the text and produce phoneme-wise conditions
  1. The two conditioners share the input from the context encoder
    - Here the context encoder is a Transformer encoder with cross-attention that models context-aware features derived from the outputs of the phoneme encoder and the context embedding
  2. A pre-trained XLM-RoBERTa is used to obtain the context embedding
- Because the text conditioner depends only on the text input, fine-grained intelligibility is controlled by adjusting $\omega_{text}$, the CFG weight on the text conditioner output $c_{text}$
- The reference conditioner produces speaker-aware phoneme-wise conditioning to improve zero-shot capability
  1. A Retriever is integrated into the reference conditioner to capture speaker style from the reference speech and support zero-shot ability
    - The reference speech is sampled from the target speaker's subset, corrupted with noise, and then cut to a random length to prevent a training-inference mismatch
  2. The NANSY features of the reference speech are passed to the cross-attention mechanism of the retriever encoder
    - The cross-attention queries are prototype-like fixed-length tokens, and the paper sets their number to $60$
    - As a result, the transformer output is a set of fixed-length tokens that encapsulate the style of the reference speech
  3. In addition, the reference conditioner uses these speaker tokens as the cross-attention value to encode the speaker-related condition
    - As with $\omega_{text}$ for the text conditioner, speaker similarity can be modulated by adjusting $\omega_{spk}$, the CFG weight on the reference conditioner output $c_{spk}$
- The diffusion model of DualSpeech is based on a transformer encoder similar to the DiT architecture
  1. Instead of DiT's adaptive layer norm, the condition is added after two MLP layers, similar to DiffWave
  2. The LDM is then trained with an $L_1$ loss, as in WaveGrad:
    (Eq. 1) $\mathcal{L} = \left\| \epsilon - \epsilon_\theta\left(\sqrt{\bar{\alpha}_t}\,\mu + \sqrt{1 - \bar{\alpha}_t}\,\epsilon, t, c_{spk}, c_{text}\right) \right\|_1$
    - $\epsilon \sim \mathcal{N}(0, I)$ : noise, $\epsilon_\theta$ : diffusion model
    - $\mu$ : mean estimated by the VAE, $t$ : timestep, $\bar{\alpha}_t$ : noise coefficient at time $t$
  3. To enable CFG at inference, random dropout is applied to $c_{text}, c_{spk}$
    - $c_{text}$ is dropped $5\%$ of the time and $c_{spk}$ $10\%$ of the time, and an additional $10\%$ dropout is applied to both, which raises the frequency of the null-conditioned scenario
  4. Training uses discrete integer diffusion timesteps and a noise schedule
    - $t$ is sampled uniformly from $[1, T]$, with $T = 200$
    - A linear variance schedule defined by $\beta_i = \beta_1 + (\beta_T - \beta_1)(i - 1)/(T - 1)$ is used, with $\beta_1 = 0.0001, \beta_T = 0.03$
    - The noise coefficient is computed as $\bar{\alpha}_t = \prod_{i=1}^{t} \alpha_i = \prod_{i=1}^{t} (1 - \beta_i)$
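
The training recipe above can be sketched in plain Python: the linear variance schedule with $T = 200$, the noise coefficient as a running product, the noised latent fed to the model, the $L_1$ objective of Eq. 1, and condition dropout. This is a minimal sketch under assumptions, not the paper's implementation: `toy_eps_model` is a hypothetical stand-in for the diffusion transformer, the zero list plays the role of the null condition, and treating the three dropout cases as mutually exclusive is one possible reading of the stated rates.

```python
import math
import random

T = 200
BETA_1, BETA_T = 1e-4, 0.03


def linear_betas(num_steps=T, beta_1=BETA_1, beta_t=BETA_T):
    """beta_i = beta_1 + (beta_T - beta_1) * (i - 1) / (T - 1) for i = 1..T."""
    return [beta_1 + (beta_t - beta_1) * (i - 1) / (num_steps - 1)
            for i in range(1, num_steps + 1)]


def cumulative_alpha_bar(betas):
    """alpha_bar_t = prod_{i<=t} (1 - beta_i), returned for every t."""
    out, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        out.append(running)
    return out


def q_sample(mu, eps, a_bar):
    """Noised latent z_t = sqrt(a_bar) * mu + sqrt(1 - a_bar) * eps."""
    return [math.sqrt(a_bar) * m + math.sqrt(1.0 - a_bar) * e
            for m, e in zip(mu, eps)]


def l1_loss(eps, eps_pred):
    """L1 norm of the noise-prediction error, as in Eq. 1."""
    return sum(abs(a - b) for a, b in zip(eps, eps_pred))


def drop_conditions(c_spk, c_text, rng, p_text=0.05, p_spk=0.10, p_both=0.10):
    """Replace conditions by zeros (assumed: the three cases are exclusive)."""
    u = rng.random()
    null_spk, null_text = [0.0] * len(c_spk), [0.0] * len(c_text)
    if u < p_both:
        return null_spk, null_text
    if u < p_both + p_text:
        return c_spk, null_text
    if u < p_both + p_text + p_spk:
        return null_spk, c_text
    return c_spk, c_text


def toy_eps_model(z_t, t, c_spk, c_text):
    """Hypothetical placeholder network: always predicts zero noise."""
    return [0.0 for _ in z_t]


def training_loss(mu, c_spk, c_text, eps_model, alpha_bar, rng):
    t = rng.randint(1, len(alpha_bar))            # uniform over [1, T]
    eps = [rng.gauss(0.0, 1.0) for _ in mu]       # eps ~ N(0, I)
    z_t = q_sample(mu, eps, alpha_bar[t - 1])
    c_spk, c_text = drop_conditions(c_spk, c_text, rng)
    return l1_loss(eps, eps_model(z_t, t, c_spk, c_text))


ALPHA_BAR = cumulative_alpha_bar(linear_betas())
rng = random.Random(0)
loss = training_loss([0.5, -0.2, 0.1], [1.0, 0.0], [0.0, 1.0],
                     toy_eps_model, ALPHA_BAR, rng)
print(len(ALPHA_BAR), loss >= 0.0)
```

Because the model sees zeroed conditions during training, the same network can later be queried with one or both conditions removed, which is what guidance at inference relies on.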

- Dual Classifier-Free Guidance for TTS

DualSpeech generates the latent with fine control between the text and reference conditions
With dual CFG, TTS in DualSpeech can be represented as:
(Eq. 2) $\tilde{\epsilon}_\theta(z_t, t, c_{spk}, c_{text}) = \epsilon_\theta(z_t, t, c_{spk}, c_{text}) + \omega_{spk}\left(\epsilon_\theta(z_t, t, c_{spk}, \emptyset) - \epsilon_\theta(z_t, t, \emptyset, \emptyset)\right) + \omega_{text}\left(\epsilon_\theta(z_t, t, \emptyset, c_{text}) - \epsilon_\theta(z_t, t, \emptyset, \emptyset)\right)$

- $\tilde{\epsilon}_\theta$ : classifier-free guided noise
- $z_t$ : latent at timestep $t$, $z_t = \sqrt{\bar{\alpha}_t}\,\mu + \sqrt{1 - \bar{\alpha}_t}\,\epsilon$
- $\emptyset$ : zero-tensor for the null-conditioned state
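Note that VoiceLDM, which also uses dual CFG, conditions on an acoustic environment description rather than on speaker style
1. This constraint stems from its reliance on CLAP as well as on captions of speaker style
2. By conditioning on the reference speaker and the text instead, DualSpeech allows granular manipulation of speech synthesis and can address the balance between text intelligibility and speaker similarity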
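
The guided noise estimate can be sketched as four evaluations of the same network, combined with the two weights. The sketch assumes that the text-guidance term mirrors the speaker-guidance term (same form, with $c_{text}$ kept and $c_{spk}$ nulled) and that the null condition is a zero list; `dual_cfg` and `toy_eps_model` are hypothetical names, and the toy model is linear only so the effect of each weight is easy to follow.

```python
def dual_cfg(eps_model, z_t, t, c_spk, c_text, w_spk, w_text):
    """Combine conditional and null-conditioned noise estimates with two weights."""
    null_spk = [0.0] * len(c_spk)    # zero stand-in for the null speaker condition
    null_text = [0.0] * len(c_text)  # zero stand-in for the null text condition

    e_full = eps_model(z_t, t, c_spk, c_text)
    e_spk = eps_model(z_t, t, c_spk, null_text)
    e_text = eps_model(z_t, t, null_spk, c_text)
    e_null = eps_model(z_t, t, null_spk, null_text)

    return [f + w_spk * (s - n) + w_text * (x - n)
            for f, s, x, n in zip(e_full, e_spk, e_text, e_null)]


def toy_eps_model(z_t, t, c_spk, c_text):
    """Hypothetical linear stand-in for the trained diffusion model."""
    return [0.5 * z + 2.0 * s + 3.0 * x for z, s, x in zip(z_t, c_spk, c_text)]


z_t, c_spk, c_text = [0.0], [1.0], [1.0]
plain = dual_cfg(toy_eps_model, z_t, 1, c_spk, c_text, w_spk=0.0, w_text=0.0)
more_speaker = dual_cfg(toy_eps_model, z_t, 1, c_spk, c_text, w_spk=1.0, w_text=0.0)
more_text = dual_cfg(toy_eps_model, z_t, 1, c_spk, c_text, w_spk=0.0, w_text=1.0)
print(plain, more_speaker, more_text)
```

With both weights at zero the result is the plain conditional estimate; raising one weight pushes the estimate further along the direction that condition alone contributes, which is how the two qualities are traded off independently.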

- Inference

Only the LDM, VAE decoder, and NANSY synthesizer are used at inference
- The LDM first generates the phoneme-level latent through iterative denoising
  - Fast sampling is adopted with the following variance noise schedule:
  $[1e{-}4, 5e{-}4, 1e{-}3, 5e{-}3, 0.01, 0.02, 0.05, 0.2, 0.3, 0.5, 0.4, 0.3, 0.3, 0.2, 0.1, 0.1]$
- The generated prior latent is processed by the VAE decoder, which includes frame-level upsampling and estimation of the NANSY linguistic feature, $f_0$, and amplitudes
- Finally, the NANSY synthesizer synthesizes the raw waveform
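
To make the fast-sampling step concrete, the sketch below runs a 16-step reverse loop over the quoted schedule and compares its final noise level with that of the 200-step linear training schedule. It assumes a standard DDPM ancestral update with the posterior variance, which the review does not specify, so the paper's sampler may differ; `toy_eps_model` is a hypothetical placeholder for the trained LDM and the latent size is illustrative.

```python
import math
import random

# Fast-sampling variance schedule quoted above (16 steps).
INFER_BETAS = [1e-4, 5e-4, 1e-3, 5e-3, 0.01, 0.02, 0.05, 0.2,
               0.3, 0.5, 0.4, 0.3, 0.3, 0.2, 0.1, 0.1]


def linear_betas(num_steps=200, beta_1=1e-4, beta_t=0.03):
    """Linear training schedule, for comparison with the fast schedule."""
    return [beta_1 + (beta_t - beta_1) * (i - 1) / (num_steps - 1)
            for i in range(1, num_steps + 1)]


def cumulative_alpha_bar(betas):
    """Running product of (1 - beta) over the schedule."""
    out, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        out.append(running)
    return out


def toy_eps_model(z, step):
    """Hypothetical placeholder for the (guided) noise predictor."""
    return [0.1 * v for v in z]


def sample_latent(eps_model, dim, betas, rng):
    """Ancestral sampling from pure noise down to step 0 (assumed DDPM update)."""
    alpha_bar = cumulative_alpha_bar(betas)
    z = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    for s in reversed(range(len(betas))):
        beta, a_bar = betas[s], alpha_bar[s]
        a_bar_prev = alpha_bar[s - 1] if s > 0 else 1.0
        eps = eps_model(z, s)
        coef = beta / math.sqrt(1.0 - a_bar)
        mean = [(zi - coef * ei) / math.sqrt(1.0 - beta)
                for zi, ei in zip(z, eps)]
        if s > 0:
            # Posterior standard deviation; no noise is added at the last step.
            sigma = math.sqrt(beta * (1.0 - a_bar_prev) / (1.0 - a_bar))
            z = [m + sigma * rng.gauss(0.0, 1.0) for m in mean]
        else:
            z = mean
    return z


fast_end = cumulative_alpha_bar(INFER_BETAS)[-1]
train_end = cumulative_alpha_bar(linear_betas())[-1]
latent = sample_latent(toy_eps_model, dim=8, betas=INFER_BETAS, rng=random.Random(0))
print(len(INFER_BETAS), round(fast_end, 3), round(train_end, 3), len(latent))
```

Both schedules end at a similar noise level, with the final noise coefficient close to 0.05, so the 16-step loop starts from roughly the same distribution the model was trained to denoise. In the full system the resulting phoneme-level latent would then go through the VAE decoder and the NANSY synthesizer.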

3. Experiments

- Settings

Dataset : LJSpeech, VCTK, HiFi-TTS, LibriTTS
Comparisons : YourTTS, StyleTTS2, HierSpeech++

- Results

Subjective Evaluation
- DualSpeech achieves the best performance in terms of MOS

Objective Evaluation
- DualSpeech also achieves the best performance in terms of WER and CER

DualSpeech also has the fastest inference speed among the compared models
