[Paper 리뷰] DetailTTS: Learning Residual Detail Information for Zero-Shot Text-to-Speech

티스토리 뷰

Paper/TTS

[Paper 리뷰] DetailTTS: Learning Residual Detail Information for Zero-Shot Text-to-Speech

feVeRin 2025. 4. 9. 17:42

DetailTTS: Learning Residual Detail Information for Zero-Shot Text-to-Speech

기존 text-to-speech system은 linguistic, acoustic detail을 omission 하는 경우가 많음
DetailTTS
- Conditional Variational AutoEncoder를 기반으로 하는 zero-shot text-to-speech model
- Alignment 과정에서 missed residual detail information을 capture 하는 Prior Detail module과 Duration Detail module을 도입
논문 (ICASSP 2025) : Paper Link

1. Introduction

Zero-shot Text-to-Speech (TTS) system은 target speaker sample에 대한 training 없이 audio prompt를 사용하여 flexibility를 향상할 수 있음
- 이를 위해 일반적으로 AutoRegressive (AR), Non-AutoRegressive (NAR) 방식을 활용함:
  1. VALL-E, VoiceCraft와 같은 AR method는 current context에서 next output을 sequentially predict 하는 방식으로 동작함
    - BUT, sequential nature로 인해 synthesis speed가 저하되고 word missing/repeating과 같은 stability 문제가 발생할 수 있음
  2. VITS, NaturalSpeech, HierSpeech와 같은 NAR method는 entire sequence를 parallel generate 할 수 있음
    - BUT, single vector로는 individual speaker의 complex characteristic을 효과적으로 capture 하지 못함
- 특히 zero-shot TTS의 경우 text, speech를 directly align 하기 어려우므로, mispronunciation과 over-smoothing 문제가 나타남
  1. 해당 information gap을 줄이기 위해 Wav2Vec 2.0과 같은 self-supervised representation을 활용하거나 additional $F0$ representation을 도입할 수 있지만, computational cost가 증가되는 단점이 있음
  2. 추가적으로 initial learning process에서 residual detail information을 효과적으로 활용하지 못함

-> 그래서 text, speech 간의 information gap을 줄이기 위해, residual detail information을 학습하는 zero-shot TTS model인 DetailTTS를 제안

DetailTTS
- Residual detail information에 대한 additional encoding을 통해 enhance 된 Conditional Variational AutoEncoder (CVAE)를 활용
- Prior Detail module, Duration Detail module을 도입하여 text, speech alignment 중에 missing 되는 residual detail information을 capture

< Overall of DetailTTS >

Residual detail information을 capture 하는 CVAE-based zero-shot TTS model
결과적으로 기존보다 뛰어난 zero-shot 합성 성능을 달성

2. Method

DetailTTS는 VITS의 CVAE와 normalizing flow를 기반으로 residual detail information modeling을 도입함
- 먼저 기존 stochastic duration predictor를 Duration Detail module로 replace 함
  - 이때 learnable duration detail embedding을 incorporate 하여 duration predictor accuracy를 향상함
- 추가적으로 training 중에 linear spectrogram을 additional supercisory signal로 사용하는 Prior Detail module을 적용함
  - 이를 통해 text에서 acoustic information을 directly infer 할 때 missing detail information을 capture 할 수 있도록 함

- Duration Detail Module

기존의 stochastic duration predictor는 text (input phoneme $c_{text}$)로부터 duration distribution을 추정함
- BUT, 이 과정에서 residual detail information이 missing 될 수 있음
- 따라서 논문은 hidden text representation $h_{text}$로부터 duration detail embedding $\widehat{dur}_{detail}$을 학습하는 Duration Detail Encoder를 도입함
  1. 즉, Duration Detail Encoder는 training 중에 duration detail loss $\mathcal{L}_{dur\text{_}detail}$을 minimize 하여 rich residual detail information을 capture 함
  2. 해당 loss는 duration detail embedding $\widehat{dur}_{detail}$과 Monotonic Alignment Search (MAS), high dimensional mapping을 통해 얻어진 duration detail distribution $dur^{*}_{detail}$을 사용하여 compute 됨:
    (Eq. 1) $\mathcal{L}_{dur\text{_}detail}=\text{MSE}(dur_{detail}^{*},\widehat{dur}_{detail})$
    - $\text{MSE}$ : mean square error loss
  3. 추가적으로 MAS로 estimate 된 duration $dur^{*}$와 duration predictor가 predict 한 $\widehat{dur}$ 간의 duration loss $\mathcal{L}_{dur}$는:
    (Eq. 2) $\mathcal{L}_{dur}=\text{MSE}(dur^{*},\widehat{dur})$
- Duration detail loss는 subtle detail information을 capture 하도록 하고, duration loss는 overall duration prediction accuracy가 maintain 되도록 함
- 결과적으로 추론 시에 text-based conditional input 만으로도 Duration Detail Encoder는 additional residual duration detail information을 제공할 수 있음
  - 이를 통해 duration predictor는 accurate prediction을 수행할 수 있으므로, naturalness가 향상됨

- Prior Detail Module

논문은 missing residual detail information을 compensate 하기 위해 Prior Detail Module을 도입함
- 해당 module은 original input (linear spectrogram) 만을 사용해 posterior encoder에 fine-grained prior information을 제공함으로써, reconstruction quality를 향상하는 것을 목표로 함
  - 특히 VITS-based model에서 explict in-context learning의 부족을 compensate 하여 speaker-specific nuance를 효과적으로 capture 하도록 함
- 이를 위해 Prior Detail Encoder는 condition $c_{repeat}$가 주어진 prior detail distribution $prior^{*}_{detail}$을 학습하고 residual connection을 incorporate 함
  1. 즉, Prior Detail Encoder는 linear spectrogram을 supervisory signal로 사용하여 prior detail embedding $\widehat{prior}_{detail}$을 생성함
  2. 이때 loss $\mathcal{L}_{prior\text{_}detail}$은 다음과 같음:
    (Eq. 3) $\mathcal{L}_{prior\text{_}detail}=\text{MSE}(prior_{detail}^{*},\widehat{prior}_{detail})$
    - 결과적으로 해당 방식을 통해 detail information을 incorporate 할 수 있고, posterior encoder는 richer information을 가지는 $c_{detail}$을 사용하여 natural speech를 reconstruct 할 수 있음
- 추가적으로 논문은 다음의 2가지 KL-divergence loss를 고려함:
  (Eq. 4) $\mathcal{L}_{kl1}=\log q_{\phi}(z|x)-\log p_{\theta}(z|c_{repeat})$
  (Eq. 5) $\mathcal{L}_{kl2}=\log q_{\phi}(z|x)-\log p_{\theta}(z|c_{repeat}+c_{detail})$
  (Eq. 6) $z\sim q_{\phi}(z|x)=\mathcal{N}(z;\mu_{\phi}(x),\sigma_{\phi}(x))$
  (Eq. 7) $p_{\theta}(z|c)=\mathcal{N}(f(z);\mu_{\theta}(c),\sigma_{\theta}(c))\left|\det\frac{\partial f(z)}{\partial z}\right|$
  - $p_{\theta}(z|c)$ : condition $c$가 주어졌을 때 latent variable $z$의 prior distribution
  - $q_{\phi}(z|x)$ : data $x$의 approximate posterior distribution
  - $f(z)$ : variable $z$에 대한 normalizing flow
  - 여기서 $c$는 $c_{text},c_{repeat}, c_{repeat}+c_{detail}$ 중 하나가 될 수 있음
- 결과적으로 DetailTTS의 synthesizer는 data $\log p_{\theta}(x|c)$의 intractable marginal log-likelihood에 대한 Evidence Lower BOund (ELBO)를 maximize 함:
  (Eq. 8) $\log p_{\theta}(x|c)\geq \mathbb{E}_{q_{\phi}(z|x)}\left[\log p_{\theta}(x|z) - \log \frac{q_{\phi}(z|x)}{p_{\theta}(z|c_{repeat})}-\log \frac{q_{\phi}(z|x)}{p_{\theta}(z|c_{repeat}+c_{detail})}\right]$

- Total Loss

Training을 위한 final total loss는:
(Eq. 9) $\mathcal{L}_{total}=\mathcal{L}_{adv}(G)+\mathcal{L}_{fm}(G)+\mathcal{L}_{recon}+ \mathcal{L}_{kl1}+\mathcal{L}_{kl2}+\mathcal{L}_{dur}+\mathcal{L}_{dur\text{_}detail}+ \mathcal{L}_{prior\text{_}detail}$
- $\mathcal{L}_{adv}(G), \mathcal{L}_{fm}(G)$ : adversarial training loss
- $\mathcal{L}_{recon}$ : reconstruction loss

3. Experiments

- Settings

Dataset : WeNetSpeech4TTS
Comparisons : VALL-E, NaturalSpeech2

- Results

전체적으로 DetailTTS의 성능이 가장 뛰어남

$t$-SNE 측면에서도 speaker embedding과 ground-truth는 closely cluster 되어 나타남

Ablation Study
- 각 component를 제거하는 경우 성능 저하가 발생함

'Paper > TTS' 카테고리의 다른 글

[Paper 리뷰] LEF-TTS: Lightweight and Efficient End-to-End Text-to-Speech Synthesis with Multi-Stream Generator (0)	2025.04.18
[Paper 리뷰] Stable-TTS: Stable Speaker-Adaptive Text-to-Speech Synthesis via Prosody Prompting (0)	2025.04.15
[Paper 리뷰] UMETTS: A Unified Framework for Emotional Text-to-Speech Synthesis with Multimodal Prompts (0)	2025.04.03
[Paper 리뷰] VoiceGuider: Enhancing Out-of-Domain Performance in Parameter-Efficient Speaker-Adaptive Text-to-Speech via Autoguidance (0)	2025.04.02
[Paper 리뷰] NanoVoice: Efficient Speaker-Adaptive Text-to-Speech for Multiple Speakers (0)	2025.03.26

최근에 올라온 글

최근에 달린 댓글

« 2025/07 »
일	월	화	수	목	금	토
		1	2	3	4	5
6	7	8	9	10	11	12
13	14	15	16	17	18	19
20	21	22	23	24	25	26
27	28	29	30	31

Total

Today

Yesterday

Let IT Begin

티스토리 뷰

[Paper 리뷰] DetailTTS: Learning Residual Detail Information for Zero-Shot Text-to-Speech

DetailTTS: Learning Residual Detail Information for Zero-Shot Text-to-Speech

1. Introduction

2. Method

- Duration Detail Module

- Prior Detail Module

- Total Loss

3. Experiments

- Settings

- Results

'Paper > TTS' 카테고리의 다른 글

티스토리툴바