[Paper Review] Evidential-TTS: High Fidelity Zero-Shot Text-to-Speech Using Evidential Deep Learning

feVeRin 2025. 4. 23. 17:50

Evidential-TTS: High Fidelity Zero-Shot Text-to-Speech Using Evidential Deep Learning

Evidential Deep Learning can be applied to zero-shot text-to-speech
Evidential-TTS
- Uses Iterative Parallel Decoding to convert an aligned phoneme sequence into acoustic tokens
- Introduces model uncertainty, obtained through Evidential Deep Learning optimization, to provide a reliable sampling path for high-quality speech generation
Paper (ICASSP 2025) : Paper Link

1. Introduction

Audio generation can combine Masked Language Modeling (MLM) with Iterative Parallel Decoding (IPD), as in SoundStorm
- In particular, this generative modeling approach can be used for the Text-to-Speech (TTS) task together with neural audio codecs such as SoundStream and HiFi-Codec
  1. In the IPD framework, the model is initialized with every audio token masked and, given a conditioning signal, progressively fills the masked slots over multiple iterations
    - Masked slots are filled according to frame confidence, proceeding from high-confidence frames to low-confidence frames
  2. That is, the goal is to generate confident frames first and then generate ambiguous frames based on the previously generated context
    - Because IPD sampling can be processed in parallel, it provides bidirectional context and fast inference
- BUT, IPD-based methods use the categorical probability of the predicted class as the confidence score
  - This confidence score only reflects data uncertainty (aleatoric uncertainty)
- Consequently the model tends to be overconfident on unseen data, which makes robust zero-shot inference difficult
  - Addressing this requires taking model uncertainty into account
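
The confidence-ordered filling described above can be sketched in plain Python. This is a minimal illustration under assumptions, not the paper's implementation: `toy_predict` is a hypothetical stand-in for the trained model that returns a random token and a random confidence per frame, `MASK` is a made-up mask id, and the cosine unmasking schedule is borrowed from common masked-decoding practice rather than taken from the paper.

```python
import math
import random

MASK = -1  # hypothetical id marking a still-masked frame


def toy_predict(tokens, rng):
    """Hypothetical model stand-in: one (token, confidence) pair per frame."""
    return [(rng.randrange(8), rng.random()) for _ in tokens]


def iterative_parallel_decode(num_frames, num_iters, predict, seed=0):
    """Fill all frames in num_iters parallel steps, most confident frames first."""
    rng = random.Random(seed)
    tokens = [MASK] * num_frames  # start from a fully masked sequence
    for it in range(num_iters):
        masked = [t for t in range(num_frames) if tokens[t] == MASK]
        if not masked:
            break
        preds = predict(tokens, rng)  # all frames are predicted in parallel
        # Assumed cosine schedule: number of frames left masked after this step.
        keep_masked = math.floor(
            num_frames * math.cos(math.pi / 2 * (it + 1) / num_iters)
        )
        n_fill = max(1, len(masked) - keep_masked)
        # Commit only the most confident masked frames in this step.
        masked.sort(key=lambda t: preds[t][1], reverse=True)
        for t in masked[:n_fill]:
            tokens[t] = preds[t][0]
    return tokens


decoded = iterative_parallel_decode(num_frames=20, num_iters=4, predict=toy_predict)
```

The only place where the confidence definition matters is the sort key, so swapping the categorical probability for another score changes the sampling path without touching the rest of the loop.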

-> The paper therefore proposes Evidential-TTS, a zero-shot TTS model that reflects model uncertainty based on Evidential Deep Learning

Evidential-TTS
- Integrates an IPD-based audio generation model with a length regulating module to support end-to-end TTS
- Additionally optimizes the model with Evidential Deep Learning (EDL) to address the overconfidence problem
  - EDL quantifies model uncertainty, which is then used as the confidence score that guides the sampling trajectory

< Overview of Evidential-TTS >

A zero-shot TTS model that uses an IPD-based generation model and EDL
As a result, it achieves better synthesis performance than existing methods

2. Background

- Evidential Deep Learning for Uncertainty Quantification

Evidential Deep Learning (EDL) provides a framework for quantifying model uncertainty
- Unlike conventional approaches that output the probability logits of a categorical distribution, an EDL-based classification model predicts the parameter $\alpha$ of a Dirichlet distribution
  1. Here the Dirichlet distribution models the distribution over output probabilities:
    (Eq. 1) $\mu\sim Dir(\alpha)$
    - $\alpha=[\alpha_{1},\alpha_{2},...,\alpha_{C}]$ : Dirichlet parameter vector in which every $\alpha_{i}$ is positive
    - $C$ : number of classes
  2. The expected class probability $\bar{\mu}$ is then:
    (Eq. 2) $\bar{\mu}=\mathbb{E}[\mu]=\frac{\alpha}{S}$
    - $S=\sum_{c=1}^{C}\alpha_{c}$ : Dirichlet strength
  3. The predicted class $c_{p}$ is obtained as $c_{p}=\arg\max_{c}\bar{\mu}_{c}$
- With EDL, the total uncertainty is estimated by computing the Shannon entropy of the predicted class probability $\bar{\mu}$
- This total uncertainty decomposes into data uncertainty and model (epistemic, distributional) uncertainty
  1. Data uncertainty arises from input noise, while model uncertainty arises from the difference between training and test data
  2. Both can be computed from the Dirichlet parameter $\alpha$:
    (Eq. 3) $U_{data}=\sum_{c=1}^{C}\bar{\mu}_{c}\left(\psi(S+1)-\psi(\alpha_{c}+1)\right)$
    (Eq. 4) $U_{model}=-\sum_{c=1}^{C}\bar{\mu}_{c}\left(\log \bar{\mu}_{c}+\psi(S+1)-\psi(\alpha_{c}+1)\right)$
    - $\psi(\cdot)$ : digamma function
    - Adding (Eq. 3) and (Eq. 4) recovers the total uncertainty $-\sum_{c=1}^{C}\bar{\mu}_{c}\log\bar{\mu}_{c}$
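
As a worked check of (Eq. 2) to (Eq. 4), the following sketch computes the three uncertainties from a Dirichlet parameter vector using only the standard library. The helper names `digamma` and `dirichlet_uncertainties` are made up for this illustration, and `digamma` is a simple approximation (recurrence plus an asymptotic series), not a reference implementation.

```python
import math


def digamma(x):
    """Approximate digamma for x > 0: recurrence up to x >= 6, then a series."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x  # psi(x) = psi(x + 1) - 1 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    series = inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252))
    return result + math.log(x) - 0.5 / x - series


def dirichlet_uncertainties(alpha):
    """Return (mean, U_total, U_data, U_model) for Dirichlet parameters alpha."""
    strength = sum(alpha)                      # S in (Eq. 2)
    mean = [a / strength for a in alpha]       # expected class probabilities
    # Total uncertainty: Shannon entropy of the expected probabilities.
    u_total = -sum(m * math.log(m) for m in mean)
    # (Eq. 3): data uncertainty.
    u_data = sum(
        m * (digamma(strength + 1.0) - digamma(a + 1.0))
        for m, a in zip(mean, alpha)
    )
    # (Eq. 4): model uncertainty, computed directly rather than as a difference.
    u_model = -sum(
        m * (math.log(m) + digamma(strength + 1.0) - digamma(a + 1.0))
        for m, a in zip(mean, alpha)
    )
    return mean, u_total, u_data, u_model


weak = dirichlet_uncertainties([1.0, 1.0, 1.0])       # little evidence
strong = dirichlet_uncertainties([10.0, 10.0, 10.0])  # same mean, more evidence
```

Both parameter vectors give the same uniform mean and therefore the same total uncertainty, but the vector with the larger Dirichlet strength has the smaller model uncertainty, which is the distinction a probability-only confidence score cannot see.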

3. Method

- Problem Formulation

Evidential-TTS aims to generate acoustic tokens from a pre-aligned phoneme sequence, guided by an acoustic prompt that captures the characteristics of a specific speaker
- That is, given alignment information $\mathbf{A}$ representing phoneme durations, the paper formulates this sequence-to-sequence (seq2seq) task in a masked language modeling framework
- For acoustic tokenization it uses the Group Residual Vector Quantization (GRVQ) of HiFi-Codec
  1. Let $\mathbf{Y}\in\mathbb{R}^{T\times Q}$ denote the acoustic tokens
    - $T$ : number of frames, $Q$ : quantization depth
  2. With $\mathbf{Y}_{M}$ as the masked target acoustic tokens and $\mathbf{Y}_{U}$ as the unmasked tokens, G-MLM, a masking strategy for GRVQ acoustic tokens, can be applied
  3. The model is then trained to maximize the log-likelihood of the masked acoustic tokens, i.e. to minimize:
    (Eq. 5) $\mathcal{L}_{MLM}=-\sum_{\forall y \in\mathbf{Y}_{M}}\log P(y|\mathbf{Y}_{U},\mathbf{A},\mathbf{c},\omega_{a})$
    - $\mathbf{c}$ : phoneme sequence, $\omega_{a}$ : acoustic prompt
    - The acoustic prompt provides detailed acoustic attributes of the prompt speaker (timbre, acoustic conditions, etc.)
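
(Eq. 5) amounts to a negative log-likelihood summed over the masked positions only. The sketch below shows this for a single token stream. It assumes the per-frame probabilities have already been produced by a model conditioned on the unmasked tokens, alignment, phonemes and prompt; the function name `masked_nll` and the toy numbers are illustrative only.

```python
import math


def masked_nll(probs, targets, mask):
    """Negative log-likelihood over masked frames only.

    probs[t][c] : predicted probability of token c at frame t
    targets[t]  : ground-truth token id at frame t
    mask[t]     : True if frame t was masked (and therefore contributes)
    """
    return -sum(
        math.log(probs[t][targets[t]])
        for t in range(len(targets))
        if mask[t]
    )


probs = [[0.5, 0.5], [0.9, 0.1], [0.2, 0.8]]
targets = [0, 0, 1]
mask = [True, False, True]  # the middle frame is unmasked and is ignored
loss = masked_nll(probs, targets, mask)
```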

- Model Overview

Length Regulating Module
- During training, the alignment $\mathbf{A}$ is obtained with Monotonic Alignment Search (MAS), which minimizes the Mean Squared Error (MSE) between the aligned text encoder representation $\omega_{p}$ and the mel-spectrogram $\mathbf{x}$
  - The alignment loss is $\mathcal{L}_{align} =\text{MSE}(\omega_{p},\mathbf{x})$
- At inference, the Stochastic Duration Predictor (SDP) of VITS is adopted to reproduce text-token durations
  - Its loss $\mathcal{L}_{dur}$ is conditioned on a global speaker embedding obtained from ECAPA-TDNN so that speaker-dependent durations are reflected
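
Once per-phoneme durations are available (from MAS during training or from the duration predictor at inference), length regulation reduces to repeating each phoneme representation for its number of frames. The following is a generic sketch of that expansion, assuming integer frame counts; `length_regulate` is a hypothetical helper, not a function from the paper's code.

```python
def length_regulate(phoneme_features, durations):
    """Repeat each phoneme feature durations[i] times to reach frame resolution."""
    if len(phoneme_features) != len(durations):
        raise ValueError("one duration is required per phoneme")
    frames = []
    for feature, duration in zip(phoneme_features, durations):
        frames.extend([feature] * duration)  # a duration of 0 drops the phoneme
    return frames


# Strings stand in for encoder vectors; the result has sum(durations) frames.
aligned = length_regulate(["h", "e", "y"], [2, 0, 3])
```
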
Evidential Token Generator
- The evidential token generator is based on G-MLM, which consists of a prompt network and a prediction network
  1. Its cross-attention module incorporates frame-level acoustic information of the prompt speaker (timbre, acoustic details, etc.) through in-context learning
  2. In addition, because Dirichlet parameters must be positive, the softmax activation is replaced with a softplus activation
- Unlike standard masked language modeling, training uses a modified cross-entropy loss: the paper incorporates the Dirichlet distribution as a prior and extends the loss to frame-level masked acoustic tokens
  1. Let $l_{c,t}$ be the one-hot encoding of the ground truth at frame index $t$ (1 for the ground-truth class $c$, 0 otherwise), $\alpha_{c,t}$ the Dirichlet parameter for class $c$ at frame $t$, and $S_{t}=\sum_{c=1}^{C}\alpha_{c,t}$ the Dirichlet strength at frame $t$
  2. The modified cross-entropy loss over masked tokens is then:
    (Eq. 6) $\mathcal{L}_{E\text{-}MLM}=\frac{1}{|M|}\sum_{t\in M}\sum_{c=1}^{C}l_{c,t}\left( \log S_{t}-\log \alpha_{c,t}\right)$
    - $M$ : set of masked token indices, $|M|$ : number of masked tokens
  3. A KL-divergence term over the masked frames is additionally added to the loss to regularize the predictive distribution
  4. The resulting loss for the evidential token generator is:
    (Eq. 7) $\mathcal{L}_{evid}=\mathcal{L}_{E\text{-}MLM}+\lambda_{reg}\cdot \mathcal{L}_{KL}$
    - $\lambda_{reg}$ : regularization coefficient
    - This loss treats the model output logits as evidence and thereby expresses model uncertainty
- Finally, the total loss of Evidential-TTS is:
  (Eq. 8) $\mathcal{L}_{tot}=\mathcal{L}_{evid}+\mathcal{L}_{dur}+\mathcal{L}_{align}$
  - After training with (Eq. 8), Evidential-TTS can express the frame-level uncertainty needed to determine the sampling path
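
A minimal sketch of (Eq. 6) in plain Python follows. Two points are assumptions rather than details stated in this review: the Dirichlet parameter is formed as $\alpha=\text{softplus}(z)+1$ (the common "evidence + 1" convention in EDL), and the KL term of (Eq. 7) is left out because its exact form is not given here. The function names are hypothetical.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x)); always positive."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def evidential_mlm_loss(logits, targets, masked_frames):
    """(Eq. 6): mean over masked frames of log S_t - log alpha_{gt, t}.

    logits[t][c]  : raw network output for class c at frame t
    targets[t]    : ground-truth token id at frame t
    masked_frames : indices of the masked frames (the set M)
    """
    total = 0.0
    for t in masked_frames:
        # Assumption: evidence = softplus(logit), alpha = evidence + 1.
        alpha = [softplus(z) + 1.0 for z in logits[t]]
        strength = sum(alpha)
        total += math.log(strength) - math.log(alpha[targets[t]])
    return total / len(masked_frames)


uninformed = evidential_mlm_loss([[0.0, 0.0, 0.0, 0.0]], [0], [0])
confident = evidential_mlm_loss([[5.0, 0.0, 0.0, 0.0]], [0], [0])
```

With all-zero logits every class receives the same $\alpha$, so the loss equals $\log C$; placing more evidence on the ground-truth class lowers it.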

- Evidential Iterative Parallel Decoding

The paper uses bi-group GRVQ acoustic tokens consisting of two streams: coarse-grained tokens $(Y_{1:T,1},Y_{1:T,2})$ and fine-grained tokens $(Y_{1:T,3},Y_{1:T,4})$
- To reflect the conditional dependency between RVQ hierarchy levels, the acoustic token streams are generated in coarse-to-fine order
  1. IPD is applied to the complex coarse-grained tokens, while the fine-grained tokens are obtained by greedy sampling
  2. In the IPD process, the model progressively fills coarse-grained token frames in arbitrary order through multiple parallel generation steps
    - The number of parallel generation steps $N_{c}$ is much smaller than the total frame length $T$ ($N_{c}\ll T$)
- To ensure a reliable sampling order in IPD, the paper uses the negative frame-level uncertainty $-U_{model}$ as the confidence score instead of the classification probability of the predicted class
  - This lets the evidential token generator build a reliable sampling path, which in turn improves performance
- This confidence metric has the following advantages:
  1. It removes the softmax operator, which exaggerates the predicted class probability by exponentiating the network output, and thus prevents overconfident predictions
  2. Model uncertainty reflects the model's lack of knowledge about an input, so it yields robust predictions on out-of-domain datasets
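
The difference between the two confidence scores can be illustrated with a toy example. The sketch below, which assumes made-up two-class Dirichlet parameters for three frames and uses hypothetical helper names, ranks frames by $-U_{model}$ from (Eq. 4). The `digamma` helper is a standard-library approximation, not a reference implementation.

```python
import math


def digamma(x):
    """Approximate digamma for x > 0: recurrence up to x >= 6, then a series."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    series = inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252))
    return result + math.log(x) - 0.5 / x - series


def model_uncertainty(alpha):
    """U_model from (Eq. 4) for one frame's Dirichlet parameters."""
    strength = sum(alpha)
    return -sum(
        (a / strength)
        * (math.log(a / strength) + digamma(strength + 1.0) - digamma(a + 1.0))
        for a in alpha
    )


def sampling_order(alphas):
    """Frame indices from most to least confident under the score -U_model."""
    return sorted(range(len(alphas)), key=lambda t: model_uncertainty(alphas[t]))


# Made-up frames: identical expected probabilities, different evidence.
frames = [[9.0, 1.0], [90.0, 10.0], [1.8, 0.2]]
prob_confidence = [max(a) / sum(a) for a in frames]  # probability-based score
order = sampling_order(frames)                       # uncertainty-based order
```

All three frames have the same maximum expected probability, so a probability-based score cannot rank them, whereas $-U_{model}$ places the frame backed by the most evidence first.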

4. Experiments

- Settings

Dataset : LibriTTS
Comparisons : VITS, VALL-E X, TokenTransducer

- Results

Overall, Evidential-TTS achieves superior performance

Ablation Study
- Integrating model uncertainty yields a better sampling trajectory than the data uncertainty metric

The probability-based confidence metric fails to distinguish noisy input frames, whereas model uncertainty assigns low confidence to noisy frames and therefore produces a reliable sampling order
