[Paper Review] Factorized-VITS: Decoding Prosody and Text in End-to-End Speech Synthesis without External or Secondary Aligner

feVeRin 2025. 5. 13. 17:19

Factorized-VITS: Decoding Prosody and Text in End-to-End Speech Synthesis without External or Secondary Aligner

Incorporating explicit text-side prosody modeling can improve end-to-end text-to-speech performance
Factorized-VITS
- Cleanly factorizes the audio prior hidden space into text and prosody subspaces
- Performs on-the-fly alignment in the factorized text subspace without extra parameters
Paper (ICASSP 2025) : Paper Link

1. Introduction

Text-to-Speech (TTS) is usually treated as a sequence-to-sequence mapping task, so it faces a one-to-many mapping problem: the same text can be spoken in many ways
- Addressing this requires reflecting additional information such as prosody, speaker identity, and emotion alongside the input text
  - BUT, most TTS models still rely on a reconstruction loss and therefore cannot fully capture the complexity of speech variation
- Alternatively, a stochastic architecture such as Glow-TTS can predict an output distribution instead of a single deterministic point
  1. Such stochastic models generate more diverse speech than deterministic models
  2. In particular, incorporating supervised pitch and energy inputs into a stochastic model such as VITS enables explicit text-side prosody modeling and control
    - BUT, as shown in the figure below, the text-level prosody sequence cannot be obtained before the alignment is computed, which makes on-the-fly alignment search difficult

-> The paper therefore proposes Factorized-VITS, which addresses the explicit text-side prosody modeling problem of VITS

Factorized-VITS
- Adopts the alignment method of VITS instead of an external or secondary aligner
  - This removes the need for additional parameters and complexity
- In particular, factorizes the hidden space and performs alignment search within the text subspace
  - Decoupling prosody from the alignment constraint allows the model to directly learn a frame-level prior that captures higher speech variability

< Overview of Factorized-VITS >

A TTS model that incorporates explicit prosody modeling into VITS
As a result, it achieves better synthesis quality than the existing VITS baseline

2. Method

VITS and Factorized-VITS both follow the variational inference criterion
- The Evidence Lower Bound (ELBO) of each model is:
  (Eq. 1) $\mathcal{L}_{VITS}=\log p(\mathbf{x}|\mathbf{c})=\log ( \int p(\mathbf{x}|\mathbf{z})p(\mathbf{z}|\mathbf{c})d\mathbf{z})$
  $\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,=\log (\int p(\mathbf{x}|\mathbf{z})p(\mathbf{z}|\mathbf{c})\frac{q(\mathbf{z}|\mathbf{x})}{q(\mathbf{z}|\mathbf{x})}d\mathbf{z})$
  $\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\geq \mathbb{E}_{q(\mathbf{z}|\mathbf{x})}\left[\log \frac{p(\mathbf{x}|\mathbf{z})p(\mathbf{z}|\mathbf{c})}{q(\mathbf{z}|\mathbf{x})}\right]$
  $\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\geq \mathbb{E}_{q(\mathbf{z}|\mathbf{x})}\left[\log p(\mathbf{x}|\mathbf{z})\right]-D_{KL}\left( q(\mathbf{z}|\mathbf{x})||p(\mathbf{z}|\mathbf{c})\right)$
  (Eq. 2) $\mathcal{L}_{ours}=\log p(\mathbf{x}|\mathbf{c})=\log (\iint p(\mathbf{x}|\mathbf{z})p(\mathbf{z}|\mathbf{c},\mathbf{f})p(\mathbf{f}|\mathbf{c})d \mathbf{z}d\mathbf{f})$
  $\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,=\log (\int [\int p(\mathbf{x}|\mathbf{z})p(\mathbf{z}|\mathbf{c},\mathbf{f})d\mathbf{z}]p(\mathbf{f}|\mathbf{c}) \frac{q(\mathbf{f}|\mathbf{x})}{q(\mathbf{f}|\mathbf{x})}d\mathbf{f})$
  $\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\geq \mathbb{E}_{q(\mathbf{f}|\mathbf{x})}\left[\log \int p(\mathbf{x}|\mathbf{z})p(\mathbf{z}|\mathbf{c},\mathbf{f})\frac{q(\mathbf{z}|\mathbf{x})}{q(\mathbf{z}|\mathbf{x}) }d\mathbf{z}\right] -D_{KL}\left(q(\mathbf{f}|\mathbf{x})||p(\mathbf{f}|\mathbf{c})\right)$
  $\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\geq \mathbb{E}_{q(\mathbf{z}|\mathbf{x})}\left[\log p(\mathbf{x}|\mathbf{z})\right]-\mathbb{E}_{q(\mathbf{f}|\mathbf{x})}\left[D_{KL}\left( q(\mathbf{z}|\mathbf{x}) || p(\mathbf{z}|\mathbf{c},\mathbf{f})\right)\right]-D_{KL}\left(q(\mathbf{f}|\mathbf{x})|| p(\mathbf{f}|\mathbf{c})\right)$
  - $\mathbf{x}$ : audio, $\mathbf{c}$ : text input, $\mathbf{f}$ : prosody input (pitch, energy)
  - $D_{KL}$ : Kullback-Leibler divergence
- Here, all distributions are assumed to be Gaussian with diagonal covariance
  1. In particular, since $p(\mathbf{z}|\mathbf{c},\mathbf{f})$ is assumed to have diagonal covariance, the factorization $p(\mathbf{z}|\mathbf{c},\mathbf{f})=p(\mathbf{z}_{t}|\mathbf{c})p(\mathbf{z}_{p}|\mathbf{f})$ holds
    - $\mathbf{z}_{t}$ : prior latent variable related to text, $\mathbf{z}_{p}$ : prior latent variable related to prosody
  2. Additionally, explicit prosody modeling is performed in a supervised manner, as in Period VITS
- Therefore, $q(\mathbf{f}|\mathbf{x})$ can be treated as a deterministic operation $q(\mathbf{f}|\mathbf{x})=\delta(\mathbf{f}-\mathbf{f}_{gt})$, and (Eq. 2) simplifies to (Eq. 3):
  (Eq. 3) $\mathcal{L}_{ours}\geq\mathbb{E}_{q(\mathbf{z}|\mathbf{x})}\left[\log p(\mathbf{x}|\mathbf{z})\right]-\mathbb{E}_{q(\mathbf{f}|\mathbf{x})}\left[ D_{KL}\left( q(\mathbf{z}|\mathbf{x})|| p(\mathbf{z}|\mathbf{c},\mathbf{f})\right)\right]-D_{KL}\left( q(\mathbf{f}|\mathbf{x})||p(\mathbf{f}|\mathbf{c})\right)$
  $\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\geq\mathbb{E}_{q(\mathbf{z}|\mathbf{x})}\left[\log p(\mathbf{x}|\mathbf{z})\right]-\mathbb{E}_{q(\mathbf{f}|\mathbf{x})}\left[D_{KL}\left( q(\mathbf{z}|\mathbf{x})|| p(\mathbf{z}_{t}|\mathbf{c})p(\mathbf{z}_{p}|\mathbf{f})\right)\right] -D_{KL}\left(q(\mathbf{f}|\mathbf{x})|| p(\mathbf{f}|\mathbf{c})\right)$
  $\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\geq \mathbb{E}_{q(\mathbf{z}|\mathbf{x})}\left[\log p(\mathbf{x}|\mathbf{z})\right]-D_{KL}\left( q(\mathbf{z}|\mathbf{x})||p(\mathbf{z}_{t}|\mathbf{c})p(\mathbf{z}_{p}|\mathbf{f}_{gt})\right)+ \log p(\mathbf{f}_{gt}|\mathbf{c})$
  - $\mathbf{f}_{gt}$ : ground-truth supervised prosody label
- Comparing (Eq. 3) with (Eq. 1), Factorized-VITS factorizes the prior audio hidden space with 2 latent variables $\mathbf{z}_{t},\mathbf{z}_{p}$ so that the text and prosody prior hidden states are represented separately
  - In particular, (Eq. 3) contains $\log p(\mathbf{f}_{gt}|\mathbf{c})$, an extra loss term for predicting prosody from text
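
The factorized KL term in (Eq. 3) can be illustrated with a small NumPy sketch. It assumes closed-form KL between diagonal Gaussians and ignores the normalizing flow that VITS applies to the latent, so it is an illustration of the factorization rather than the paper's training code. The function names and the `d_text` split point are hypothetical.

```python
import numpy as np

def kl_diag_gauss(mu_q, logvar_q, mu_p, logvar_p):
    """KL(q || p) between diagonal Gaussians, summed over the last axis."""
    var_q, var_p = np.exp(logvar_q), np.exp(logvar_p)
    kl = 0.5 * (logvar_p - logvar_q + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0)
    return kl.sum(axis=-1)

def factorized_prior_kl(mu_q, logvar_q, text_prior, prosody_prior, d_text):
    """KL(q(z|x) || p(z_t|c) p(z_p|f_gt)) as a sum of two subspace KLs.

    Assumption: the first d_text channels of z form the text subspace and
    the remaining channels form the prosody subspace.
    """
    mu_t, logvar_t = text_prior
    mu_p, logvar_p = prosody_prior
    kl_text = kl_diag_gauss(mu_q[..., :d_text], logvar_q[..., :d_text], mu_t, logvar_t)
    kl_pros = kl_diag_gauss(mu_q[..., d_text:], logvar_q[..., d_text:], mu_p, logvar_p)
    return kl_text + kl_pros
```

Because every distribution is diagonal, the sum of the two subspace KLs equals the KL against the concatenated prior, which is why the text and prosody priors can be produced by separate encoders.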

Factorized-VITS: (left) Training, (right) Inference

- Text-Subspace Alignment

The paper uses the Monotonic Alignment Search of VITS
- BUT, in practice the search operates only in the text subspace, so the alignment is obtained as an approximation, as in (Eq. 4):
  (Eq. 4) $\arg\max_{\hat{A}}\log p_{\theta}(\mathbf{z}|\mathbf{c},\mathbf{f},\hat{A})=\arg\max_{\hat{A}}\log \mathcal{N}\left(f_{\theta}(\mathbf{z}); \mu_{\theta}(\mathbf{c},\mathbf{f},\hat{A}),\sigma_{\theta}( \mathbf{c},\mathbf{f},\hat{A})\right)$
  $\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\approx\arg\max_{\hat{A}}\log \mathcal{N}\left(f_{\theta}(\mathbf{z})_{\{:text\}};\mu_{\theta}(\mathbf{c},\hat{A})_{\{:text\} },\sigma_{\theta}(\mathbf{c},\hat{A})_{\{:text\} }\right)$
- $\hat{A}$ : alignment, $\theta$ : model weights held fixed during Viterbi training, $*_{ \{:text\}}$ : text portion of the vector
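
The approximation in (Eq. 4) can be sketched as a monotonic alignment search that scores only the text channels of the latent. This is a minimal NumPy version of the dynamic program, assuming the first `d_text` channels are the text subspace; all function names here are hypothetical, and the actual VITS implementation uses an optimized batched kernel.

```python
import numpy as np

def gaussian_loglik(z, mu, logvar):
    """log N(z_j; mu_i, sigma_i) for every (text token i, frame j) pair.

    z: [n_frames, D], mu and logvar: [n_text, D] -> result: [n_text, n_frames]
    """
    diff = z[None, :, :] - mu[:, None, :]
    lv = logvar[:, None, :]
    ll = -0.5 * (np.log(2.0 * np.pi) + lv + diff ** 2 / np.exp(lv))
    return ll.sum(axis=-1)

def monotonic_alignment_search(loglik):
    """Viterbi-style search; returns the text token index assigned to each frame."""
    n_text, n_frames = loglik.shape
    assert n_frames >= n_text, "each token needs at least one frame"
    Q = np.full((n_text, n_frames), -np.inf)
    Q[0, 0] = loglik[0, 0]
    for j in range(1, n_frames):
        for i in range(min(j + 1, n_text)):
            stay = Q[i, j - 1]                             # same token as previous frame
            move = Q[i - 1, j - 1] if i > 0 else -np.inf   # advance to the next token
            Q[i, j] = max(stay, move) + loglik[i, j]
    # Backtrack from the last token at the last frame.
    path = np.zeros(n_frames, dtype=int)
    i = n_text - 1
    for j in range(n_frames - 1, -1, -1):
        path[j] = i
        if i > 0 and (i == j or Q[i - 1, j - 1] >= Q[i, j - 1]):
            i -= 1
    return path

def text_subspace_alignment(z, mu_text, logvar_text, d_text):
    """Align using only the text channels z[:, :d_text], as in (Eq. 4)."""
    return monotonic_alignment_search(gaussian_loglik(z[:, :d_text], mu_text, logvar_text))
```

Per-token durations follow from the path as `np.bincount(path, minlength=n_text)`. The prosody channels never enter the score, so the prosody prior is not needed at alignment time.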

- Frame-Prosody Prior Encoder

The paper assumes that all prior and posterior distributions are Gaussian
- The prosody prior encoder therefore transforms the pitch and energy contours, which are not Gaussian-distributed, into a Gaussian distribution
  - This is done by projecting the input into a higher-dimensional space; structurally, the prosody prior encoder consists of 2 Transformer layers
- Additionally, the paper applies prosody prior encoding after upsampling in order to directly learn a frame-level prosody prior
  - That is, as shown in the figure below, the prior encoder learns the frame-level prosody prior from the upsampled sequence
- The frame-prosody prior encoder lets Factorized-VITS incorporate frame-level prosodic information while adhering to the Gaussian assumption
  - In particular, the Transformer layers capture intricate relationships within the prosodic features, supporting natural and expressive speech synthesis
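
The "upsample first, then encode" order can be sketched as follows. This is only an illustration under stated assumptions: a single linear projection stands in for the 2 Transformer layers, and `length_regulate`, `prosody_prior_params`, `W`, and `b` are hypothetical names rather than anything from the paper.

```python
import numpy as np

def length_regulate(token_feats, durations):
    """Repeat each text-level row durations[i] times to get a frame-level sequence."""
    return np.repeat(token_feats, durations, axis=0)

def prosody_prior_params(frame_prosody, W, b):
    """Map frame-level (pitch, energy) to the mean and log-variance of p(z_p | f).

    Assumption: a linear layer replaces the Transformer-based encoder; the
    output is split in half along the channel axis.
    """
    h = frame_prosody @ W + b
    mu, logvar = np.split(h, 2, axis=-1)
    return mu, logvar

# Text-level pitch and energy for 3 tokens, with durations in frames.
token_prosody = np.array([[100.0, 0.5], [120.0, 0.7], [90.0, 0.4]])
durations = np.array([2, 1, 3])
frame_prosody = length_regulate(token_prosody, durations)  # shape (6, 2)

rng = np.random.default_rng(0)
W, b = rng.normal(size=(2, 8)), np.zeros(8)
mu_p, logvar_p = prosody_prior_params(frame_prosody, W, b)  # each of shape (6, 4)
```

Encoding after upsampling means the prior parameters are produced per frame, rather than once per token and then copied across that token's frames.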

Prior Encoder: (left) VITS, (right) Factorized-VITS

- Joint Attribute Predictor with In-Context Learning

For prosody control, the non-autoregressive model in the paper must predict duration as well as text-side pitch and energy
- To this end, the paper introduces a Joint Attribute Predictor that learns the mapping to these attributes during the training phase
- The Joint Attribute Predictor has two characteristics:
  1. Joint Prediction
    - The predictor must estimate duration, pitch, and energy simultaneously
    - That is, it takes the concatenation of the 3 attributes as input so that potential interactions and correlations are captured during prediction
  2. In-Context Learning Structure
    - Adopts the regression duration predictor of VoiceBox: the attributes of the next incoming text chunk are predicted conditioned on the unmasked duration-pitch-energy of the previous chunk
    - This ensures fluent, consistent speech
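
The input construction for such a predictor might look like the sketch below. It assumes a VoiceBox-style masking setup in which the attributes of the target chunk are zeroed out and a mask channel is appended; the masked L1 loss and every name in the code are assumptions for illustration, since the notes above do not specify them.

```python
import numpy as np

def build_incontext_input(text_emb, duration, pitch, energy, target_start):
    """Concatenate text embeddings with joint attributes, hiding the target chunk.

    Tokens before target_start act as the context (attributes visible);
    tokens from target_start onward are the chunk to predict (attributes masked).
    """
    attrs = np.stack([duration, pitch, energy], axis=-1).astype(float)  # [T, 3]
    mask = np.zeros((len(attrs), 1))
    mask[target_start:] = 1.0                 # 1 = must be predicted
    visible = attrs * (1.0 - mask)            # context values kept, target zeroed
    x = np.concatenate([text_emb, visible, mask], axis=-1)
    return x, attrs, mask

def masked_l1(pred, target, mask):
    """Mean absolute error over masked tokens only (assumed regression loss)."""
    return float((np.abs(pred - target) * mask).sum() / (mask.sum() * target.shape[-1]))
```

A model trained on `x` would output a `[T, 3]` array, and only the masked rows would contribute to the loss, so duration, pitch, and energy are regressed jointly given the context chunk.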

3. Experiments

- Settings

Dataset : LJSpeech
Comparisons : VITS

- Results

Overall, Factorized-VITS shows better performance than the VITS baseline

Alignment Effectiveness
- Factorized-VITS achieves better alignment accuracy than the baseline

Prosody Reconstruction Capability
- Factorized-VITS also performs well in terms of pitch reconstruction
