[Paper Review] APTTS: Adversarial Post-Training in Latent Flow Matching for Fast and High-Fidelity Text-to-Speech

feVeRin 2025. 8. 20. 17:01

The generation quality of flow matching-based Text-to-Speech models depends on the number of sampling steps
APTTS
- Introduces an adversarial post-training strategy to reduce the number of sampling steps
- Treats the pre-trained flow matching model as a few-step generator and optimizes it with reconstruction and adversarial objectives
Paper (INTERSPEECH 2025) : Paper Link

1. Introduction

Zero-shot Text-to-Speech (TTS) aims to replicate a speaker's distinctive vocal characteristics from only a single speech prompt
- Most zero-shot TTS models, such as NaturalSpeech2, NaturalSpeech3, and DiTTo-TTS, use a latent diffusion model that applies the diffusion process to a well-regularized latent speech representation
  1. Alternatively, Flow Matching can be used for zero-shot TTS, as in VoiceBox, F5-TTS, and E2-TTS
  2. BUT, these Flow Matching models rely on a fixed intermediate representation such as the mel-spectrogram, which makes it hard to learn a task-specific representation
    - Latent Flow Matching can be used to address this
- Meanwhile, both Diffusion and Flow Matching models need many iterative sampling steps for high-quality generation
  - Combining a few-step generator with adversarial learning, as in FlashSpeech, ReFlow-TTS, and PeriodWave, can accelerate sampling

-> The paper therefore proposes APTTS, which accelerates inference through adversarial post-training

APTTS
- Introduces an Adversarial Post-Training (AP) strategy to accelerate inference
  1. First, a Variational AutoEncoder (VAE) learns a continuous latent speech representation, and a Flow Matching model is trained to approximate the normalized latent
  2. The trained Flow Matching model is then treated as a few-step generator and optimized with reconstruction and adversarial objectives
- Additionally, Classifier-Free Guidance (CFG) is introduced to improve the context and prompt fidelity of the few-step generator

< Overall of APTTS >

A zero-shot TTS model that improves inference speed with the AP strategy
As a result, it achieves better performance than existing models

2. Preliminary: Flow Matching

Flow Matching (FM) defines a probability path $p_{t}(x)$ that transforms a known initial distribution $p_{0}(x)$ into a target distribution $p_{1}(x)$ that approximates the true data distribution $q(x)$
- The FM model is parameterized by a neural network $\theta$ and constructs a time-dependent vector field $u_{t}(x):[0,1]\times \mathbb{R}^{d}\rightarrow \mathbb{R}^{d}$
- In Conditional Flow Matching (CFM), the FM model is trained to estimate the conditional vector field $u_{t}(x|x_{1})$ derived from the conditional probability path $p_{t}(x|x_{1})$:
  (Eq. 1) $ \mathcal{L}_{CFM}(\theta)=\mathbb{E}_{t,q(x_{1}),p_{t}(x|x_{1})} \left[\left|\left| u_{t}(x|x_{1})-v_{t}(x;\theta)\right|\right|^{2}\right]$
  - $t\sim \mathcal{U}[0,1]$, $x_{1}$ : sample drawn from $q(x)$, $v_{t}(x;\theta)$ : predicted vector field
- At inference, an Ordinary Differential Equation (ODE) solver queries the learned FM model for the current vector field at each time step and generates $x_{1}$ from $x_{0}\sim p_{0}(x)$
- A common choice of conditional path is the Optimal Transport path, which transfers each $x_{0}$ drawn from the standard Gaussian $p_{0}(x)$ to the target $x_{1}$ at constant velocity
  1. Here the vector field is time-invariant and defined as $u_{t}(x|x_{1})=x_{1}-(1-\sigma_{\min})x_{0}$
  2. The corresponding conditional probability path is $p_{t}(x|x_{1})=\mathcal{N}(x|tx_{1}, (1-(1-\sigma_{\min})t)^{2}I)$
    - $\sigma_{\min}$ : small constant
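
The optimal transport path and (Eq. 1) above can be sketched in NumPy as follows. This is a minimal illustration, not the paper's implementation: the value of `SIGMA_MIN`, the function names, and the fixed-step Euler solver are all assumptions, since the post only says that $\sigma_{\min}$ is a small constant and that an ODE solver is used.

```python
import numpy as np

SIGMA_MIN = 1e-4  # assumed value; the post only calls it a "small constant"


def ot_cfm_pair(x0, x1, t, sigma_min=SIGMA_MIN):
    """Return the point x_t on the OT conditional path and its target field u_t."""
    x_t = (1.0 - (1.0 - sigma_min) * t) * x0 + t * x1
    u_t = x1 - (1.0 - sigma_min) * x0  # constant in t
    return x_t, u_t


def cfm_loss(v_pred, u_t):
    """Squared L2 distance between predicted and target fields, averaged over samples."""
    return float(np.mean(np.sum((u_t - v_pred) ** 2, axis=-1)))


def euler_sample(vector_field, x0, num_steps):
    """Integrate dx/dt = v_t(x) from t=0 to t=1 with fixed-step Euler (assumed solver)."""
    x = x0.copy()
    dt = 1.0 / num_steps
    for i in range(num_steps):
        x = x + dt * vector_field(x, i * dt)
    return x
```

With an oracle field that always returns the target $u_{t}$, Euler integration ends at $x_{1}+\sigma_{\min}x_{0}$, i.e. at the target up to a small residual of the initial noise, regardless of the number of steps. A learned field is only approximately constant, which is why the step count matters in practice.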

3. Method

APTTS training consists of 3 stages
1. First, the VAE is pre-trained to learn a continuous latent space from speech
  - Pitch information is extracted separately and passed to the VAE decoder
2. Then the remaining modules are trained
  - The FM decoder approximates the learned latent, and the pitch predictor estimates pitch information conditioned on the text and speech prompt
3. Finally, the AP strategy is applied to the FM decoder

- Pre-training VAE with Explicit Pitch Modeling

The paper adopts the VAE architecture of Period-VITS and modifies the encoder to receive a mel-spectrogram
- The explicit fundamental frequency $F0$ and a binary voicing flag are extracted separately to generate a sinusoidal source signal, which is then passed to the VAE decoder
  - This separated design improves pitch reconstruction and thus naturalness
- In particular, APTTS uses a conditional VAE with the goal of learning a well-regularized latent space
  1. To this end, the paper uses a latent dimension of $64$ and down-weights the regularization loss, defined as the KL-divergence between the latent distribution and a standard Gaussian, by $1\times 10^{-4}$
  2. Additionally, a Multi-Period Discriminator (MPD) and a Multi-Resolution Discriminator (MRD) with a truncated pointwise relative loss are introduced
- FM Decoder and Pitch Predictor

After VAE pre-training, the remaining modules are trained to estimate the learned latent and the pitch information
- First, the text sequence is encoded by a Transformer encoder conditioned on a global style embedding extracted from the speech prompt
  - Specifically, a NeXt-TDNN-based style encoder produces the embedding from the mel-spectrogram
- A self-supervised aligner is trained jointly to obtain the alignment between the text encoder output $c_{text}$ and the frame-level latent $z_{1}$, and a duration predictor estimates the text durations from $c_{text}$
  - A length regulator then upsamples $c_{text}$ to the frame-level $c$, yielding the style-aware context embedding used by both the FM decoder and the pitch predictor
- The FM decoder uses a Diffusion Transformer (DiT) as its backbone
  - RoPE is integrated into the self-attention layers, and AdaLN-Single is applied to reduce the number of model parameters
- In particular, the paper treats zero-shot TTS as a speech infilling task
  1. That is, as in DiTTo-TTS and VoiceBox, the model learns to reconstruct a masked segment from the text and the surrounding speech segments
  2. Let $z_{1}$ be the learned latent, $z_{0}$ a standard Gaussian sample, and $m$ a binary temporal mask. Under the optimal transport setting, the decoder receives:
    - Flow step $t\in [0,1]$
    - Noisy latent $z_{t}=(1-(1-\sigma_{\min})t)z_{0}+tz_{1}$
    - Latent prompt $p=(1-m)\odot z_{1}$
    - Context embedding $c$
  3. The flow step $t$ is embedded with a sinusoidal positional encoding, and the sequence $(z_{t},p,c)$ is concatenated and projected to the hidden dimension of the DiT
  4. The decoder is trained with the $\mathcal{L}_{CFM}$ objective to estimate the target vector field $v_{t}=z_{1}-(1-\sigma_{\min})z_{0}$
    - The loss is computed only on masked frames
    - The mask $m$ covers a random contiguous region whose length is sampled uniformly between $70\%$ and $100\%$
- To improve context and prompt fidelity, the paper applies a dual CFG strategy to the FM decoder
  1. During training, the context embedding $c$ is dropped with probability $5\%$, the prompt $p$ with $10\%$, and both $(c,p)$ with $10\%$
    - This dropout scheme lets the model learn both conditional and unconditional outputs
  2. At inference, the latent is sampled by an ODE solver whose vector field is guided at each time step $t$:
    (Eq. 2) $\tilde{v}_{t}(z_{t},p,c)=v_{t}(z_{t},p,c)+ \alpha_{p}\left(v_{t}(z_{t},p,\varnothing)- v_{t}(z_{t},\varnothing, \varnothing)\right)+\alpha_{c} \left(v_{t}(z_{t},\varnothing,c)-v_{t}(z_{t},\varnothing,\varnothing)\right)$
    - $\varnothing$ : null condition, $\alpha_{p}, \alpha_{c}$ : guidance scale
- The pitch predictor estimates pitch information from $c$ and, like the FM decoder, is conditioned on the speech prompt
  - The same mask $m$ is applied to the pitch prediction loss, and the first 20 bins of the surrounding segment $(1-m)$ are used as the prompt and concatenated with $c$, the pitch predictor input
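
The infilling setup and the dual CFG rule of (Eq. 2) can be put together in a small NumPy sketch. It is only an illustration under stated assumptions: the function names are hypothetical, the mask length is taken as a fraction of the frame count, the three dropout cases are treated as mutually exclusive, `None` stands in for the null condition $\varnothing$, and `sigma_min=1e-4` is an assumed value.

```python
import numpy as np


def sample_span_mask(num_frames, rng, min_ratio=0.7, max_ratio=1.0):
    """Binary mask m of shape [T, 1] with one contiguous span of ones.

    Assumption: the 70%-100% range is a fraction of the frame count.
    """
    span = int(round(rng.uniform(min_ratio, max_ratio) * num_frames))
    span = max(1, min(span, num_frames))
    start = rng.integers(0, num_frames - span + 1)  # upper bound is exclusive
    m = np.zeros((num_frames, 1))
    m[start:start + span] = 1.0
    return m


def decoder_inputs(z1, m, t, rng, sigma_min=1e-4):
    """Noisy latent z_t, latent prompt p, and the target vector field."""
    z0 = rng.standard_normal(z1.shape)
    z_t = (1.0 - (1.0 - sigma_min) * t) * z0 + t * z1
    p = (1.0 - m) * z1  # only the unmasked surroundings are visible
    target = z1 - (1.0 - sigma_min) * z0
    return z_t, p, target


def masked_cfm_loss(v_pred, target, m):
    """Squared error averaged over masked frames only."""
    sq = ((v_pred - target) ** 2) * m
    return float(sq.sum() / max(m.sum() * v_pred.shape[-1], 1.0))


def drop_conditions(p, c, rng):
    """Training-time dropout: c alone 5%, p alone 10%, both 10% (None = null)."""
    r = rng.uniform()
    if r < 0.05:
        return p, None
    if r < 0.15:
        return None, c
    if r < 0.25:
        return None, None
    return p, c


def dual_cfg(v_pc, v_p, v_c, v_null, alpha_p, alpha_c):
    """Guided vector field of Eq. 2 from the four conditional variants."""
    return v_pc + alpha_p * (v_p - v_null) + alpha_c * (v_c - v_null)
```

Note that (Eq. 2) needs four forward passes per ODE step (full, prompt-only, context-only, and unconditional), which is part of why guided sampling is expensive.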

- Adversarial Post-Training

To accelerate inference, the paper applies the Adversarial Post-Training (AP) strategy to the trained FM decoder
- In this stage, the FM decoder is repurposed as a few-step generator that runs the ODE solver with fixed time steps, which reduces the training-inference mismatch
  1. A reconstruction loss directly aligns the output latent with the target, and an adversarial loss from a discriminator is incorporated to align the generated latent distribution under a limited number of sampling steps
  2. For this, the paper introduces a Joint Conditional and Unconditional (JCU) discriminator based on the GANSpeech architecture
    - To improve the similarity between the generated speech and the prompt, the discriminator is conditioned on the time-averaged embedding of the prompt $p$, and strided convolutions are replaced with dilated convolutions
- The resulting overall AP loss is:
  (Eq. 3) $\mathcal{L}_{AP} =m\odot \left(\mathcal{L}_{recon}+\mathcal{L}_{JCU}+\lambda_{fm} \mathcal{L}_{fm}\right)$
  - $\mathcal{L}_{recon}$ : $L1$ reconstruction loss, $\mathcal{L}_{JCU}$ : least-squares adversarial loss, $\mathcal{L}_{fm}$ : feature matching loss, $\lambda_{fm}$ : feature matching weight, $m$ : mask
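
A generator-side reading of (Eq. 3) might look like the sketch below. It is an assumption-heavy illustration: it supposes the JCU discriminator returns frame-aligned conditional and unconditional scores plus intermediate feature maps, so that the mask $m$ can be applied to every term, and it leaves $\lambda_{fm}$ as a required argument because the post does not give its value. The function names are hypothetical.

```python
import numpy as np


def masked_mean(x, m):
    """Average x over the frames where m == 1 (m has shape [T, 1])."""
    w = np.broadcast_to(m, x.shape)
    return float((x * w).sum() / max(w.sum(), 1.0))


def ap_generator_loss(z_gen, z_target, d_fake_cond, d_fake_uncond,
                      feats_fake, feats_real, m, lambda_fm):
    """Masked sum of L1 reconstruction, least-squares adversarial, and feature matching.

    Assumes frame-aligned discriminator outputs of shape [T, 1] and
    feature maps of shape [T, C] so the same mask applies to each term.
    """
    recon = masked_mean(np.abs(z_gen - z_target), m)
    # Least-squares GAN, generator side: push both discriminator heads toward 1.
    adv = (masked_mean((d_fake_cond - 1.0) ** 2, m)
           + masked_mean((d_fake_uncond - 1.0) ** 2, m))
    # Feature matching: L1 distance between discriminator features of fake and real.
    fm = sum(masked_mean(np.abs(f - r), m) for f, r in zip(feats_fake, feats_real))
    return recon + adv + lambda_fm * fm
```

Only the generator objective is shown; the discriminator would be trained with the matching least-squares loss on real and generated latents.
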
Hybrid CFG
- Existing few-step acceleration methods use a fixed CFG scale during training
  - BUT, both conditional and unconditional outputs must be computed, which increases training time
- In addition, with a large guidance scale, error accumulation in CFG causes a discrepancy
- To address this, the paper introduces the Hybrid CFG technique
  1. During AP, the decoder is always conditioned on both the prompt $p$ and the context embedding $c$
    - At inference, the output of the base FM decoder, i.e. the checkpoint from before AP, is used to refine the vector field of the few-step generator
  2. Here, every term in (Eq. 2) except $v_{t}(z_{t},p,c)$ is derived from the base model at each time step
    - This hybrid CFG strategy achieves better prompt and context fidelity
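
The hybrid CFG sampling loop described above can be sketched as follows. This is a hedged illustration: `few_step_v` and `base_v` are hypothetical callables standing in for the AP-tuned decoder and the pre-AP checkpoint, `None` represents the null condition $\varnothing$, and fixed-step Euler is an assumed solver.

```python
import numpy as np


def hybrid_cfg_sample(few_step_v, base_v, z0, p, c, num_steps, alpha_p, alpha_c):
    """Few-step sampling where only the fully conditioned term uses the AP model.

    few_step_v(z, t, p, c): AP-tuned generator, always given both conditions.
    base_v(z, t, p, c): pre-AP checkpoint; None marks a dropped condition.
    """
    z = z0.copy()
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        v_main = few_step_v(z, t, p, c)
        # All guidance terms of Eq. 2 come from the base model.
        v_null = base_v(z, t, None, None)
        v_p = base_v(z, t, p, None)
        v_c = base_v(z, t, None, c)
        v = v_main + alpha_p * (v_p - v_null) + alpha_c * (v_c - v_null)
        z = z + dt * v  # Euler update (assumed solver)
    return z
```

Because the few-step generator never sees dropped conditions during AP, it cannot supply the guidance terms itself, which is why they are borrowed from the base checkpoint.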

- Inference

Given a speech prompt $y_{p}$ and its text $x_{p}$, APTTS aims to synthesize speech for a target text $x_{t}$ that emulates the characteristics of $y_{p}$
- First, $x_{p}$ is encoded and the aligner is used to obtain its alignment to $z_{p}$
  - Here $z_{p}$ is extracted from $y_{p}$ by the VAE encoder
- Then the text encoder processes the concatenated sequence $(x_{p},x_{t})$, and the resulting output is upsampled to obtain the context embedding $c$
  - The initial segment for $x_{p}$ follows the pre-obtained alignment, and the subsequent segment for $x_{t}$ is upsampled according to the durations predicted by the duration predictor
- Next, conditioned on the prompt and $c$, the FM decoder and the pitch predictor generate the latent and the pitch information, respectively
  - The prompt is $z_{p}$ for the FM decoder and the first 20 bins of $y_{p}$ for the pitch predictor
- Finally, the initial segment of the output corresponding to the length of $y_{p}$ is discarded, and the rest is converted to a waveform by the VAE decoder
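
The upsampling and prompt-removal steps of this inference procedure can be illustrated with a short NumPy sketch. The function names and the use of integer frame durations are assumptions made for illustration; the encoder, aligner, and decoders themselves are not modeled.

```python
import numpy as np


def length_regulate(c_text, durations):
    """Repeat each token embedding by its integer frame duration."""
    return np.repeat(c_text, np.asarray(durations, dtype=int), axis=0)


def build_context(c_text, prompt_durations, target_durations):
    """Upsample the encoder output for the concatenated text (x_p, x_t).

    prompt_durations come from the aligner, target_durations from the
    duration predictor. Returns the context c and the prompt length in frames.
    """
    durations = np.concatenate([prompt_durations, target_durations]).astype(int)
    c = length_regulate(c_text, durations)
    prompt_len = int(np.sum(prompt_durations))
    return c, prompt_len


def drop_prompt_frames(latent, prompt_len):
    """Discard the leading frames that correspond to the speech prompt."""
    return latent[prompt_len:]
```

For example, with four tokens, aligned prompt durations `[2, 1]`, and predicted target durations `[3, 1]`, the context has 7 frames, of which the first 3 belong to the prompt and are discarded from the generated latent before VAE decoding.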

4. Experiments

- Settings

Dataset : LibriTTS, VCTK, HiFi-TTS
Comparisons : YourTTS, F5-TTS, VALL-E, VoiceBox, CLaM-TTS, DiTTo-TTS, FlashSpeech, HierSpeech++

- Results

APTTS achieves excellent synthesis quality

It also performs well on objective evaluation

AP Scalability
- Applying the AP strategy improves the performance of flow matching models
