[Paper Review] FC-TTS: Style and Timbre Control in Zero-Shot Text-to-Speech with Disentangled Speech Representations

feVeRin 2026. 6. 8. 10:57

FC-TTS: Style and Timbre Control in Zero-Shot Text-to-Speech with Disentangled Speech Representations

Zero-shot Text-to-Speech still falls short in independent, precise control of speech attributes
FC-TTS
- Introduces a 2-stage spectrogram generation pipeline and a VQ-VAE-based style encoder
- Additionally introduces a conditioning-aware consistency loss to improve attribute separation and the reliability of dual-reference control
Paper (ACL 2026) : Paper Link

1. Introduction

Zero-shot Text-to-Speech (TTS) offers style flexibility by conditioning on an example utterance
- BUT, most zero-shot TTS models entangle style and timbre, which makes independent control difficult
- Disentangled approaches such as NANSY++ and NaturalSpeech3 can be considered, but they still suffer from imperfect disentanglement and limited robustness to unseen style-timbre combinations

-> The paper therefore proposes FC-TTS, which improves controllability through better modeling of disentangled representations

FC-TTS
- Introduces a 2-stage spectrogram generation pipeline built on disentangled representations
- Introduces a VQ-VAE-based style encoder that captures fine-grained, intra-utterance style variability, and a Conditioning-aware Consistency Loss that supports disentangled control

< Overview of FC-TTS >

A zero-shot controllable TTS model that uses a 2-stage pipeline and disentangled representations
As a result, it achieves better performance than existing methods

2. Preliminary

- Factorized Speech Codec

FACodec can be considered for timbre- and style-controllable TTS
- The codec factorizes a speech signal into multiple disentangled streams of discrete tokens, and each stream captures a distinct speech attribute
  - Prosody token $\mathbf{c}_{p}$, content token $\mathbf{c}_{c}$, acoustic detail token $\mathbf{c}_{d}$
- Each stream is represented as $\mathbb{Z}^{N_{*}\times T}$ for $T$ time steps and residual quantization levels $N_{p}=1, N_{c}=2, N_{d}=3$, while speaker timbre is captured by a continuous global embedding $z_{spk}\in\mathbb{R}^{D}$
- The paper uses only $z_{spk}, \mathbf{c}_{p}$ as conditions; the content token $\mathbf{c}_{c}$ and acoustic detail token $\mathbf{c}_{d}$ are excluded to prevent information leakage
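
The stream layout above can be made concrete with a small shape check. This is only a sketch with random stand-in tokens, not FACodec itself: the values of `T`, `D`, and `VOCAB` and the helper `select_conditions` are hypothetical names chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

T = 50          # number of time steps (hypothetical value)
D = 256         # speaker embedding size (hypothetical value)
VOCAB = 1024    # codebook size per quantizer (hypothetical value)

# Residual quantization levels per stream, as listed in the text.
LEVELS = {"prosody": 1, "content": 2, "detail": 3}

# Stand-in for a codec encoder output: one integer array of shape (N_*, T) per stream.
streams = {name: rng.integers(0, VOCAB, size=(n, T)) for name, n in LEVELS.items()}
z_spk = rng.standard_normal(D)  # continuous global timbre embedding


def select_conditions(streams, z_spk):
    """Keep only the timbre embedding and the prosody tokens as conditions."""
    return {"z_spk": z_spk, "c_p": streams["prosody"]}


cond = select_conditions(streams, z_spk)
for name, tokens in streams.items():
    print(name, tokens.shape)
print(sorted(cond))
```

Content and acoustic detail tokens never reach the conditioning dictionary, which mirrors the leakage argument in the last bullet.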

- Flow-Matching TTS

Conditional TTS generates a target speech representation $\mathbf{x}\in\mathbb{R}^{F\times T}$, such as a log mel-spectrogram, given a phoneme sequence $\mathbf{y}\in\mathbb{Z}^{L}$ and conditioning information $\mathbf{c}$ such as speaker timbre
- Conditional Flow-Matching (CFM) can be adopted for this conditional generation
  1. CFM defines a continuous-time transformation from a simple prior $p_{1}(\mathbf{x})$, such as an isotropic Gaussian $\mathcal{N}(0,I)$, to the target conditional distribution $p_{0}=p(\mathbf{x}|\mathbf{y},\mathbf{c})$
  2. To describe the progression from $p_{1}$ to $p_{0}$, CFM introduces a time-dependent flow $\phi_{t}:[0,1]\times \mathbb{R}^{d}\rightarrow \mathbb{R}^{d}$ that transports samples along time $t\in[0,1]$
    - Each step has a marginal distribution $p_{t}(\mathbf{x})$
  3. The flow is driven by a velocity field $v_{t}(\mathbf{x}):[0,1]\times \mathbb{R}^{d}\rightarrow \mathbb{R}^{d}$ that specifies the instantaneous direction at each point, and the relationship is governed by an Ordinary Differential Equation (ODE):
    (Eq. 1) $ \frac{d}{dt}\phi_{t}(\mathbf{x})=v_{t}(\phi_{t}(\mathbf{x})), \,\,\, \phi_{1}(\mathbf{x})=\mathbf{x}_{1}$
- Since the true $v$ is unavailable, CFM approximates it by training $u_{\theta}(\mathbf{x},t,\mathbf{y},\mathbf{c})$ against the conditional vector field $v_{t}(\mathbf{x}|\mathbf{x}_{0})$
  1. The straight-line Optimal Transport (OT) trajectory is the most efficient, and its ground-truth velocity is:
    (Eq. 2) $v_{t}^{OT}(\mathbf{x}_{t}|\mathbf{x}_{0})=\mathbf{x}_{0}-\mathbf{x}_{1}$
    - $\mathbf{x}_{t}=(1-t)\mathbf{x}_{1}+t\mathbf{x}_{0}$, where $\mathbf{x}_{1}\sim p_{1}$ is a prior sample and $\mathbf{x}_{0}\sim p_{0}$ is the target
  2. The model then minimizes the loss in (Eq. 3) to align the predicted velocity $u_{\theta}$ with the OT velocity:
    (Eq. 3) $\mathcal{L}_{CFM}=\mathbb{E}_{t,\mathbf{x}_{0},\mathbf{x}_{1}}\left[\left|\left| u_{\theta}(\mathbf{x}_{t},t,\mathbf{y},\mathbf{c}) - (\mathbf{x}_{0}-\mathbf{x}_{1})\right|\right|^{2}\right]$
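
The interpolation, target velocity, and loss in (Eq. 2) and (Eq. 3) can be checked numerically. The sketch below follows the interpolation exactly as written above, so $t=0$ gives the prior sample $\mathbf{x}_{1}$ and $t=1$ gives the target $\mathbf{x}_{0}$. The conditions $\mathbf{y},\mathbf{c}$ are omitted, and `oracle` is a hypothetical stand-in for a trained $u_{\theta}$ that simply returns the exact OT velocity; the function names are illustrative only.

```python
import numpy as np


def interpolate(x0, x1, t):
    """Point on the straight-line path: x_t = (1 - t) * x1 + t * x0."""
    return (1.0 - t) * x1 + t * x0


def ot_velocity(x0, x1):
    """Target velocity of the straight-line path (Eq. 2)."""
    return x0 - x1


def cfm_loss(u_theta, x0, x1, t):
    """Squared error between predicted and target velocity (Eq. 3), one sample."""
    x_t = interpolate(x0, x1, t)
    diff = u_theta(x_t, t) - ot_velocity(x0, x1)
    return float(np.sum(diff ** 2))


def euler_sample(u_theta, x1, n_steps=10):
    """Integrate dx/dt = u_theta(x, t) with Euler steps, starting from the prior sample."""
    x = x1.copy()
    dt = 1.0 / n_steps
    for i in range(n_steps):
        x = x + dt * u_theta(x, i * dt)
    return x


rng = np.random.default_rng(0)
F, T = 4, 6
x0 = rng.standard_normal((F, T))   # stand-in for a target log mel-spectrogram
x1 = rng.standard_normal((F, T))   # sample from the Gaussian prior


def oracle(x_t, t):
    """Hypothetical perfect model: returns the exact OT velocity."""
    return x0 - x1


print(cfm_loss(oracle, x0, x1, t=0.3))
print(np.allclose(euler_sample(oracle, x1), x0))
```

Because the OT velocity is constant along the path, Euler integration with a perfect velocity model recovers the target from the prior sample, which is the reason straight-line paths are attractive for few-step sampling.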

3. Method

In FC-TTS, the timbre condition $z_{spk}$ and the style condition $\mathbf{c}_{p}$ are extracted from the same target during training, while at inference they are taken from references in different utterances
- Structurally, FC-TTS processes the two conditions sequentially:
  1. The timbre stage anchors timbre characteristics through $z_{spk}$ and generates a blurry spectrogram
  2. The style stage refines it by imprinting prosodic characteristics through $\mathbf{c}_{p}$
- This 2-stage framework ensures that each reference condition is reflected only at its intended step

- Hierarchical Spectrogram Generation

Simply reusing the FACodec decoder, as in NaturalSpeech3, cannot guarantee robust generation for unseen combinations, which makes independent timbre-prosody control difficult
- To address this, the paper incorporates a jointly trained CFM speech decoder and performs hierarchical log mel-spectrogram generation
- Timbre information is first used to generate a blurry spectrogram $\mathbf{h}$, which the CFM decoder then refines into the complete spectrogram $\mathbf{x}_{0}$ using style information
  1. The two steps are jointly trained with a Mean Absolute Error (MAE) loss on the blurry spectrogram and a CFM loss on the final output
    - In particular, the MAE objective $\mathcal{L}_{blur}=\mathbb{E}[||\mathbf{h}-\mathbf{x}_{0}||_{1}]$ encourages over-smoothed output
  2. Additionally, to prevent information leakage, $z_{spk}$ is randomly replaced with one extracted from a different utterance of the same long audio file
- As a result, the timbre adapter injects $z_{spk}$ in the first stage to anchor timbre characteristics, and the style adapter subsequently applies $\mathbf{c}_{p}$, ensuring that each reference influences only its dedicated pathway
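
The two training details above (the MAE term on $\mathbf{h}$ and the random swap of the $z_{spk}$ source) can be sketched as follows. This is an illustration under assumptions, not the paper's code: the function names, the `utts_by_file` layout, the fallback when a file has a single utterance, and the weight `w_blur` are all hypothetical, and the MAE is averaged over elements, which differs from the L1 norm only by a constant factor.

```python
import numpy as np


def blur_loss(h, x0):
    """Mean absolute error between the blurry spectrogram h and the target x0."""
    return float(np.mean(np.abs(h - x0)))


def sample_timbre_utterance(utts_by_file, file_id, target_utt, rng):
    """Pick a different utterance of the same long audio file for z_spk extraction.

    Falls back to the target itself when the file has no other utterance
    (an assumption; the review does not describe this case).
    """
    candidates = [u for u in utts_by_file[file_id] if u != target_utt]
    if not candidates:
        return target_utt
    return candidates[rng.integers(len(candidates))]


def joint_loss(h, x0, cfm_term, w_blur=1.0):
    """Stage-1 MAE plus stage-2 CFM loss; the weight w_blur is an assumption."""
    return w_blur * blur_loss(h, x0) + cfm_term


rng = np.random.default_rng(0)
utts_by_file = {"book_01": ["utt_a", "utt_b", "utt_c"], "book_02": ["utt_d"]}

picked = sample_timbre_utterance(utts_by_file, "book_01", "utt_b", rng)
print(picked)

x0 = rng.standard_normal((4, 6))   # stand-in target spectrogram
h = np.zeros_like(x0)              # stand-in blurry spectrogram
print(joint_loss(h, x0, cfm_term=0.5))
```

Sampling the timbre reference from the same file keeps the speaker fixed while changing the utterance, so $z_{spk}$ cannot carry utterance-specific prosody into the first stage.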

- VQ-VAE Style Encoding

Zero-shot TTS models use In-Context Learning (ICL) to mimic voice characteristics
- BUT, ICL assumes that timbre and style are consistent, so it cannot reflect a speaking style that varies within a single utterance
- To address this, the paper provides the style representation extracted from the target speech as a condition during training
  - In this case, the model may take a shortcut by copying surface-level acoustic features of the style reference instead of capturing higher-level prosodic patterns
- Therefore, a TCF style encoder that combines a Transformer encoder, Cross-attention, and a Finite Scalar Quantization (FSQ) layer is introduced to hierarchically model style representations at the phoneme and frame levels
  1. Prosody-only Representation
    - The paper uses only the prosody token $\mathbf{c}_{p}$ as TCF input and excludes the content token $\mathbf{c}_{c}$ and acoustic detail token $\mathbf{c}_{d}$
    - This lets the style encoder capture only rhythmic and intonational patterns without encoding unintended information
  2. Q-Former Bottleneck
    - A fixed set of learned query tokens attends to the variable-length encoder output through cross-attention, compressing it into a fixed number of latent tokens
    - This bottleneck forces the model to discard frame-level temporal detail and retain only high-level stylistic structure, which prevents overfitting to a specific acoustic realization
  3. Vector Quantization
    - The continuous latent tokens produced by the Q-Former are further discretized through FSQ
    - FSQ serves as an information bottleneck that suppresses low-level acoustic residuals and commits to semantically meaningful style codes
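
The two bottlenecks above can be illustrated with a minimal NumPy sketch. It is a simplification under stated assumptions: the cross-attention is single-head with no learned projections, the queries are random rather than learned, and the FSQ variant uses an odd number of levels per dimension and omits the straight-through gradient used during training. The names `query_bottleneck` and `fsq` are hypothetical.

```python
import numpy as np


def softmax(a, axis=-1):
    """Numerically stable softmax."""
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)


def query_bottleneck(queries, enc_out):
    """Single-head cross-attention: K queries attend to T encoder frames.

    queries: (K, d), enc_out: (T, d) -> output (K, d), independent of T.
    """
    d = queries.shape[-1]
    attn = softmax(queries @ enc_out.T / np.sqrt(d), axis=-1)  # (K, T)
    return attn @ enc_out


def fsq(z, levels=5):
    """Finite scalar quantization with an odd number of levels per dimension.

    Each value is squashed by tanh and rounded to one of `levels` integers
    in [-(levels - 1) / 2, (levels - 1) / 2].
    """
    half = (levels - 1) / 2.0
    return np.round(np.tanh(z) * half).astype(int)


rng = np.random.default_rng(0)
K, d = 4, 8
queries = rng.standard_normal((K, d))   # stand-in for learned query tokens

for T in (37, 90):                       # two utterances of different length
    latents = query_bottleneck(queries, rng.standard_normal((T, d)))
    codes = fsq(latents)
    print(T, latents.shape, codes.min(), codes.max())
```

Whatever the input length, the style code has the same $K\times d$ shape and each entry takes one of only five values, which is what limits how much low-level acoustic detail can pass through.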

- Conditional Consistency Loss

To improve condition consistency in disentangled TTS, the paper introduces the Conditional Consistency Loss (CCL)
- First, the CFM objective is reparameterized so that the FC-TTS decoder directly generates the log mel-spectrogram instead of the vector field
  1. Two attribute predictors are then trained on that spectrogram to predict the conditioning prosody token $\mathbf{c}_{p}$ and the speaker embedding $z_{spk}$
  2. Each predictor also receives the non-target conditioning signal: $z_{spk}$ is fed to the prosody predictor and $\mathbf{c}_{p}$ to the timbre predictor
- CCL is then the weighted sum of a cross-entropy loss for prosody prediction and a negative cosine similarity for speaker embedding consistency:
  (Eq. 4) $\mathcal{L}_{CCL}=\lambda_{ccl\text{-}pro}\cdot\mathbb{E}\left[\text{CE}\left(\mathbf{c}_{p}, f\left(\hat{\mathbf{x}},z_{spk}\right)\right)\right]-\lambda_{ccl\text{-}spk}\cdot\mathbb{E}\left[\cos\left( z_{spk},g\left(\hat{\mathbf{x}},\mathbf{c}_{p}\right)\right)\right]$
  - $\text{CE}(\cdot, \cdot)$ : cross-entropy loss, $f(\cdot)$ : prosody predictor, $g(\cdot)$ : speaker embedding predictor, $\hat{\mathbf{x}}$ : log mel-spectrogram generated by the decoder
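
A minimal sketch of (Eq. 4) for a single utterance is given below. The predictors $f$ and $g$ are not implemented; the function takes their outputs directly (`prosody_logits` for $f(\hat{\mathbf{x}},z_{spk})$ and `z_spk_pred` for $g(\hat{\mathbf{x}},\mathbf{c}_{p})$). The function names, the default weights of 1.0, and the single-level prosody token shape are assumptions made for illustration.

```python
import numpy as np


def cross_entropy(logits, targets):
    """Mean cross-entropy over frames. logits: (T, V), targets: (T,) integer ids."""
    logits = logits - logits.max(axis=-1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return float(-log_probs[np.arange(len(targets)), targets].mean())


def cosine(a, b, eps=1e-8):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + eps))


def ccl(prosody_logits, c_p, z_spk_pred, z_spk, lam_pro=1.0, lam_spk=1.0):
    """Eq. 4 for one utterance: weighted CE minus weighted cosine similarity."""
    ce_term = cross_entropy(prosody_logits, c_p)
    cos_term = cosine(z_spk, z_spk_pred)
    return lam_pro * ce_term - lam_spk * cos_term


rng = np.random.default_rng(0)
T, V, D = 5, 8, 16
c_p = rng.integers(0, V, size=T)    # stand-in prosody token ids
z_spk = rng.standard_normal(D)      # stand-in speaker embedding

good_logits = 20.0 * np.eye(V)[c_p]  # confident and correct predictions
bad_logits = np.zeros((T, V))        # uniform predictions

print(ccl(good_logits, c_p, z_spk, z_spk))    # consistent with both conditions
print(ccl(bad_logits, c_p, -z_spk, z_spk))    # inconsistent with both conditions
```

The loss is lowest when the generated spectrogram lets both predictors recover the conditions it was generated from, and it approaches $-\lambda_{ccl\text{-}spk}$ in that case.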

4. Experiments

- Settings

Dataset : LibriHeavy
Comparisons : NaturalSpeech3, F5-TTS, DiTTo-TTS, CLaM-TTS

- Results

Overall, FC-TTS achieves the best performance

FC-TTS also shows better timbre controllability

It also performs better in prosody control

FC-TTS is also preferred in the AudioLLM-as-a-Judge evaluation

Ablation Study
- Each component contributes to the performance improvement

In particular, the 2-stage design improves robustness to timbre combinations
