[Paper Review] DMP-TTS: Disentangled Multi-Modal Prompting for Controllable Text-to-Speech with Chained Guidance

feVeRin 2026. 3. 11. 11:08

DMP-TTS: Disentangled Multi-Modal Prompting for Controllable Text-to-Speech with Chained Guidance

Controllable Text-to-Speech is limited by attribute entanglement
DMP-TTS
- Uses a CLAP-based style encoder to align cues from reference audio and descriptive text, trained with contrastive learning and multi-task supervision on style attributes
- At inference, introduces chained Classifier-Free Guidance so that the style guidance strength can be adjusted independently
- Additionally uses Representation Alignment to support stable, accelerated training
Paper (ICASSP 2026) : Paper Link

1. Introduction

Controllable Text-to-Speech (TTS) uses reference audio, discrete labels, descriptive text, etc. to improve naturalness and controllability
- BUT, the entanglement of style and timbre limits independent control of each attribute
- Multi-modal style prompts over audio and descriptive text, as in ControlSpeech, can address this, but the text prompt may contain identity-related cues

-> DMP-TTS is therefore proposed to improve multi-modal prompting for controllable TTS

DMP-TTS
- Introduces a Diffusion Transformer (DiT) and Style-CLAP, a CLAP-based multi-modal style encoder
- Uses Chained Classifier-Free Guidance (cCFG) for fine-grained control and Representation Alignment (REPA) to improve stability

< Overview of DMP-TTS >

A controllable TTS model built on explicit disentanglement and a DiT architecture
As a result, it achieves better performance than existing models

2. Preliminary

The paper adopts Conditional Flow Matching (CFM) to train the latent DiT
- CFM builds a continuous-time flow between a noise sample $\mathbf{z}_{1}$ and a data sample $\mathbf{z}_{0}$ along a linear interpolation path:
  (Eq. 1) $ \mathbf{z}_{t}=(1-t)\mathbf{z}_{1}+t\mathbf{z}_{0},\,\,\, t\in[0,1]$
- The velocity along this path is defined as:
  (Eq. 2) $\mathbf{u}=\mathbf{z}_{0}-\mathbf{z}_{1}$
- The DiT is parameterized as a velocity network $v_{\theta}(\mathbf{z}_{t},c,t)$ and trained to approximate the ground-truth velocity field:
  (Eq. 3) $\mathcal{L}_{flow}=\mathbb{E}_{t,\mathbf{z}_{0},c}\left[\left|\left| v_{\theta}\left(\mathbf{z}_{t},c,t\right)-\mathbf{u}\right|\right|^{2}\right]$
  - $c$ : conditioning information
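
As a rough illustration of Eq. 1 to Eq. 3, the following sketch builds one training pair and evaluates the per-sample squared error. It uses plain Python lists in place of latent tensors; the function names `cfm_training_pair` and `flow_loss` and the toy values are assumptions made for this example, not code from the paper.

```python
import random


def cfm_training_pair(z0, z1, t):
    """Return (z_t, u) for data sample z0, noise sample z1, and time t in [0, 1]."""
    # Eq. 1: linear interpolation path (t=0 gives the noise, t=1 gives the data)
    z_t = [(1.0 - t) * n + t * d for d, n in zip(z0, z1)]
    # Eq. 2: target velocity, constant along the path
    u = [d - n for d, n in zip(z0, z1)]
    return z_t, u


def flow_loss(v_pred, u):
    """Squared L2 error for one sample; Eq. 3 averages this over t, z0, and c."""
    return sum((p - q) ** 2 for p, q in zip(v_pred, u))


rng = random.Random(0)
z0 = [0.5, -1.0, 2.0]                      # toy "data" latent (hypothetical values)
z1 = [rng.gauss(0.0, 1.0) for _ in z0]     # Gaussian noise sample
t = rng.random()                           # t ~ U[0, 1)
z_t, u = cfm_training_pair(z0, z1, t)

# A real model would predict v_theta(z_t, c, t); here a zero vector stands in for it.
loss = flow_loss([0.0] * len(u), u)
```

In an actual training loop the zero vector would be replaced by the DiT output, and the loss would be averaged over a batch.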

3. Method

- Overview

DMP-TTS uses stacked DiT blocks over mel-spectrogram latents to map Gaussian noise to the target latent representation
- Conditioning comes from 3 complementary inputs: content text, timbre, and style
  - These are processed by a text encoder, a speaker encoder, and a style encoder, respectively
- A duration predictor conditioned on the text and style embeddings controls phoneme-level alignment

- Unified Multi-Modal Style Encoder

For speaking style control, the paper introduces Style-CLAP, a unified style encoder based on pre-trained CLAP
- Textual style labels for emotion, energy, and speech rate are curated first, and descriptors unrelated to style (age, gender, pitch) are excluded so that the labels do not overlap with speaker identity
- To refine the alignment between audio and text style embeddings, the encoder is fine-tuned with an InfoNCE loss, where the contrastive loss $\mathcal{L}_{con}$ is:
  (Eq. 4) $ \mathcal{L}_{con}=-\mathbb{E}\left[\log \frac{\exp\left(\text{sim}(\mathbf{h}_{a},\mathbf{h}_{t})/\tau\right)}{\sum_{j=1}^{N}\exp\left( \text{sim}(\mathbf{h}_{a},\mathbf{h}_{t,j})/\tau\right)}\right]$
  - $\mathbf{h}_{a}=\mathcal{E}_{a}(\mathbf{A}_{i})$, $\mathbf{h}_{t}=\mathcal{E}_{t}(\mathbf{T}_{i})$ : embeddings extracted by the audio encoder $\mathcal{E}_{a}$ and the text encoder $\mathcal{E}_{t}$
  - $\mathbf{h}_{t,j}$ : text embedding of the $j$-th sample among the $N$ candidates
  - $\text{sim}(\cdot,\cdot)$ : cosine-similarity, $\tau$ : temperature
- The contrastive loss alone does not guarantee that the learned representation is discriminative for specific style attributes, so multi-task supervision is incorporated into the audio branch
  1. Cross-Entropy loss $\mathcal{L}_{ce}$ is used for discrete attributes and Mean Squared Error loss $\mathcal{L}_{mse}$ for continuous attributes
  2. The overall training objective is then:
    (Eq. 5) $\mathcal{L}_{style}=\mathcal{L}_{con}+\lambda_{c}\mathcal{L}_{ce}+\lambda_{m}\mathcal{L}_{mse}$
    - $\lambda_{c},\lambda_{m}$ : balancing parameters
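
The losses in Eq. 4 and Eq. 5 can be sketched on toy embeddings as below. This is a minimal pure-Python illustration, not the paper's implementation: the helper names, the temperature `tau=0.07`, and the default weights `lambda_c=lambda_m=1.0` are assumptions chosen for the example, and only the audio-to-text direction written in Eq. 4 is computed.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def info_nce(audio_embs, text_embs, tau=0.07):
    """Eq. 4 averaged over a batch; text_embs[i] is the positive for audio_embs[i]."""
    total = 0.0
    for i, h_a in enumerate(audio_embs):
        logits = [cosine(h_a, h_t) / tau for h_t in text_embs]
        # log-sum-exp with max subtraction for numerical stability
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_denom - logits[i]
    return total / len(audio_embs)


def style_loss(l_con, l_ce, l_mse, lambda_c=1.0, lambda_m=1.0):
    """Eq. 5: weighted sum of contrastive, classification, and regression terms."""
    return l_con + lambda_c * l_ce + lambda_m * l_mse


# Toy batch: two audio embeddings and their paired text embeddings
audio = [[1.0, 0.0], [0.0, 1.0]]
text_aligned = [[1.0, 0.0], [0.0, 1.0]]
text_swapped = [[0.0, 1.0], [1.0, 0.0]]

loss_aligned = info_nce(audio, text_aligned)   # pairs match, so the loss is small
loss_swapped = info_nce(audio, text_swapped)   # pairs are mismatched, so the loss is large
```

The aligned batch yields a much smaller loss than the swapped one, which is the behavior the contrastive fine-tuning relies on.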

- Chained Classifier-Free Guidance

CFG improves generation quality through random condition dropping
- BUT, the standard all-or-nothing setup provides only a global unconditional branch, which limits attribute disentanglement
- DMP-TTS therefore introduces chained Classifier-Free Guidance (cCFG), which allows content, timbre, and style to be controlled independently
  - During DiT training, hierarchical condition dropout follows Vevo's information levels, treating semantic text as high-level and acoustic attributes as lower-level
- The style condition $c_{style}$ is first dropped with probability $p_{style}$; if it is dropped, the timbre condition $c_{spk}$ is then dropped with probability $p_{spk}$
  1. If both style and timbre have been dropped, the text condition $c_{text}$ is dropped with probability $p_{text}$
  2. Additionally, a style perturbation is applied during training by randomly feeding a different utterance of the same speaker to the speaker encoder, which regularizes the timbre branch and reduces style leakage
  3. At inference, chained guidance becomes possible, and the final prediction $\hat{v}$ is obtained as:
    (Eq. 6) $\hat{v}=v(\varnothing) +s_{text}[v(c_{text})-v(\varnothing)]+ s_{spk}[v(c_{text},c_{spk})-v(c_{text})]+s_{style}[v(c_{text},c_{spk},c_{style})-v(c_{text},c_{spk})]$
    - $v(\cdot)$ : predicted velocity field
    - Tuning the guidance scales $s_{text},s_{spk},s_{style}$ allows each attribute to be controlled independently
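
The hierarchical dropout and the chained combination in Eq. 6 can be sketched as follows. This is a minimal illustration under stated assumptions: the function names, the dictionary-based mask, and the toy velocity values are hypothetical, and the four velocity predictions are passed in as plain lists rather than computed by a model.

```python
import random


def sample_condition_mask(p_style, p_spk, p_text, rng):
    """Hierarchical dropout: timbre can only be dropped after style, text only after both."""
    keep = {"text": True, "spk": True, "style": True}
    if rng.random() < p_style:
        keep["style"] = False
        if rng.random() < p_spk:
            keep["spk"] = False
            if rng.random() < p_text:
                keep["text"] = False
    return keep


def chained_cfg(v_null, v_text, v_text_spk, v_full, s_text, s_spk, s_style):
    """Eq. 6: add one scaled difference per condition on top of the unconditional branch."""
    return [
        n + s_text * (t - n) + s_spk * (ts - t) + s_style * (f - ts)
        for n, t, ts, f in zip(v_null, v_text, v_text_spk, v_full)
    ]


# Toy one-dimensional velocities for the four nested condition sets (hypothetical values)
v_null, v_text, v_text_spk, v_full = [0.0], [1.0], [3.0], [6.0]

# With every scale set to 1 the sum telescopes to the fully conditioned prediction.
v_plain = chained_cfg(v_null, v_text, v_text_spk, v_full, 1.0, 1.0, 1.0)

# Raising only s_style amplifies the style difference and leaves the other terms unchanged.
v_styled = chained_cfg(v_null, v_text, v_text_spk, v_full, 1.0, 1.0, 2.0)
```

Because the dropout is nested, training only ever sees the four condition sets that appear in Eq. 6, so each difference term at inference corresponds to a branch the model has been trained on.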

- Representation Alignment

The paper introduces a Representation Alignment (REPA) strategy to stabilize the multi-conditional TTS model
- REPA incorporates knowledge from a pre-trained model to inject an acoustic-semantic prior into the DiT
- Specifically, Whisper's audio encoder is used as the teacher to guide the DiT's intermediate representation
  1. Let Whisper's final-layer output be the teacher representation $\mathbf{h}_{whisper}\in\mathbb{R}^{T_{w}\times D_{w}}$ and the DiT's intermediate-layer output be the student representation $\mathbf{h}_{DiT}\in\mathbb{R}^{T_{d}\times D_{d}}$
  2. Since the two sequences differ in length and feature dimension, $\mathbf{h}_{DiT}$ is upsampled along the temporal axis and a linear projection $\mathcal{P}$ is then applied to match the target dimension $D_{w}$
  3. Alignment is obtained by maximizing the cosine-similarity, i.e., by minimizing:
    (Eq. 7) $\mathcal{L}_{repa}=1-\mathbb{E}_{t}\left[\text{sim}\left(\mathcal{P} \left(\text{Upsample}\left(\mathbf{h}_{DiT}\right)\right)_{t},\left(\mathbf{h}_{whisper}\right)_{t}\right)\right]$
    - $\text{sim}(\cdot, \cdot)$ : cosine-similarity, $\mathbb{E}_{t}$ : expectation over the temporal dimension $t$
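
A small sketch of Eq. 7 on toy frame sequences is given below. The notes only say that $\mathbf{h}_{DiT}$ is upsampled and linearly projected, so the nearest-neighbour resampling, the fixed projection matrix, and all function names here are assumptions made for illustration.

```python
import math


def upsample_nearest(frames, target_len):
    """Nearest-neighbour resampling along time to target_len frames (assumed scheme)."""
    src_len = len(frames)
    return [frames[min(src_len - 1, i * src_len // target_len)] for i in range(target_len)]


def project(frame, weight):
    """Linear projection P without bias; weight has shape (D_w, D_d)."""
    return [sum(w * x for w, x in zip(row, frame)) for row in weight]


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def repa_loss(h_dit, h_whisper, weight):
    """Eq. 7: one minus the frame-averaged cosine similarity of student and teacher."""
    student = [project(f, weight) for f in upsample_nearest(h_dit, len(h_whisper))]
    sims = [cosine(s, t) for s, t in zip(student, h_whisper)]
    return 1.0 - sum(sims) / len(sims)


# Toy shapes: T_d=2, D_d=2 for the student and T_w=4, D_w=3 for the teacher
h_dit = [[1.0, 0.0], [0.0, 1.0]]
weight = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
h_whisper = [[2.0, 0.0, 0.0], [2.0, 0.0, 0.0], [0.0, 3.0, 0.0], [0.0, 3.0, 0.0]]

loss_same_direction = repa_loss(h_dit, h_whisper, weight)
loss_opposite = repa_loss(h_dit, [[-x for x in f] for f in h_whisper], weight)
```

The loss ranges from 0 (student and teacher frames point in the same direction) to 2 (opposite directions), and it ignores the magnitude of the features.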

4. Experiments

- Settings

Dataset : Chinese speech dataset (internal)
Comparisons : CosyVoice, CosyVoice2, IndexTTS2

- Results

Overall, DMP-TTS achieves the best performance among the compared models

Ablation Study
- Removing any of the components degrades performance

Effect of CFG Guidance Scales
- Speaker similarity and emotion accuracy increase as the guidance scale increases
