[Paper Review] IndexTTS2: A Breakthrough in Emotionally Expressive and Duration-Controlled Auto-Regressive Zero-Shot Text-to-Speech

feVeRin 2026. 4. 2. 13:42

IndexTTS2: A Breakthrough in Emotionally Expressive and Duration-Controlled Auto-Regressive Zero-Shot Text-to-Speech

Large-scale autoregressive Text-to-Speech models generate speech token by token, which makes the duration of the synthesized speech hard to control
IndexTTS2
- Controls duration either by explicitly specifying the number of tokens or by generating freely in an autoregressive manner
- Supports timbre and emotion control by disentangling emotional expression from speaker identity, and incorporates GPT latent representations to improve clarity and expression
- Additionally introduces a soft instruction mechanism based on text descriptions, built on Qwen3
Paper (AAAI 2026) : Paper Link

1. Introduction

Autoregressive zero-shot Text-to-Speech (TTS) models such as XTTS and CosyVoice achieve strong naturalness and expressiveness by using random sampling strategies and token-by-token generation
- BUT, the sequential nature of autoregressive generation limits duration control
- In addition, most zero-shot TTS models have limited emotional expression because emotional datasets are scarce
  - To address this, one can map emotional audio to natural language descriptions with CLAP or apply instruction fine-tuning, but control precision remains limited

-> The paper therefore proposes IndexTTS2, which improves duration control and emotional expressiveness for autoregressive models

IndexTTS2
- Introduces a Text-to-Semantic (T2S) module that generates semantic tokens from text, a style prompt, and speech tokens, and a Semantic-to-Mel (S2M) module that generates a mel-spectrogram from those semantic tokens
- Additionally builds a Text-to-Emotion (T2E) module by distilling the emotion distribution prediction ability of DeepSeek-r1 into Qwen3-1.7b and combining the predicted probabilities with emotion embeddings

< Overview of IndexTTS2 >

An emotion-controllable autoregressive zero-shot TTS model built from T2S, S2M, and T2E modules
As a result, it achieves better performance than existing zero-shot TTS models

2. Method

IndexTTS2 consists of a Text-to-Semantic (T2S) module, a Semantic-to-Mel (S2M) module, and a vocoder
- The T2S module generates semantic tokens from the target text, a style/timbre prompt, and an optional speech token count, and the S2M module predicts a mel-spectrogram from those tokens and the timbre prompt
  - The BigVGAN vocoder then converts the mel-spectrogram into a speech waveform
- A Text-to-Emotion (T2E) module is added for natural language-based emotional control, so flexible emotional TTS is supported through either an explicit natural language instruction or a reference audio input

- Autoregressive Text-to-Semantic Module (T2S)

The paper formulates T2S as an autoregressive semantic token prediction task
- The input sequence is given as $\left[c,p,e_{\langle BT\rangle}, E_{text}, e_{\langle BA\rangle}, E_{sem}\right]$
  - $c$ : speaker attribute, $p$ : duration control, $E_{text}$ : text embedding, $E_{sem}$ : semantic token embedding
- $e_{\langle BT\rangle}, e_{\langle BA\rangle}$ are dedicated boundary tokens that demarcate the text sequence and the semantic sequence, respectively
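As a rough illustration of the input layout above, the following sketch stacks the pieces along the time axis with NumPy. The dimension `D`, the sequence lengths, the helper name `build_t2s_input`, and the choice to let $c$ and $p$ each occupy a single position are assumptions made for this example, not details taken from the paper.

```python
import numpy as np

D = 8  # embedding dimension (illustrative value)
rng = np.random.default_rng(0)

def build_t2s_input(c, p, e_bt, E_text, e_ba, E_sem):
    """Stack [c, p, e_<BT>, E_text, e_<BA>, E_sem] along the time axis."""
    rows = [c[None, :], p[None, :], e_bt[None, :], E_text, e_ba[None, :], E_sem]
    return np.concatenate(rows, axis=0)

c = rng.normal(size=D)            # speaker attribute (assumed: one vector)
p = rng.normal(size=D)            # duration-control embedding
e_bt = rng.normal(size=D)         # boundary token before the text sequence
e_ba = rng.normal(size=D)         # boundary token before the semantic sequence
E_text = rng.normal(size=(5, D))  # 5 text token embeddings
E_sem = rng.normal(size=(12, D))  # 12 semantic token embeddings

seq = build_t2s_input(c, p, e_bt, E_text, e_ba, E_sem)
print(seq.shape)  # (21, 8): 1 + 1 + 1 + 5 + 1 + 12 positions
```

In an actual model the autoregressive decoder would consume `seq` and predict the next semantic token at each position after $e_{\langle BA\rangle}$.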
Duration Control
- Duration regulation is performed through a dedicated embedding $p$ computed from the target semantic token length $T$ as $p=W_{num}h(T)$
  - $W_{num}\in\mathbb{R}^{L_{speech}\times D}$ : embedding table, $L_{speech}$ : maximum semantic sequence length, $D$ : embedding dimension
  - $h(T)$ : returns the one-hot vector for $T$
- The paper imposes the constraint $W_{sem}=W_{num}$ between $W_{num}$ and the semantic positional embedding table $W_{sem}$
  - This lets the autoregressive system precisely align positional information with the target duration information during generation, so it can produce a sequence of the desired length
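A minimal sketch of the duration embedding, assuming 0-based indexing with $T < L_{speech}$ and small illustrative sizes. Since $W_{num}$ has shape $L_{speech}\times D$, the product with the one-hot vector is written here as a row lookup, `h(T) @ W_num`. The function names are hypothetical.

```python
import numpy as np

L_speech, D = 16, 4  # illustrative sizes, not the paper's values
rng = np.random.default_rng(0)

W_num = rng.normal(size=(L_speech, D))  # duration embedding table
W_sem = W_num                           # shared table: constraint W_sem = W_num

def one_hot(T, size):
    """h(T): one-hot vector with a 1 at index T."""
    h = np.zeros(size)
    h[T] = 1.0
    return h

def duration_embedding(T):
    """p = W_num h(T), computed as a row selection of the table."""
    return one_hot(T, L_speech) @ W_num

T = 10
p = duration_embedding(T)

# The one-hot product is just a table lookup.
assert np.allclose(p, W_num[T])
# With the shared table, p equals the positional embedding of position T.
assert np.allclose(p, W_sem[T])
```

Under the sharing constraint, the duration signal for length $T$ is the same vector as the positional embedding at position $T$, which is one way to read the alignment between position and target duration described above.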
Emotional Control
- Emotion synthesis is performed by integrating an emotion embedding $e$ into the input sequence as $\left[c+e, p,e_{\langle BT\rangle}, E_{text}, e_{\langle BA\rangle}, E_{sem}\right]$
  - $e$ : extracted from the style prompt using a Conformer-based emotion perceiver conditioner
- In particular, to capture the emotional rhythm representation:
  1. The speaker feature $c$ is first extracted by a pre-trained speaker perceiver conditioner and encodes timbral characteristics
  2. A gradient reversal layer (GRL) is then applied during training to minimize content overlap between $e$ and $c$ and improve feature disentanglement
    - This adversarial training pushes $e$ to capture only emotional and rhythmic attributes, enabling precise and robust control over global emotional prosody generation
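The effect of a gradient reversal layer can be sketched with hand-written forward and backward passes in NumPy. The toy linear speaker classifier `W`, the reversal strength `lam`, and the step size are hypothetical; this shows the general GRL mechanism, not the paper's implementation.

```python
import numpy as np

class GradientReversal:
    """Identity in the forward pass; scales the gradient by -lam in the backward pass."""
    def __init__(self, lam=1.0):
        self.lam = lam

    def forward(self, x):
        return x

    def backward(self, grad_output):
        return -self.lam * grad_output

def softmax(z):
    z = z - z.max()
    ez = np.exp(z)
    return ez / ez.sum()

def speaker_ce(e, W, speaker_id):
    """Cross-entropy of a linear speaker classifier applied to embedding e."""
    return -np.log(softmax(W @ e)[speaker_id])

rng = np.random.default_rng(0)
D, num_speakers = 6, 3
e = rng.normal(size=D)                  # emotion embedding (toy)
W = rng.normal(size=(num_speakers, D))  # toy speaker classifier weights
speaker_id = 1

grl = GradientReversal(lam=1.0)
h = grl.forward(e)
grad_logits = softmax(W @ h)
grad_logits[speaker_id] -= 1.0          # d(CE)/d(logits) = softmax - one_hot
grad_h = W.T @ grad_logits              # gradient at the classifier input
grad_e = grl.backward(grad_h)           # reversed gradient sent to the emotion encoder

loss_before = speaker_ce(e, W, speaker_id)
e_updated = e - 0.1 * grad_e            # ordinary descent step on the reversed gradient
loss_after = speaker_ce(e_updated, W, speaker_id)
print(loss_after > loss_before)         # True: e becomes less speaker-predictive
```

The classifier itself still receives the unreversed gradient and keeps trying to identify the speaker, while the encoder producing $e$ is pushed to remove speaker information.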
Training and Inference
- Each speaker in the dataset has two or more utterances, so different utterances from the same speaker are split into the prompt and the training target
  - For data diversity, random speed perturbation is then applied to both the real speech and the prompt using scaling coefficients $r_{1}, r_{2}$
- On this basis, the T2S module is trained in 3 stages:
  1. Stage 1
    - The module is first trained on the input sequence $\left[c, p, e_{\langle BT\rangle},E_{text}, e_{\langle BA\rangle}, E_{sem}\right]$ with the speaker embedding $c$ and the duration embedding $p$
    - To support both duration control and free-form generation, $p$ is randomly zeroed with $30\%$ probability
  2. Stage 2
    - Stage 2 refines the emotion control module using the modified input sequence $\left[c+e, p,e_{\langle BT\rangle}, E_{text}, e_{\langle BA\rangle}, E_{sem}\right]$, which includes the emotion embedding $e$
    - In this stage, the speaker perceiver conditioner that produces $c$ is frozen, while the emotion perceiver conditioner is trainable
    - To disentangle emotional expression from speaker identity, a GRL and a speaker classifier are applied, and the joint loss is defined as:
    (Eq. 1) $ \mathcal{L}_{AR}=-\frac{1}{T+1}\sum_{t=0}^{T}\log q(y_{t})-\alpha\log q(e)$
    - $y_{T}$ : end-of-sequence token $\langle EA\rangle$, $q(y_{t})$ : posterior probability of the semantic token, $q(e)$ : posterior probability of the target speaker given $e$, $\alpha$ : loss coefficient
  3. Stage 3
    - Stage 3 freezes all feature conditioners and fine-tunes on the full dataset to improve robustness
- At inference, duration control uses $p=W_{num}h(T)$ and free-form generation uses $p=0$
  - Emotional prosody is manipulated directly by setting the desired emotion vector $e$ as input
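Two pieces of the training recipe above are easy to check numerically: the random zeroing of $p$ in Stage 1 and the joint loss of Eq. 1. The sketch below assumes natural logarithms and takes already-computed probabilities as input. The function names and the example numbers are hypothetical.

```python
import numpy as np

def maybe_drop_duration(p, rng, drop_prob=0.3):
    """Stage 1: zero the duration embedding with probability drop_prob."""
    return np.zeros_like(p) if rng.random() < drop_prob else p

def ar_joint_loss(token_probs, q_e, alpha):
    """Eq. 1: mean negative log-likelihood over the T+1 tokens, minus alpha * log q(e)."""
    token_probs = np.asarray(token_probs, dtype=float)
    nll = -np.mean(np.log(token_probs))
    return nll - alpha * np.log(q_e)

# Toy example with T = 3: three semantic tokens plus the end-of-sequence token.
token_probs = [0.5, 0.25, 0.5, 1.0]  # q(y_t) assigned to the correct tokens
loss = ar_joint_loss(token_probs, q_e=0.5, alpha=0.5)
# Mean NLL is ln 2 and the speaker term adds 0.5 * ln 2, so loss = 1.5 * ln 2.

rng = np.random.default_rng(0)
p = np.ones(4)
p_train = maybe_drop_duration(p, rng)  # either p itself or an all-zero vector
```

At inference the same switch is made deterministically: pass the looked-up embedding for a fixed length, or zeros for free-form generation.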

- Semantic-to-Mel Module (S2M)

The S2M module uses a non-autoregressive framework based on flow matching
- The S2M module synthesizes the target mel-spectrogram from the prompt mel-spectrogram, speaker embedding, and semantic features
- In particular, GPT latent enhancement is introduced to improve the pronunciation of emotional speech
GPT Latent Enhancement
- The Conditional Flow Matching (CFM) model generates a mel-spectrogram for the semantic codes from the T2S module, conditioned on the speaker embedding and reference speech
- To mitigate slurring in emotional speech, the GPT latent feature $H_{GPT}$ is introduced
  1. Because $H_{GPT}$ encodes substantial textual and contextual information, the paper fuses $H_{GPT}$ with the semantic features through vector addition to obtain a context-enriched representation
  2. The fused feature is used as input to the S2M training process
Training and Inference
- During training, each input sentence is randomly split into a prompt segment and a target segment
  - The mel-spectrogram of the target segment is fully noised to form the source input
- Let $Q_{sem}$ denote the semantic tokens from the T2S module
  1. For pronunciation robustness, the GPT hidden state $H_{GPT}$ and the semantic tokens $Q_{sem}$ are randomly fused through an MLP with $50\%$ probability to obtain the final semantic representation $Q_{fin}$
    - Additionally, the speaker embedding is concatenated to $Q_{fin}$ for timbre consistency
  2. S2M is then optimized with the $L1$ loss between the predicted mel-spectrogram $y_{pred}$ and the target mel-spectrogram $y_{tar}$:
    (Eq. 2) $\mathcal{L}_{L1}=\frac{1}{F\cdot D}\sum_{f=1}^{F}\sum_{d=1}^{D}\left|(y_{pred})_{f,d} -(y_{tar})_{f,d}\right|$
    - $F$ : number of frames, $D$ : number of mel-frequency bins
- At inference, an ODE solver generates the mel-spectrogram from Gaussian noise, conditioned on the speaker embedding and the final semantic representation $Q_{fin}$
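The random fusion step and the loss of Eq. 2 can be sketched as follows. The exact fusion form is an assumption: here the MLP projects $H_{GPT}$ to the semantic feature size and the result is added to $Q_{sem}$, which is one reading that combines the "vector addition" and "through an MLP" descriptions. The two-layer ReLU MLP, all shapes, and the function names are hypothetical, and the speaker-embedding concatenation and the flow-matching model itself are left out.

```python
import numpy as np

def mlp(x, W1, b1, W2, b2):
    """Two-layer ReLU MLP applied frame by frame (assumed architecture)."""
    return np.maximum(x @ W1 + b1, 0.0) @ W2 + b2

def fuse(Q_sem, H_gpt, params, rng, fuse_prob=0.5):
    """With probability fuse_prob return Q_sem + MLP(H_gpt); otherwise Q_sem unchanged."""
    if rng.random() < fuse_prob:
        return Q_sem + mlp(H_gpt, *params)
    return Q_sem

def l1_mel_loss(y_pred, y_tar):
    """Eq. 2: mean absolute error over F frames and D mel bins."""
    return np.mean(np.abs(y_pred - y_tar))

rng = np.random.default_rng(0)
F, D_sem, D_gpt, H = 10, 8, 12, 16
Q_sem = rng.normal(size=(F, D_sem))
H_gpt = rng.normal(size=(F, D_gpt))
params = (rng.normal(size=(D_gpt, H)), np.zeros(H),
          rng.normal(size=(H, D_sem)), np.zeros(D_sem))

Q_fin = fuse(Q_sem, H_gpt, params, rng)  # fused or plain, chosen at random

y_pred = np.array([[1.0, 2.0], [3.0, 4.0]])
y_tar = np.array([[0.0, 2.0], [5.0, 4.0]])
print(l1_mel_loss(y_pred, y_tar))  # 0.75 = (1 + 0 + 2 + 0) / 4
```

Randomly skipping the fusion means the downstream model sees both enriched and plain semantic features during training, so it is not fully dependent on $H_{GPT}$.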

- Text-to-Emotion Module (T2E)

The paper introduces the T2E module for natural language emotion control
- First, assume 7 basic emotions $\mathcal{E}=\{\texttt{Anger}, \texttt{Happiness}, \texttt{Fear}, \texttt{Disgust}, \texttt{Sadness}, \texttt{Surprise}, \texttt{Natural}\}$
  - For each emotion $e_{i}\in\mathcal{E}$, embeddings are extracted from emotional audio samples using the pre-trained emotion perceiver of T2S, forming a fixed emotion embedding set $\mathcal{V}$
- The large language model DeepSeek-r1 is then used as the teacher to map a text input $t$ to a 7-dimensional emotion probability distribution:
  (Eq. 3) $p=\text{DeepSeek-r1}(t)\in\Delta^{7}$
  - $\Delta^{7}$ : 7-dimensional probability simplex, with $\sum_{i=1}^{7}p_{i}=1,\,\, p_{i}\geq 0$
- For efficient inference, Knowledge Distillation is applied to transfer the teacher's behavior to a smaller student model, Qwen3-1.7b
  1. Specifically, the paper builds 1000 text-distribution pairs from DeepSeek-r1 using the following 2 prompt types, and applies a classification prompt to each generated sentence to obtain its emotion distribution:
    - Descriptive: $\texttt{Please generate descriptive sentences that express}\,\, \{\textit{emotion}\}$
    - Script-like: $\texttt{Please generate script-like utterances that express}\,\, \{\textit{emotion}\}$
  2. Qwen3-1.7b is then fine-tuned with LoRA on this dataset
  3. The training objective minimizes the Cross-Entropy loss between the student prediction and the teacher distribution:
    (Eq. 4) $\min_{\phi}\mathbb{E}_{(t,p)\sim\mathcal{D}}\left[\text{CrossEntropy}\left(\text{Qwen-3}_{ \theta+\phi}(t),p\right)\right]$
    - $\theta$ : original parameters of Qwen3-1.7b, $\phi$ : LoRA parameters, $t$ : input text sample from dataset $\mathcal{D}$, $p$ : soft probability distribution produced by the teacher
- The emotion vector $e_{input}$ is obtained as a weighted average over the emotion embedding set $\mathcal{V}$:
  (Eq. 5) $e_{input}=\sum_{e\in\mathcal{E}}p_{e}\cdot \frac{1}{|\mathcal{V}_{e}|}\sum_{v\in\mathcal{V}_{e}}v$
- Finally, this emotion vector is fed to the T2S module as a prompt to generate speech with the desired emotional characteristics
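The soft-label cross-entropy of Eq. 4 and the weighted average of Eq. 5 can be sketched numerically. Random vectors stand in for the perceiver embeddings, the student is reduced to a vector of 7 logits, and the embedding size and function names are hypothetical. No LLM or LoRA fine-tuning is involved in this sketch.

```python
import numpy as np

EMOTIONS = ["Anger", "Happiness", "Fear", "Disgust", "Sadness", "Surprise", "Natural"]

def soft_cross_entropy(student_logits, teacher_probs):
    """Cross-entropy between a soft teacher distribution and softmax(student_logits)."""
    z = student_logits - np.max(student_logits)
    log_q = z - np.log(np.sum(np.exp(z)))
    return -np.sum(teacher_probs * log_q)

def emotion_vector(probs, embedding_sets):
    """Eq. 5: probability-weighted sum of the per-emotion mean embeddings."""
    return sum(probs[name] * np.mean(embedding_sets[name], axis=0) for name in probs)

rng = np.random.default_rng(0)
D = 4
# Stand-in for the fixed set V: 3 embeddings per emotion.
embedding_sets = {name: rng.normal(size=(3, D)) for name in EMOTIONS}

# A teacher-style distribution on the probability simplex.
teacher = np.array([0.05, 0.6, 0.05, 0.05, 0.05, 0.15, 0.05])
probs = dict(zip(EMOTIONS, teacher))

e_input = emotion_vector(probs, embedding_sets)

# A student that predicts the uniform distribution has loss ln 7 for any teacher.
uniform_loss = soft_cross_entropy(np.zeros(7), teacher)
```

When all probability mass sits on one emotion, `emotion_vector` reduces to the mean embedding of that emotion, and mixed distributions interpolate between the per-emotion means.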

3. Experiments

- Settings

Dataset : Emilia, ESD
Comparisons : MaskGCT, F5-TTS, CosyVoice2, Spark-TTS, IndexTTS

- Results

Overall, IndexTTS2 achieves the best performance among the compared models

It also performs well in emotional synthesis

IndexTTS2 enables better natural language-based emotion control

Duration-Specified Speech Synthesis
- IndexTTS2 shows a low token number error rate across various duration scales

IndexTTS2 maintains a low WER under duration control

It also performs well in terms of MOS
