[Paper Review] FlexiVoice: Enabling Flexible Style Control in Zero-Shot TTS with Natural Language Instructions


feVeRin 2026. 4. 13. 13:05

FlexiVoice: Enabling Flexible Style Control in Zero-Shot TTS with Natural Language Instructions

Zero-shot Text-to-Speech should support flexible style control
FlexiVoice
- Supports accurate, flexible style control through Progressive Post-Training
- In particular, it applies Direct Preference Optimization and multi-objective Group Relative Policy Optimization
Paper (ICLR 2026) : Paper Link

1. Introduction

Zero-shot Text-to-Speech (TTS), as in CosyVoice2 and IndexTTS2, aims to capture and reproduce a speaker's timbre from only a short reference speech
- Instruction-based models such as ControlSpeech can additionally specify the target style with a natural language instruction
  - BUT, these instruction-driven models are limited in instruction-following and timbre consistency
- In particular, flexible style control requires resolving the style-timbre-content conflict
  - This is because standard supervised training causes timbre leakage, where the model over-relies on the strong acoustic prior of the reference speech, and content leakage, where the model infers prosody from the text

-> Therefore, the paper proposes FlexiVoice, which ensures natural language instruction adherence in zero-shot TTS

FlexiVoice
- Pre-trains a Large Language Model (LLM) on FlexiVoice-Instruct, a large-scale dataset that includes natural language instructions
- Resolves modality conflicts through Progressive Post-Training (PPT), which consists of 3 stages: Multi-modality Direct Preference Optimization (DPO), Decoupling Group Relative Policy Optimization (GRPO), and Instruction GRPO

< Overview of FlexiVoice >

A zero-shot TTS model that leverages PPT and a large-scale speech-instruction dataset
As a result, it achieves better performance than existing models

2. Method

FlexiVoice builds on an LLM-based architecture similar to CosyVoice2 and IndexTTS2
- First, the speech tokenizer converts speech into discrete tokens, and the LLM core generates discrete speech tokens conditioned on the input text, natural-language instruction, and reference tokens
- The generated tokens are transformed into a mel-spectrogram through flow matching and finally converted into a waveform by a vocoder
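
The data flow above can be sketched as a chain of four components. This is only an illustrative sketch: the class name, field names, and type aliases are hypothetical placeholders and not FlexiVoice's actual API, and each component is passed in as a plain callable.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical type aliases, for illustration only.
Tokens = List[int]
Mel = List[List[float]]
Waveform = List[float]


@dataclass
class TTSPipeline:
    """Hypothetical wrapper mirroring the four components named in the post."""
    tokenize_speech: Callable[[Waveform], Tokens]           # speech tokenizer
    generate_tokens: Callable[[str, str, Tokens], Tokens]   # LLM core
    tokens_to_mel: Callable[[Tokens], Mel]                  # flow matching
    mel_to_wave: Callable[[Mel], Waveform]                  # vocoder

    def synthesize(self, text: str, instruction: str, reference: Waveform) -> Waveform:
        # 1) Reference speech -> discrete reference tokens.
        ref_tokens = self.tokenize_speech(reference)
        # 2) Text + instruction + reference tokens -> discrete speech tokens.
        speech_tokens = self.generate_tokens(text, instruction, ref_tokens)
        # 3) Speech tokens -> mel-spectrogram.
        mel = self.tokens_to_mel(speech_tokens)
        # 4) Mel-spectrogram -> waveform.
        return self.mel_to_wave(mel)
```

Passing the components as callables keeps the sketch independent of any specific tokenizer, LLM, or vocoder implementation.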

- Pre-Training

FlexiVoice-base is pre-trained primarily on the Emilia dataset
- For instruction-guided TTS, the paper builds the FlexiVoice-Instruct dataset, which consists of natural-language instructions covering diverse scenarios
  - Additionally, existing instruction-speech corpora and the NVSpeech dataset are incorporated to enrich the pre-training phase and to provide paralinguistic tags and expressive coverage
- During pre-training, only the LLM core is trained and the other modules are frozen
  1. The text and instruction are formatted according to the LLM input template, and the paired ground-truth speech is pre-processed into discrete tokens by the frozen speech tokenizer
    - These tokens are used to compute the loss on the generated tokens during pre-training
  2. For Emilia and NVSpeech, which have no explicit instruction, $\texttt{Speak the following text}$ is used as the default instruction
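
The default-instruction fallback can be sketched as follows. The tag-based layout in `build_llm_input` is a hypothetical stand-in, since the post does not give the actual LLM input template; only the default instruction string comes from the post.

```python
from typing import Optional

# Default instruction quoted in the post for data without an explicit instruction.
DEFAULT_INSTRUCTION = "Speak the following text"


def build_llm_input(text: str, instruction: Optional[str] = None) -> str:
    """Format one training example; the tag layout is an assumed placeholder."""
    has_instruction = instruction is not None and instruction.strip() != ""
    instr = instruction.strip() if has_instruction else DEFAULT_INSTRUCTION
    return f"<instruction>{instr}</instruction><text>{text}</text>"


# Emilia / NVSpeech style example (no instruction) vs. an instructed example.
print(build_llm_input("Hello there."))
print(build_llm_input("Hello there.", "Use Happy emotion to read it"))
```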

- Post-Training

To disentangle speaking style from timbre and support complex instruction-following, the paper introduces Progressive Post-Training (PPT)
- With pre-training alone, FlexiVoice-base is already capable of solid zero-shot TTS, but it is limited in handling multi-modality inputs and complex instructions
  - Thus, PPT targets robust multi-modality instruction TTS through a progressive curriculum
- PPT consists of 3 stages:
  - S1: Aligns the multi-modality controllability of the instruction and reference speech on controlled emotion-centric tasks with explicit labels
  - S2: Disentangles the timbre and style of the reference speech from the content and style of the target text
  - S3: Extends to ambiguous, complex real-world instructions
S1: Multi-Modality Controllability
- S1 uses DPO to strengthen control through the style instruction and the timbre reference
  - Here, the instruction is restricted to a template such as $\texttt{Use}\,\,\{label\}\,\, \texttt{emotion to read it}$, and the label is chosen from $\texttt{Neutral}$, $\texttt{Happy}$, $\texttt{Angry}$, $\texttt{Sad}$, $\texttt{Surprised}$
- For emotion-related tasks, paired preference data is obtained from a Speech Emotion Recognition dataset
  1. Accordingly, the paper uses the Emotional Speech Dataset (ESD): each datapoint is assigned a target emotion label following the instruction template, and the sentence spoken with the target emotion is taken as the preferred sample
  2. The identical sentence spoken with a different emotion is used as the dis-preferred sample, and a neutral sample from the same speaker is used as the reference speech
- DPO aligns the model's emotional output with the instruction and reference speech without an explicit reward model
  1. First, the preference dataset $\mathcal{D}$ is composed of triples $(x,y_{w},y_{l})$
    - $x$ : instruction, text, reference, $y_{w}$ : winner response that matches the instruction, $y_{l}$ : loser response
  2. Then the DPO loss is:
    (Eq. 1) $ \mathcal{L}_{DPO}(\pi_{\theta};\pi_{ref})=-\mathbb{E}_{(x,y_{w},y_{l})\sim \mathcal{D}} \left[\log\sigma\left(\beta\log\frac{\pi_{\theta}(y_{w}|x)}{\pi_{ref}(y_{w}|x)} - \beta\log\frac{\pi_{\theta}(y_{l}|x)}{\pi_{ref}(y_{l}|x)}\right)\right]$
    - $\pi_{\theta}$ : policy model, $\pi_{ref}$ : reference model, $\sigma$ : sigmoid function, $\beta$ : scaling hyperparameter
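
A minimal sketch of S1 under stated assumptions: `build_preference_triples` pairs parallel utterances into $(x, y_{w}, y_{l})$ records, and `dpo_loss` evaluates the per-example term of Eq. 1 from sequence log-probabilities. The `Utterance` record, the file paths, the choice of taking the loser from the same speaker, and the default `beta=0.1` are all assumptions made for illustration, not details given in the post.

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Optional

EMOTIONS = ("Neutral", "Happy", "Angry", "Sad", "Surprised")


@dataclass(frozen=True)
class Utterance:
    speaker: str
    sentence_id: str
    emotion: str
    path: str  # hypothetical audio file path


def build_preference_triples(utts: List[Utterance]) -> List[Dict[str, str]]:
    """Build (x, y_w, y_l) records from parallel emotional utterances."""
    by_key = {(u.speaker, u.sentence_id, u.emotion): u for u in utts}
    triples = []
    for win in utts:
        # Reference: a neutral sample of the same speaker (not the winner itself).
        ref: Optional[Utterance] = next(
            (u for u in utts
             if u.speaker == win.speaker and u.emotion == "Neutral" and u.path != win.path),
            None,
        )
        if ref is None:
            continue
        for other in EMOTIONS:
            # Loser: the identical sentence spoken with a different emotion.
            lose = by_key.get((win.speaker, win.sentence_id, other))
            if other == win.emotion or lose is None:
                continue
            triples.append({
                "instruction": f"Use {win.emotion} emotion to read it",
                "sentence_id": win.sentence_id,
                "reference": ref.path,
                "winner": win.path,
                "loser": lose.path,
            })
    return triples


def dpo_loss(pi_w: float, pi_l: float, ref_w: float, ref_l: float, beta: float = 0.1) -> float:
    """Per-example DPO loss (Eq. 1) from sequence log-probabilities.

    pi_* are log pi_theta(y|x), ref_* are log pi_ref(y|x); beta=0.1 is an assumed value.
    """
    margin = beta * ((pi_w - ref_w) - (pi_l - ref_l))
    # -log(sigmoid(margin)) equals softplus(-margin), written in a numerically stable form.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
```

When the policy and reference models agree, the margin is zero and the loss is $\log 2$; the loss drops below that as the policy raises the winner's likelihood relative to the loser's.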
S2: Decoupling of Reference Speech and Target Text
- S2 aims to improve FlexiVoice's decoupling capability when the speech reference or target text conflicts with the instruction
  - After DPO training, the model follows emotional instructions well under a neutral reference, but an emotion-laden reference/text conflicts with the target emotion of the instruction
- Therefore, the paper introduces a multi-objective GRPO formulation based on conflict training scenarios
  1. First, the reward $r_{ser}$ acts as a style constraint that penalizes the model when style leaks from the reference/text
    - $r_{sv}$ acts as a timbre constraint that preserves speaker identity
  2. As a result, the model is guided to decouple these factors by optimizing a joint advantage
- Given the emotion recognition probability score $r_{ser}\in(0,1)$ from Emotion2Vec and the speaker verification result $r_{sv}\in\{0,1\}$ from CAM++
  1. The multi-objective advantage is:
    (Eq. 2) $A^{i}_{emo}=\frac{r_{ser}^{i}-\text{mean}(r_{ser}^{i})}{\text{std}(r_{ser}^{i})} +\frac{r_{sv}^{i}-\text{mean}(r_{sv}^{i})}{\text{std}(r_{sv}^{i})}$
  2. Here, $i$ indexes the $i$-th completion among the $K$ candidates for the same input $x$, and the mean and standard deviation are taken over those $K$ candidates
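
Eq. 2 can be sketched as a sum of two group-normalized rewards. The small `eps` term and the use of the population standard deviation are assumptions added here; `eps` avoids division by zero when all candidates in a group share the same reward, which is easy to hit with the binary $r_{sv}$.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def normalize(rewards: Sequence[float], eps: float = 1e-8) -> List[float]:
    """Group-relative normalization over the K candidates of one input."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def emotion_advantages(r_ser: Sequence[float], r_sv: Sequence[float]) -> List[float]:
    """A_emo^i from Eq. 2: normalized style reward plus normalized timbre reward."""
    if len(r_ser) != len(r_sv):
        raise ValueError("r_ser and r_sv must cover the same K candidates")
    return [a + b for a, b in zip(normalize(r_ser), normalize(r_sv))]


# K = 4 candidates for one input x (made-up reward values for illustration).
adv = emotion_advantages(r_ser=[0.9, 0.1, 0.5, 0.5], r_sv=[1, 1, 0, 0])
print(adv)
```

Normalizing each reward separately before summing keeps the continuous $r_{ser}$ and the binary $r_{sv}$ on a comparable scale, so neither objective dominates the joint advantage.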
S3: Enhancement on Complex Instruction-Following
- S3 aims to improve instruction following for complex, real-world directives
  - Since paired preference data is hard to obtain in this setting, the paper directly employs GRPO
- To this end, the paper adopts Kimi-Audio-7B-Instruct as the reward model and prompts it to output a binary yes/no decision on whether the generated speech matches the instruction
  - That is, the decision is mapped to a reward $r_{llm}\in\{0,1\}$
- In S3, the reference can conflict with open-ended constraints and destabilize training, so the reference is discarded and only the instruction and text are used as input
  1. Additionally, to prevent catastrophic forgetting, a portion of the S2-GRPO data is mixed in to form the final multi-task, multi-objective GRPO optimization
  2. Then the single-objective advantage and the final advantage are:
    (Eq. 3) $A_{ins}^{i}=\frac{r_{llm}^{i}-\text{mean}(r_{llm}^{i})}{\text{std}(r_{llm}^{i})}, \,\,\, A^{i}=\left\{\begin{matrix}
    A_{emo}^{i}, & \text{for inputs in S2} \\
    A_{ins}^{i}, & \text{for inputs in S3} \\
    \end{matrix}\right.$
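
A minimal sketch of the reward mapping and the case split in Eq. 3. The string parsing in `llm_reward`, the `stage` tags, the reward dictionary keys, and the `eps` term are hypothetical choices made for this sketch; the post only states that the judge outputs a binary yes/no decision.

```python
from statistics import mean, pstdev
from typing import Dict, List, Sequence


def llm_reward(answer: str) -> int:
    """Map the judge's yes/no answer to r_llm in {0, 1} (assumed parsing rule)."""
    return 1 if answer.strip().lower().startswith("yes") else 0


def normalize(rewards: Sequence[float], eps: float = 1e-8) -> List[float]:
    """Group-relative normalization over the K candidates of one input."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def final_advantages(stage: str, rewards: Dict[str, Sequence[float]]) -> List[float]:
    """A^i from Eq. 3: A_emo for S2 inputs, A_ins for S3 inputs."""
    if stage == "S2":
        ser, sv = normalize(rewards["r_ser"]), normalize(rewards["r_sv"])
        return [a + b for a, b in zip(ser, sv)]
    if stage == "S3":
        return normalize(rewards["r_llm"])
    raise ValueError(f"unknown stage: {stage}")


# One S3 group of K = 4 candidates judged by the reward model (made-up answers).
r_llm = [llm_reward(a) for a in ["Yes", "no", "No.", "yes, it matches"]]
print(final_advantages("S3", {"r_llm": r_llm}))
```

Tagging each input with its stage lets a single training batch mix S2 and S3 groups, which is one way to realize the data mixing described above.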

3. Experiments

- Settings

Dataset : FlexiVoice-Instruct
Comparisons : PromptStyle, PromptTTS, CosyVoice2, VoxInstruct, Parler-TTS

- Results

Overall, FlexiVoice achieves the best performance

It also performs well in terms of multi-modality control

Complex Instruction-Following Ability
- FlexiVoice is capable of robust natural instruction following

Progressive Post-Training
- Performing post-training in the order S1, S2, S3 yields the best results
