[Paper Review] Emo-DPO: Controllable Emotional Speech Synthesis through Direct Preference Optimization

feVeRin 2025. 3. 17. 08:43

Emo-DPO: Controllable Emotional Speech Synthesis through Direct Preference Optimization

Emotional Text-to-Speech mainly relies on supervised training to convert text and a desired emotion into emotional speech
- BUT, because it only learns the correct emotional output, it fails to capture the nuances between emotions
Emo-DPO
- Uses Direct Preference Optimization, which optimizes toward the preferred emotion, to differentiate emotional nuances
- Leverages an emotion-aware Large Language Model to bring in in-context learning and instruction-following capabilities
Paper (ICASSP 2025) : Paper Link

1. Introduction

Emotional Text-to-Speech (TTS) aims to generate human-like speech from text and a desired emotional tone
- An emotional TTS model should therefore account for nuanced expression such as stress, intonation, and rhythm
- BUT, most emotional TTS models rely on architectures such as FastSpeech and VITS, or on diffusion-based/flow-matching models such as EmoDiff
  1. As a result, they cannot exploit the in-context learning or instruction-following capabilities that come with Large Language Model (LLM) integration
  2. Meanwhile, LLM-based models such as CLaM-TTS, VALL-E, and CosyVoice model speech tokens effectively and show high-quality zero-shot synthesis
- In particular, the supervised learning used to train existing emotional TTS models focuses on a single emotion per instance
  - This hinders model control over multiple emotions and fails to capture the subtle differences between emotions

-> Emo-DPO is therefore proposed: it uses an LLM and Direct Preference Optimization in emotional TTS to capture nuanced distinctions

Emo-DPO
- Integrates an emotion-aware LLM into the emotional TTS model
- Additionally uses Direct Preference Optimization (DPO) to distinguish emotional preferences effectively and improve emotional expressiveness

< Overview of Emo-DPO >

An emotional TTS model built on an emotion-aware LLM and DPO
As a result, it achieves better emotion controllability than existing methods

2. Method

Emo-DPO aims to improve emotional TTS through Direct Preference Optimization (DPO) on top of an LLM-based TTS neural architecture

- Overview

Emo-DPO synthesizes emotional speech from text, a speaker x-vector, and a desired emotion as input
- Overall, through instruction tuning and integration with an emotion-aware LLM-TTS, it optimizes the likelihood of generating the speech token sequence that corresponds to the specified emotional prompt in pre-defined instruction data
- At inference, Emo-DPO generates speech tokens from the text, desired emotion, and speaker x-vector, then produces emotional speech with a frozen flow-matching model and a frozen vocoder

- Instruction Tuning

In the first stage, to exploit the instruction-following and in-context learning capabilities of the LLM, the LLM-TTS $\pi$ is supervised fine-tuned on parallel emotional text-to-speech data $D_{sft}$
- The data is formatted with the following instruction template:
  (Eq. 1) $d_{j}\in D_{sft}=E.\text{<endofprompt>}x_{j}\text{</s>}y_{j}^{+}\text{</s>}$
  - $E$ : an emotion prompt word such as Happy or Angry
  - $x_{j}, y_{j}^{+}$ : the text token sequence, the speech token sequence corresponding to $E$
  - $\text{<endofprompt>},\text{</s>}$ : special token indicating the end of the emotion trigger, separator token
- A speech tokenizer extracts the speech token sequence, and the LLM-TTS model, composed of a text encoder and an LLM-based decoder, predicts the probability distribution over emotional speech tokens
- The paper applies a label-smoothing Kullback-Leibler (KL) loss to minimize the divergence between the probability distribution $P_{\pi}$ induced by $\pi$ and the target distribution $P$:
  (Eq. 2) $\mathcal{L}_{KL}=\text{KL}(P_{\pi}||P)=\mathbb{E}_{d_{j}\sim D_{sft}}\left[p\left(y_{j}^{+}|E,x_{j}\right)\log\frac{p\left(y_{j}^{+}|E,x_{j}\right)}{p_{\pi}\left(y_{j}^{+}|E,x_{j}\right)}\right]$
  - This lets $\pi$ generate speech token sequences aligned with the specified emotional prompt for the input text, so that the generated speech reflects the desired emotion indicated by $E$
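
As a rough illustration of the label-smoothing KL loss in (Eq. 2), the sketch below evaluates it at a single speech-token position in plain Python. The smoothing scheme (mass `1 - eps` on the gold token, `eps` spread uniformly over the remaining tokens), the value `eps=0.1`, and all function names are assumptions made for this example; the review does not state which smoothing variant or value the paper uses.

```python
import math

def label_smoothed_target(vocab_size, gold, eps=0.1):
    # Assumed smoothing scheme: 1 - eps on the gold token,
    # eps shared equally by the other vocab_size - 1 tokens.
    off = eps / (vocab_size - 1)
    return [1.0 - eps if i == gold else off for i in range(vocab_size)]

def softmax(logits):
    # Numerically stable softmax over one position's logits.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_loss(logits, gold, eps=0.1):
    # sum_i p_i * log(p_i / p_pi_i): the bracketed term of (Eq. 2),
    # with p the smoothed target and p_pi the model distribution.
    p = label_smoothed_target(len(logits), gold, eps)
    q = softmax(logits)
    return sum(p_i * math.log(p_i / q_i) for p_i, q_i in zip(p, q) if p_i > 0)

# Toy 4-token speech vocabulary, gold token index 2.
good = kl_loss([0.0, 0.0, 5.0, 0.0], gold=2)
bad = kl_loss([5.0, 0.0, 0.0, 0.0], gold=2)
print(good < bad)
```

The loss is zero only when the model distribution equals the smoothed target, so the model is discouraged from putting all of its mass on a single speech token.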

- Emo-Direct Preference Optimization Training

Instruction tuning of $\pi$ alone only teaches the model to generate the correct output
- Preference learning is therefore introduced so that the model can capture the subtle differences between the desired emotional speech and speech in other emotions with the same semantic content
- In particular, DPO lets the model learn directly from preference data, so that the generated speech aligns closely with the intended emotional nuance
DPO Training
- First, to build pairwise preference data for Emo-DPO fine-tuning, treat $d_{j}$ as the positive instance
  1. For the negative instance, another instance that shares the same text input $x_{j}$ but has a different emotional speech output is sampled from the training data
  2. The paired data $(d_{j}^{+},d_{j}^{-})\in D_{pref}$ is then formulated as $E.\text{<endofprompt>}x_{j}\text{</s>}y_{j}^{+}\text{</s>}$, $E.\text{<endofprompt>}x_{j}\text{</s>}y_{j}^{-}\text{</s>}$
- Let $\pi_{sft}$ denote the LLM-TTS model after first-stage instruction tuning; given the pairwise dataset $D_{pref}$ and the LLM-TTS $\pi$ to optimize, the DPO objective is:
  (Eq. 3) $\mathcal{L}_{DPO}(\pi;\pi_{sft})=-\mathbb{E}_{(d_{j}^{+},d_{j}^{-})\sim D_{pref}}\left[\log \sigma\left(\beta\log \frac{\pi(y_{j}^{+}|E,x_{j})}{\pi_{sft}(y_{j}^{+}|E,x_{j})}-\beta\log \frac{\pi(y_{j}^{-}|E,x_{j})}{\pi_{sft}(y_{j}^{-}|E,x_{j})}\right)\right]$
  - $\pi$ : initialized from $\pi_{sft}$, $\pi()$ : the conditional probability that $\pi$ generates the output sequence
  - $\beta$ : hyperparameter that modulates how sharply $\pi$ prefers $y_{j}^{+}$ over $y_{j}^{-}$
  - $\sigma$ : sigmoid function
- Conditioned on $x_{j}$ and the emotion trigger word $E$, the DPO objective raises the likelihood that $\pi$ generates $y_{j}^{+}$ and lowers the likelihood that it generates $y_{j}^{-}$, each measured relative to $\pi_{sft}$
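
The pair construction and (Eq. 3) can be sketched in plain Python as follows. This is only a toy illustration: the corpus layout `(emotion, text_id, speech_tokens)`, the helper names, the sequence log-probabilities, and `beta=0.1` are all hypothetical and are not taken from the paper or its code.

```python
import math
import random

def format_instance(emotion, text_tokens, speech_tokens):
    # Layout of (Eq. 1): E.<endofprompt> x </s> y </s>
    return [f"{emotion}.", "<endofprompt>", *text_tokens, "</s>", *speech_tokens, "</s>"]

def build_pairs(corpus, rng):
    # corpus: list of (emotion, text_id, speech_tokens) tuples (hypothetical layout).
    # The negative shares the text but was recorded in a different emotion.
    pairs = []
    for emotion, text_id, y_pos in corpus:
        negatives = [y for e, t, y in corpus if t == text_id and e != emotion]
        if negatives:
            pairs.append((emotion, text_id, y_pos, rng.choice(negatives)))
    return pairs

def softplus(x):
    # Stable log(1 + exp(x)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def dpo_loss(logp_pos, logp_neg, ref_logp_pos, ref_logp_neg, beta=0.1):
    # Inputs are sequence log-probabilities under pi and under the frozen pi_sft.
    margin = beta * ((logp_pos - ref_logp_pos) - (logp_neg - ref_logp_neg))
    # -log(sigmoid(margin)) == softplus(-margin)
    return softplus(-margin)

corpus = [("Happy", "t1", ["h1"]), ("Angry", "t1", ["a1"]), ("Sad", "t2", ["s1"])]
pairs = build_pairs(corpus, random.Random(0))
print(len(pairs))                        # "t2" has no other emotion, so it yields no pair
print(dpo_loss(-5.0, -7.0, -5.0, -7.0))  # pi == pi_sft gives log(2)
```

Because $\pi$ starts from $\pi_{sft}$, both log-ratios are zero at the first step and the loss equals $\log 2$; it only drops once $\pi$ moves probability from $y_{j}^{-}$ toward $y_{j}^{+}$ relative to the reference.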
Emo-DPO Training Objective
- The paper introduces two regularization strategies to stabilize training
- First, a Jensen-Shannon (JS) divergence manipulation is applied to the DPO objective:
  (Eq. 4) $\text{logits}=\text{logratio}_{chosen}-\text{logratio}_{reject}=\log\left(\frac{\pi(y_{j}^{+}|E,x_{j})}{\pi_{sft}(y_{j}^{+}|E,x_{j})}\right)-\log \left(\frac{\pi(y_{j}^{-}|E,x_{j})}{\pi_{sft}(y_{j}^{-}|E,x_{j})}\right), $
  $\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\, \text{JSD}=\log(1+e^{\text{logratio}_{chosen}})-\log(1+e^{\text{logratio}_{reject}}), $
  $\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\text{logits}=\text{logits}-\text{JSD},$
  $\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\mathcal{L}_{DPO}(\pi;\pi_{sft})=-\mathbb{E}_{(d_{j}^{+},d_{j}^{-})\sim D_{pref}}[\log\sigma(\beta\cdot \text{logits})]$
  - (Eq. 4) smooths the optimization process and prevents extreme logit differences, improving training stability
  - In particular, the bounded, symmetric nature of the JS divergence provides balanced, interpretable preference learning
- Next, the JS-regularized DPO objective, the label-smoothing KL objective defined in the first stage of instruction tuning, and an additional SFT objective are jointly optimized
  1. That is, the total loss is:
    (Eq. 5) $\mathcal{L}=\alpha\mathcal{L}_{DPO}+\gamma\mathcal{L}_{KL}+\theta\mathcal{L}_{SFT}$
    - $\mathcal{L}_{SFT}=-\log \left(\pi(y_{j}^{+}|E,x_{j})\right)$
    - $\alpha,\gamma, \theta$ : hyperparameters controlling the strength of each loss term
  2. Both the label-smoothing KL loss and the SFT loss keep the model aligned with the pre-trained LLM-TTS distribution while it progressively adapts to task-specific emotional speech generation
  3. The JS-regularized DPO loss has the model learn nuanced preferences from pairwise comparisons, guiding it toward refined, emotionally aligned output
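
A minimal sketch of (Eq. 4) and (Eq. 5) for a single preference pair, written in plain Python, might look like the following. The function names, `beta=0.1`, and the unit default weights for `alpha`, `gamma`, `theta` are placeholders chosen for the example, not the paper's settings.

```python
import math

def softplus(x):
    # Stable log(1 + exp(x)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def adjusted_logits(logratio_chosen, logratio_reject):
    # (Eq. 4): logits = chosen - reject, then subtract the JSD term.
    logits = logratio_chosen - logratio_reject
    jsd = softplus(logratio_chosen) - softplus(logratio_reject)
    return logits - jsd

def js_regularized_dpo(logratio_chosen, logratio_reject, beta=0.1):
    # -log(sigmoid(beta * logits)) == softplus(-beta * logits)
    return softplus(-beta * adjusted_logits(logratio_chosen, logratio_reject))

def total_loss(l_dpo, l_kl, l_sft, alpha=1.0, gamma=1.0, theta=1.0):
    # (Eq. 5) with placeholder weights.
    return alpha * l_dpo + gamma * l_kl + theta * l_sft

# A very large chosen log-ratio no longer yields a very large margin.
print(adjusted_logits(50.0, 0.0))
print(total_loss(js_regularized_dpo(1.0, -1.0), l_kl=0.3, l_sft=2.0))
```

Since $a-\log(1+e^{a})=\log\sigma(a)$, the adjusted logits in (Eq. 4) as written reduce to $\log\sigma(\text{logratio}_{chosen})-\log\sigma(\text{logratio}_{reject})$. The chosen-side term is capped at 0, which matches the description of (Eq. 4) preventing extreme logit differences.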

3. Experiments

- Settings

Dataset : ESD
Comparisons : CosyVoice, EmoSpeech

- Results

Overall, Emo-DPO achieves the best performance

In terms of MOS, Emo-DPO also performs best

In the AB test, Emo-DPO is also the most preferred

AB Test (left) CosyVoice vs. Emo-DPO (right) EmoSpeech vs. Emo-DPO

Ablation Study
- In the ablation study, removing each component degrades performance
