[Paper Review] EMORL-TTS: Reinforcement Learning for Fine-Grained Emotion Control in LLM-based TTS

feVeRin 2026. 3. 5. 12:57

EMORL-TTS: Reinforcement Learning for Fine-Grained Emotion Control in LLM-based TTS

Large Language Model-based Text-to-Speech models are limited in fine-grained emotional control
EMORL-TTS
- Unifies global intensity control in the Valence-Arousal-Dominance (VAD) space with local emphasis regulation
- In particular, combines supervised fine-tuning with reinforcement learning guided by task-specific rewards for emotion category, intensity, and emphasis
Paper (ICASSP 2026) : Paper Link

1. Introduction

Conventional emotional Text-to-Speech (TTS) relies on categorical emotion control, which makes it hard to capture emotional strength and subtle variation
- To address this, EmoMix, EmoDiff, EmoSphere-TTS, and others introduce emotion intensity modeling and mixed-emotion synthesis to achieve continuous emotion control
- BUT, in Large Language Model (LLM)-based TTS, discrete speech tokens make it difficult to model continuous emotion intensity directly

-> The paper therefore proposes EMORL-TTS, which improves the emotion controllability of LLM-based TTS

EMORL-TTS
- Integrates Supervised Fine-Tuning (SFT) with Group Relative Policy Optimization (GRPO)
- Introduces task-specific rewards that guide VAD-based intensity modeling

< Overview of EMORL-TTS >

An LLM-based emotional TTS model that controls VAD-based prosody continuously through SFT and GRPO
As a result, it outperforms prior methods

2. Method

- Overview

The paper builds on Spark-TTS, a single-stage LLM-based TTS model
- BiCodec, which jointly carries global acoustic traits and semantic information, is frozen, and only the LLM is adapted with a 2-stage post-training paradigm:
  1. Stage 1 performs SFT on emotion-annotated data to endow emotion-category controllability and to have the model expose intensity and emphasis cues
  2. Stage 2 performs GRPO-based reinforcement learning to improve fine-grained prosody control
- First, suppose we are given a text input $x$, an emotion category $c\in\{1,...,K\}$, a global intensity cue $r\in[0,1]$, and a local emphasis mask $m\in\{0,1\}$ that marks the emphasized tokens in $x$
  1. The model then autoregressively predicts a discrete speech token sequence $z=(z_{1},...,z_{T})$ under the trainable LLM policy $p_{\theta}$:
    (Eq. 1) $ p_{\theta}(z|x,c,r,m)=\prod_{t=1}^{T}p_{\theta}(z_{t}|z_{<t},x,c,r,m)$
  2. The frozen BiCodec decoder then synthesizes the waveform from the tokens as $\hat{y}=\text{BiCodecDecode}(z)$, and only the LLM parameters $\theta$ are updated during post-training
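
The factorization in (Eq. 1) can be read as a sum of per-token log-probabilities. Below is a minimal sketch of that reading, not the paper's code: `next_token_dist` is a hypothetical stand-in for the LLM policy $p_{\theta}$, and `toy_dist` and the `cond` dictionary are made up for illustration only.

```python
import math
from typing import Callable, Sequence

# Hypothetical interface: given the prefix z_<t and the conditioning
# (x, c, r, m), return a probability distribution over speech tokens.
NextTokenDist = Callable[[Sequence[int], dict], Sequence[float]]


def sequence_log_prob(z: Sequence[int], cond: dict,
                      next_token_dist: NextTokenDist) -> float:
    """log p(z | x, c, r, m) = sum_t log p(z_t | z_<t, x, c, r, m)  (Eq. 1)."""
    total = 0.0
    for t, token in enumerate(z):
        probs = next_token_dist(z[:t], cond)  # distribution for position t
        total += math.log(probs[token])
    return total


def toy_dist(prefix: Sequence[int], cond: dict) -> Sequence[float]:
    # Toy stand-in for the LLM: uniform over a 4-token vocabulary.
    return [0.25, 0.25, 0.25, 0.25]


# Illustrative conditioning values (assumptions, not from the paper).
cond = {"x": "I am so happy", "c": 2, "r": 0.8, "m": [0, 0, 1, 0]}
log_p = sequence_log_prob([3, 1, 0], cond, toy_dist)
```

With the uniform toy distribution, every token contributes `log(0.25)`, so the sequence log-probability is three times that value.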

- Stage 1: Emotion-Controllable SFT

The model is built on Spark-TTS, an LLM-based TTS model that uses BiCodec representations
- The attribute tokenizer is repurposed to accept emotion-category and discretized intensity control tokens, which are prepended to the text
- Intensity labels are obtained with a pre-trained VAD estimator: the Euclidean distance to the neutral centroid is discretized using category-specific thresholds
  - The resulting bin index is then mapped to an intensity token
- Stage 1 thus fine-tunes the LLM by minimizing token-level cross-entropy conditioned on these control tokens
  - This establishes emotion-category controllability and a calibrated intensity interface that reinforcement learning later uses
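
The intensity labeling step can be sketched as follows. This is only an illustration under stated assumptions: the threshold values, the three-bin layout, and the token names are hypothetical, and only the neutral centroid is taken from the value quoted in the Stage 2 reward description.

```python
import bisect
import math

# Neutral centroid in VAD space, as quoted in the Stage 2 intensity reward.
NEUTRAL_CENTROID = (3.8494, 4.2614, 3.9072)

# Hypothetical category-specific thresholds on the distance (assumption).
THRESHOLDS = {"happy": (1.0, 2.0), "sad": (0.8, 1.6)}

# Hypothetical intensity token names, one per bin (assumption).
BIN_TOKENS = ("<intensity_weak>", "<intensity_medium>", "<intensity_strong>")


def vad_distance(vad):
    """Euclidean distance between a VAD estimate and the neutral centroid."""
    return math.dist(vad, NEUTRAL_CENTROID)


def intensity_token(vad, category):
    """Discretize the distance with the category's thresholds, map to a token."""
    d = vad_distance(vad)
    bin_index = bisect.bisect_right(THRESHOLDS[category], d)
    return BIN_TOKENS[bin_index]


print(intensity_token((5.3494, 4.2614, 3.9072), "happy"))
```

In this sketch an utterance whose VAD estimate lies about 1.5 away from the neutral centroid falls into the middle bin for the hypothetical `happy` thresholds.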

- Stage 2: GRPO with Multi-Objective Rewards

Stage 2 casts emotion- and emphasis-controllable TTS as a sequential decision process
- Let the state $s\in \mathcal{S}$ consist of the input text and control tokens, the action $a\in\mathcal{A}$ be the generated sequence of speech tokens, and $\pi_{\theta}$ be the policy
- The training objective is then to maximize the expected reward $J(\theta)$, whose policy gradient is:
  (Eq. 2) $\nabla_{\theta}J(\theta)=\mathbb{E}_{s\sim \mathcal{D},a\sim \pi_{\theta}}\left[ R(s,a)\nabla_{\theta}\log \pi_{\theta}(a|s)\right]$
- GRPO
  1. For each prompt $s$, sample $K$ candidates $a^{(k)}\sim \pi_{\theta}(\cdot |s)$ (here $K$ is the group size, not the number of emotion categories), compute the rewards $R^{(k)}=R(s,a^{(k)})$, and form the group-relative advantage $A^{(k)}=R^{(k)}-\bar{R}$
    - $\bar{R}=\frac{1}{K}\sum_{j=1}^{K}R^{(j)}$
  2. GRPO optimizes a clipped-ratio objective with a KL anchor to the SFT policy $p_{SFT}$:
    (Eq. 3) $\mathcal{L}_{GRPO}(\theta)=\mathbb{E}\left[ \min\left(\rho^{(k)}A^{(k)}, \text{clip}\left( \rho^{(k)}, 1\pm\epsilon\right)A^{(k)}\right)\right]-\beta\text{KL}\left(\pi_{\theta}(\cdot |s) ||p_{SFT}(\cdot |s)\right)$
    - $\rho^{(k)}=\frac{\pi_{\theta}(a^{(k)}|s)}{p_{SFT}(a^{(k)}|s)}$
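
The group-relative advantage and the clipped objective of (Eq. 3) can be sketched in plain Python as below. This is a rough illustration, not the paper's implementation: the KL term is passed in as a precomputed number `kl_estimate`, and the defaults `eps=0.2` and `beta=0.04` are assumed values.

```python
import math


def group_relative_advantages(rewards):
    """A^(k) = R^(k) - mean(R) over the K candidates of one prompt."""
    mean_r = sum(rewards) / len(rewards)
    return [r - mean_r for r in rewards]


def clipped_surrogate(logp_new, logp_ref, advantages, eps=0.2):
    """Mean over the group of min(rho * A, clip(rho, 1 - eps, 1 + eps) * A)."""
    terms = []
    for lp_new, lp_ref, adv in zip(logp_new, logp_ref, advantages):
        rho = math.exp(lp_new - lp_ref)           # probability ratio
        rho_clipped = min(max(rho, 1.0 - eps), 1.0 + eps)
        terms.append(min(rho * adv, rho_clipped * adv))
    return sum(terms) / len(terms)


def grpo_objective(logp_new, logp_ref, advantages, kl_estimate,
                   beta=0.04, eps=0.2):
    """Clipped surrogate minus beta * KL (KL is supplied by the caller here)."""
    surrogate = clipped_surrogate(logp_new, logp_ref, advantages, eps)
    return surrogate - beta * kl_estimate


rewards = [7.0, 3.0, 2.0, 4.0]                    # made-up rewards for K = 4
advantages = group_relative_advantages(rewards)   # mean reward is 4.0
```

Because the advantages are centered on the group mean, candidates that score above the mean are pushed up and those below are pushed down, without a learned value function.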
- Emotion Classification Reward
  1. An Emotion2Vec-based SER classifier predicts $\hat{c}=\arg\max p(c|\hat{y})$
  2. To preserve the category controllability obtained from SFT, a large, sign-separated shaping is applied:
    (Eq. 4) $R_{ser}=\left\{\begin{matrix}
    +5, & \text{if}\,\,\hat{c}=c \\
    -1, & \text{otherwise} \\
    \end{matrix}\right.$
- Global Emotion Intensity Reward
  1. The paper uses the pre-trained VAD predictor from SFT to obtain $\mathbf{v}(\hat{y})\in[1,7]^{3}$ and computes its distance to the neutral centroid $\mu_{neu}=(3.8494,4.2614,3.9072)$:
    (Eq. 5) $d(\hat{y})=||\mathbf{v}(\hat{y})-\mu_{neu}||_{2}$
  2. $d(\hat{y})$ is discretized into the fixed bins $\{\texttt{weak}, \texttt{medium},\texttt{strong}\}$, and a hard match is combined with a smooth, bin-centered Gaussian:
    (Eq. 6) $R_{match}=\mathbf{1}\{\text{bin}(d)=r\}$
    $\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,R_{dist}=\exp\left(-\frac{(d-m_{r})^{2}}{2\sigma_{r}^{2}}\right)$
    $\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,R_{int}=R_{match}+R_{dist}$
    - $m_{r}$ : midpoint of the target bin, $\sigma_{r}$ : smoothness
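
A small sketch of (Eq. 5) and (Eq. 6) follows. The bin edges and the $\sigma_{r}$ values are hypothetical, since the actual values are not given here; only the neutral centroid comes from the text above. In this simplified version, distances beyond the last edge match no bin.

```python
import math

NEUTRAL_CENTROID = (3.8494, 4.2614, 3.9072)

# Hypothetical [low, high) edges on the distance d for each bin (assumption).
BIN_EDGES = {"weak": (0.0, 1.0), "medium": (1.0, 2.0), "strong": (2.0, 3.0)}

# Hypothetical smoothness per bin (assumption).
SIGMA = {"weak": 0.5, "medium": 0.5, "strong": 0.5}


def intensity_reward(vad, target_bin):
    """R_int = R_match + R_dist for a predicted VAD vector and a target bin."""
    d = math.dist(vad, NEUTRAL_CENTROID)              # Eq. 5
    low, high = BIN_EDGES[target_bin]
    r_match = 1.0 if low <= d < high else 0.0         # hard bin match
    midpoint = 0.5 * (low + high)                     # m_r
    sigma = SIGMA[target_bin]
    r_dist = math.exp(-((d - midpoint) ** 2) / (2.0 * sigma ** 2))
    return r_match + r_dist


vad = (5.3494, 4.2614, 3.9072)   # made-up estimate, about 1.5 from the centroid
r_medium = intensity_reward(vad, "medium")
r_weak = intensity_reward(vad, "weak")
```

The Gaussian term still gives a graded signal when the hard match fails: in this toy setup `r_weak` is small but nonzero, so samples closer to the target bin are preferred over samples further away.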
- Local Emphasis Control Reward
  1. NeMo Forced Aligner (NFA) provides word boundaries, and prosodic features are extracted with 20ms windows for each word $w\in \{w_{1},...,w_{N}\}$:
    (Eq. 7) $ f_{pitch}(w)=\max_{\tau\in w}\log F_{0}(\tau)$
    $\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\, f_{energy}(w)=\text{mean}_{\tau\in w}||\text{STFT}(\tau)||_{2}$
  2. Let $\mu_{pitch}, \mu_{energy}$ be the sentence-level means; then for each emphasized word $w^{*}$:
    (Eq. 8) $R_{hard}^{pitch}=\mathbf{1}\{f_{pitch}(w^{*})=\max_{w}f_{pitch}(w)\}$
    (Eq. 9) $R_{hard}^{energy}=\mathbf{1} \{ f_{energy}(w^{*})=\max_{w}f_{energy}(w)\}$
    (Eq. 10) $R_{soft}^{pitch}=\text{clip}_{[-1,1]}\left(\frac{f_{pitch}(w^{*})-\mu_{pitch}}{\mu_{pitch}}\right)$
    (Eq. 11) $R_{soft}^{energy}=\text{clip}_{[-1,1]}\left(\frac{f_{energy}(w^{*})-\mu_{energy}}{\mu_{energy}}\right)$
  3. The emphasis reward is then obtained as $R_{emp}=R_{hard}^{pitch}+R_{hard}^{energy}+R_{soft}^{pitch}+R_{soft}^{energy}$
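
Given per-word features, (Eq. 8) to (Eq. 11) reduce to a few comparisons. The sketch below assumes the features $f_{pitch}(w)$ and $f_{energy}(w)$ are already extracted, handles a single emphasized word, and takes the sentence-level mean as the mean over words, which is an assumption. It also assumes both means are positive.

```python
def clip(value, low=-1.0, high=1.0):
    """Clamp value to [low, high]."""
    return max(low, min(high, value))


def emphasis_reward(pitch, energy, emph_idx):
    """R_emp for one emphasized word.

    pitch[i], energy[i]: per-word f_pitch and f_energy (Eq. 7), precomputed.
    emph_idx: index of the emphasized word w* in the sentence.
    """
    mu_pitch = sum(pitch) / len(pitch)       # assumed: mean over words
    mu_energy = sum(energy) / len(energy)

    # Hard terms: is w* the word with the highest pitch / energy?
    r_hard_pitch = 1.0 if pitch[emph_idx] == max(pitch) else 0.0
    r_hard_energy = 1.0 if energy[emph_idx] == max(energy) else 0.0

    # Soft terms: relative margin over the sentence mean, clipped to [-1, 1].
    r_soft_pitch = clip((pitch[emph_idx] - mu_pitch) / mu_pitch)
    r_soft_energy = clip((energy[emph_idx] - mu_energy) / mu_energy)

    return r_hard_pitch + r_hard_energy + r_soft_pitch + r_soft_energy


# Made-up features for a 4-word sentence.
pitch = [5.0, 5.5, 5.0, 4.5]     # max log-F0 per word, mean 5.0
energy = [1.0, 4.0, 2.0, 1.0]    # mean STFT norm per word, mean 2.0
r_emphasized = emphasis_reward(pitch, energy, emph_idx=1)
r_flat = emphasis_reward(pitch, energy, emph_idx=3)
```

In this toy sentence, marking word 1 (highest pitch and energy) yields both hard terms plus positive soft terms, while marking word 3 yields a negative total, so the reward separates realized emphasis from missed emphasis.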
- Finally, EMORL-TTS sums the three reward terms above:
  (Eq. 12) $R=R_{ser}+R_{int}+R_{emp}$
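
The combination in (Eq. 4) and (Eq. 12) is a plain sum, sketched below. The function names and the example inputs are hypothetical; `r_int` and `r_emp` are assumed to be computed separately as in (Eq. 6) and (Eq. 8) to (Eq. 11).

```python
def ser_reward(predicted_category, target_category):
    """Eq. 4: +5 if the SER classifier agrees with the target, else -1."""
    return 5.0 if predicted_category == target_category else -1.0


def total_reward(predicted_category, target_category, r_int, r_emp):
    """Eq. 12: R = R_ser + R_int + R_emp."""
    return ser_reward(predicted_category, target_category) + r_int + r_emp


# Made-up values: same intensity and emphasis rewards, different SER outcome.
r_correct = total_reward("happy", "happy", r_int=2.0, r_emp=3.1)
r_wrong = total_reward("sad", "happy", r_int=2.0, r_emp=3.1)
```

From the definitions, $R_{ser}\in\{-1,+5\}$, $R_{int}\in(0,2]$, and $R_{emp}\in[-2,4]$, so a wrong emotion category costs 6 points regardless of how well intensity and emphasis are matched.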

3. Experiments

- Settings

Dataset : ESD
Comparisons : CosyVoice2, EmoSpeech, EmoSphere++

- Results

Overall, EMORL-TTS achieves the best performance among the compared models

It also performs well in subjective evaluation

It also achieves high naturalness

EMORL-TTS reflects emotion intensity effectively

It also achieves high accuracy in emphasis recognition

Effect of Part-of-Speech Emphasis on Emotion Intensity
- Emphasizing adverbs can yield stronger perceived intensity

Relationship between Part-of-Speech Emphasis and Intensity
