[Paper Review] RRPO: Robust Reward Policy Optimization for LLM-based Emotional TTS


feVeRin 2026. 3. 12. 12:58

RRPO: Robust Reward Policy Optimization for LLM-based Emotional TTS

For nuanced tasks such as emotion control, existing reward optimization methods suffer from reward hacking
RRPO
- Uses hybrid regularization so that the reward signal is reliably aligned
- In particular, it makes the policy abandon detrimental shortcuts and learn the complex features of emotion
Paper (ICASSP 2026) : Paper Link

1. Introduction

Systems such as CosyVoice2 achieve strong Text-to-Speech (TTS) performance by using a Large Language Model (LLM), but they remain limited in rich expressiveness, as required by emotional TTS
- To address this, EmoDiff, ZET-Speech, EmoMix, and others rely on Supervised Fine-Tuning (SFT)
  - BUT, SFT depends heavily on the diversity of the training data
- Meanwhile, Emo-DPO and DiffRO use Reinforcement Learning (RL) to perform preference alignment in LLM-based TTS systems
  1. In particular, DiffRO backpropagates gradients from a multi-task Reward Model (RM) to the policy model, which reduces the high variance of conventional Policy Gradient (PG) methods
  2. BUT, DiffRO's direct, deterministic optimization depends heavily on the quality of the RM, so reward hacking can occur, in which the policy model exploits the RM
    - That is, the policy model can learn a detrimental shortcut by generating non-semantic acoustic artifacts that fool the RM into assigning a high reward

-> RRPO is therefore proposed as a robust reward optimization method that mitigates reward hacking

RRPO
- Introduces hybrid regularization to build a robust RM that aligns with human perception
- Specifically, it corrects overconfidence, brittle decision boundaries, and perturbation sensitivity

< Overview of RRPO >

A reward optimization method that improves the expressiveness of LLM-based TTS through hybrid regularization
As a result, it outperforms existing methods such as DiffRO

2. Method

- Motivation: The Need for a Robust Reward Model

DiffRO provides a low-variance alternative to conventional PG methods through fully differentiable optimization
- DiffRO builds on the following two components:
  1. It treats the generated trajectory $\tau$ as a differentiable function $\tau(\theta)$ of the policy parameters $\theta$, using Gumbel-Softmax reparameterization
  2. It applies a differentiable RM so that the policy gradient is computed directly through the chain rule:
    (Eq. 1) $ \nabla_{\theta}J(\theta)=\nabla_{\theta}R(\tau(\theta))=\frac{\partial R}{\partial \tau}\cdot \frac{\partial \tau}{\partial \theta}$
    - $R(\tau(\theta))$ : reward defined as the negative of a differentiable loss function, such as a Speech Emotion Recognition (SER) loss
- That is, DiffRO supplies both a precise direction and a magnitude, which enables efficient, token-level policy optimization
- BUT, DiffRO can also amplify the flaws and biases of the RM through its analytical gradient
  - This produces incorrect gradients that can guide the policy toward exploitative solutions
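
The two components above can be sketched on a single token position. This is a minimal sketch under stated assumptions, not DiffRO's implementation: the linear reward weights `w`, the temperature, and all function names are hypothetical, and the toy reward only stands in for a real differentiable RM. It relaxes one categorical choice with Gumbel-Softmax and checks the chain-rule gradient of (Eq. 1) against finite differences.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def sample_gumbel(n, rng):
    """Draw n Gumbel(0, 1) samples via g = -log(-log(u)), u ~ Uniform(0, 1)."""
    return [-math.log(-math.log(rng.uniform(1e-12, 1.0 - 1e-12))) for _ in range(n)]


def relaxed_token(theta, noise, temperature):
    """tau(theta): Gumbel-Softmax relaxation of one categorical token choice."""
    return softmax([(t + g) / temperature for t, g in zip(theta, noise)])


def toy_reward(tau, w):
    """Hypothetical differentiable reward R(tau) = <w, tau> (stand-in for an RM)."""
    return sum(wk * tk for wk, tk in zip(w, tau))


def reward_gradient(theta, noise, temperature, w):
    """Chain rule of (Eq. 1): dR/dtheta = dR/dtau * dtau/dtheta.

    dR/dtau = w, and the softmax Jacobian is tau_i * (delta_ij - tau_j) / T,
    which collapses to tau_j * (w_j - <w, tau>) / T.
    """
    tau = relaxed_token(theta, noise, temperature)
    mean_w = toy_reward(tau, w)
    return [tj * (wj - mean_w) / temperature for tj, wj in zip(tau, w)]


def finite_difference(theta, noise, temperature, w, h=1e-6):
    """Central finite differences of R(tau(theta)), used only as a check."""
    grads = []
    for j in range(len(theta)):
        up = theta[:j] + [theta[j] + h] + theta[j + 1:]
        down = theta[:j] + [theta[j] - h] + theta[j + 1:]
        r_up = toy_reward(relaxed_token(up, noise, temperature), w)
        r_down = toy_reward(relaxed_token(down, noise, temperature), w)
        grads.append((r_up - r_down) / (2 * h))
    return grads


rng = random.Random(0)
theta = [0.2, -0.5, 1.0, 0.3]           # policy logits for one token position
w = [1.0, 0.0, -1.0, 0.5]               # hypothetical reward weights
noise = sample_gumbel(len(theta), rng)  # noise is held fixed, so tau depends only on theta

analytic = reward_gradient(theta, noise, 0.7, w)
numeric = finite_difference(theta, noise, 0.7, w)
print(max(abs(a - n) for a, n in zip(analytic, numeric)))  # expected to be tiny
```

Because the gradient flows through whatever the reward function rewards, any flaw in `toy_reward` would be followed just as precisely as a correct signal, which is the weakness the next section addresses.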

- Fine-Tuning with Hybrid Regularization

Reward hacking means that the policy model exploits the vanilla RM to maximize a spurious reward at the expense of true alignment
- To mitigate this, the paper performs a robust fine-tuning process on the pre-trained RM using a hybrid regularization scheme
  - This process uses a small human-annotated dataset to correct potential biases from pre-training
- In addition, complementary regularization techniques are introduced so that the RM neither overfits the small dataset nor develops new biases
Label Smoothing: Correcting Overconfidence
- The RM can be overconfident in its predictions
  - This is exacerbated in the SER task, where the ambiguous nature of human emotion is hard to capture
- The paper therefore introduces Label Smoothing (LS), which replaces the hard one-hot label $\mathbf{y}$ with a soft probability distribution $\mathbf{y}'$:
  (Eq. 2) $\mathbf{y}'_{k}=(1-\epsilon)\cdot \mathbf{y}_{k}+\frac{\epsilon}{K}$
  - $K$ : number of classes, $\epsilon$ : smoothing hyperparameter
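
As a worked instance of (Eq. 2), the sketch below smooths a one-hot label for $K=4$ classes with $\epsilon=0.1$. The function names and the example probabilities are illustrative, and `ls_loss` assumes that the LS loss is the usual cross-entropy against the soft target, which the review does not spell out.

```python
import math


def smooth_labels(one_hot, epsilon):
    """(Eq. 2): y'_k = (1 - epsilon) * y_k + epsilon / K."""
    k = len(one_hot)
    return [(1.0 - epsilon) * y + epsilon / k for y in one_hot]


def ls_loss(probs, soft_target):
    """Cross-entropy between predicted probabilities and a soft target (assumed form)."""
    return -sum(t * math.log(p) for t, p in zip(soft_target, probs))


hard = [0.0, 1.0, 0.0, 0.0]              # one-hot label, K = 4 emotion classes
soft = smooth_labels(hard, epsilon=0.1)  # true class gets about 0.925, the others 0.025
print(soft)

confident = [0.001, 0.997, 0.001, 0.001]   # near one-hot prediction
calibrated = [0.025, 0.925, 0.025, 0.025]  # prediction matching the soft target
print(ls_loss(confident, soft), ls_loss(calibrated, soft))
```

Against the smoothed target, the near one-hot prediction receives a higher loss than the prediction that matches the target, so extreme confidence is no longer the optimum.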
Energy-Adaptive Mixup: Correcting Brittle Decision Boundaries
- Reward hacking can also arise from brittle decision boundaries
  - To address this, the paper introduces Energy-Adaptive Mixup (EAM), a data augmentation method
- EAM calculates the mixing coefficient $\lambda$ from the relative energy and duration of the mixed speech segments
  1. The final loss is an adaptively weighted interpolation of the LS losses $\mathcal{L}_{LS}$ computed between the RM prediction and the original labels
  2. This encourages the RM to learn smooth transitions between datapoints, which corrects sharp, brittle decision boundaries and prevents the policy model from exploiting that vulnerability
- The paper first applies EAM to each batch of low-level acoustic features $\mathbf{F}$
  1. For each sample $\mathbf{f}_{i}$ and its paired sample $\mathbf{f}_{j}$, EAM produces a mixed feature vector $\mathbf{f}'_{i}$ and the corresponding energy-adaptive mixing coefficient $\lambda_{i}$
  2. The mixed features are then passed to a Transformer encoder to produce the high-level embedding $\mathbf{h}'$, from which the RM predicts the final output $\hat{\mathbf{y}}$
  3. The loss is then given by:
    (Eq. 3) $ \mathcal{L}_{emo}=\frac{1}{B}\sum_{i=1}^{B}\left[\left( 1-\lambda_{i}\right)\mathcal{L}_{LS}\left(\hat{\mathbf{y}}_{i},\mathbf{y}'_{i}\right)+\lambda_{i} \mathcal{L}_{LS}\left(\hat{\mathbf{y}}_{i},\mathbf{y}'_{j}\right)\right]$
    - $B$ : batch size, $\mathbf{y}'_{i}$, $\mathbf{y}'_{j}$ : smoothed labels of the sample and its paired sample
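
The interpolated loss can be sketched as follows. Two assumptions are made here. First, the review does not give the exact formula for $\lambda$, so `energy_adaptive_lambda` uses a hypothetical rule (the paired segment's share of the total energy, which also grows with duration). Second, the second term of (Eq. 3) is taken to use the paired sample's smoothed label $\mathbf{y}'_{j}$. The feature mixing itself and the Transformer encoder are omitted.

```python
import math


def ls_loss(probs, soft_target):
    """Cross-entropy between predicted probabilities and a soft target (assumed form)."""
    return -sum(t * math.log(p) for t, p in zip(soft_target, probs))


def segment_energy(frames):
    """Total energy of a segment; longer or louder segments score higher."""
    return sum(x * x for frame in frames for x in frame)


def energy_adaptive_lambda(f_i, f_j):
    """Hypothetical rule: weight of the paired sample = its share of total energy."""
    e_i, e_j = segment_energy(f_i), segment_energy(f_j)
    total = e_i + e_j
    return e_j / total if total > 0.0 else 0.0


def emo_loss(preds, soft_i, soft_j, lambdas):
    """(Eq. 3): batch mean of (1 - lam) * L_LS(pred, y'_i) + lam * L_LS(pred, y'_j)."""
    terms = [
        (1.0 - lam) * ls_loss(p, yi) + lam * ls_loss(p, yj)
        for p, yi, yj, lam in zip(preds, soft_i, soft_j, lambdas)
    ]
    return sum(terms) / len(terms)


# Toy segments: 3 frames vs. 1 frame of 2-dimensional features.
f_i = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # energy 4.0
f_j = [[1.0, 1.0]]                           # energy 2.0
lam = energy_adaptive_lambda(f_i, f_j)       # 2.0 / 6.0

pred = [[0.6, 0.3, 0.1]]                     # RM output for the mixed sample
y_i = [[0.9, 0.05, 0.05]]                    # smoothed label of sample i
y_j = [[0.05, 0.9, 0.05]]                    # smoothed label of paired sample j
print(lam, emo_loss(pred, y_i, y_j, [lam]))
```

With `lam = 0` the loss reduces to the plain LS loss on sample $i$, and with `lam = 1` to the LS loss on the paired sample, so intermediate values move the target smoothly between the two labels.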
Adversarial Training: Correcting Perturbation Sensitivity
- Because the policy model can learn subtle distortions that make the RM assign a higher reward, the paper introduces Adversarial Training
  - Specifically, the perturbation is applied to the high-level embedding $\mathbf{h}'$ based on the Fast Gradient Method
- The perturbation $\delta$ is computed by ascending along the normalized gradient of the loss:
  (Eq. 4) $\delta=\epsilon_{adv}\cdot\frac{\nabla_{\mathbf{h}'}\mathcal{L}_{emo}}{\left|\left| \nabla_{\mathbf{h}'}\mathcal{L}_{emo}\right|\right|_{2}}$
  - $\epsilon_{adv}$ : perturbation magnitude
- The adversarial sample is obtained by adding the perturbation to the embedding, $\mathbf{h}'_{adv}=\mathbf{h}'+\delta$
  - The perturbed embedding is used to compute the adversarial loss $\mathcal{L}_{adv}$ via (Eq. 3)
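
The perturbation step of (Eq. 4) can be sketched on a toy embedding. The gradient here is a hand-picked vector rather than one backpropagated through an RM, the function names are illustrative, and the zero-gradient guard is an addition for numerical safety that the review does not mention.

```python
import math


def fgm_perturbation(grad, eps_adv):
    """(Eq. 4): delta = eps_adv * grad / ||grad||_2.

    Returns a zero vector when the gradient vanishes (guard added for this sketch).
    """
    norm = math.sqrt(sum(g * g for g in grad))
    if norm == 0.0:
        return [0.0 for _ in grad]
    return [eps_adv * g / norm for g in grad]


def perturb_embedding(h, grad, eps_adv):
    """h_adv = h + delta, moving the embedding in the loss-increasing direction."""
    delta = fgm_perturbation(grad, eps_adv)
    return [hk + dk for hk, dk in zip(h, delta)]


h = [0.5, -1.0, 2.0]       # toy high-level embedding h'
grad = [3.0, 0.0, -4.0]    # toy gradient of L_emo w.r.t. h' (L2 norm 5)
h_adv = perturb_embedding(h, grad, eps_adv=0.1)
print(h_adv)               # roughly [0.56, -1.0, 1.92]
```

Normalizing by the L2 norm means the perturbation always has length $\epsilon_{adv}$, regardless of how large the raw gradient is.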
The Final Corrective Objective
- Hybrid regularization corrects the RM at three levels: label confidence, decision boundary, and perturbation sensitivity. The final SER loss is then given by:
  (Eq. 5) $\mathcal{L}_{ser}=\mathcal{L}_{emo}+\alpha\cdot \mathcal{L}_{adv}$
  - $\alpha$ : hyperparameter weighting the adversarial loss
- Minimizing this loss yields a corrected RM that is robust to exploitation
- This robustness ensures that the RM acts as a reliable guide during the policy optimization phase and thereby mitigates reward hacking

- Robust Reward Policy Optimization

RRPO uses the gradient of the robust reward to guide the policy with reliably aligned updates
- This mitigates reward hacking and yields a controllable, expressive emotional TTS system
- The final policy objective $J(\theta)$ is guided by the robust RM:
  (Eq. 6) $\nabla_{\theta}J(\theta)=\nabla_{\theta}R_{robust}(\tau(\theta))$
  - $R_{robust}$ : robust reward $-\mathcal{L}_{ser}$, obtained by negating the hybrid-regularized SER loss of (Eq. 5)
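- Taking the definition of $R_{robust}$ at face value, substituting (Eq. 5) into (Eq. 6) and applying the chain rule of (Eq. 1) gives the expansion below. This is a direct combination of the equations in this review, not a formula quoted from the paper:
  $\nabla_{\theta}J(\theta)=-\nabla_{\theta}\mathcal{L}_{ser}(\tau(\theta))=-\left(\frac{\partial \mathcal{L}_{emo}}{\partial \tau}+\alpha\cdot\frac{\partial \mathcal{L}_{adv}}{\partial \tau}\right)\cdot\frac{\partial \tau}{\partial \theta}$
  - The update has the same form as DiffRO's (Eq. 1); only the loss behind the reward changes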

3. Experiments

- Settings

Dataset : Mandarin speech dataset (internal)
Comparisons : DiffRO

- Results

Overall, RRPO achieves the best performance

Ablation Study
- Each component contributes to the performance improvement
