[Paper Review] EmoShift: Lightweight Activation Steering for Enhanced Emotion-Aware Speech Synthesis

feVeRin 2026. 5. 11. 10:35

EmoShift: Lightweight Activation Steering for Enhanced Emotion-Aware Speech Synthesis

Large language models (LLMs) are limited in modeling emotion-specific latent characteristics
EmoShift
- Learns a steering vector for each target emotion in the output embedding space
- Incorporates this EmoSteer layer to form a lightweight activation-steering framework
Paper (ICASSP 2026) : Paper Link

1. Introduction

In Text-to-Speech (TTS), emotion control is mainly performed through emotion embedding modulation
- In particular, Large Language Model (LLM)-based TTS systems such as CosyVoice and Emo-DPO perform emotion modulation through natural-language prompts
  - BUT, this approach is limited in interpretability and in encoding emotion-specific latent dynamics
- Activation steering, in contrast, enables lightweight and interpretable control over generative models

-> The paper therefore proposes EmoShift, which uses steering vectors to improve the emotion controllability of LLM-based TTS

EmoShift
- Introduces an EmoSteer layer that learns emotion-specific steering vectors in the output embedding space
- In particular, it encodes emotion-dependent latent offsets, which supports model-agnostic integration

< Overview of EmoShift >

An emotion-controllable LLM-based TTS paradigm built on steering vectors
As a result, it achieves better performance than existing methods

2. Method

- LLM-based Problem Formulation and Modeling Setup

The paper formulates emotion-aware TTS as a conditional autoregressive token generation task
- The model is conditioned on three information sources: a speaker embedding $\mathbf{s}\in\mathbb{R}^{d}$, a prompt embedding sequence $\{\mathbf{q}_{i}\}_{i=1}^{u}$ representing the emotion prompt $Q$, and text embeddings $\{\mathbf{x}_{j}\}_{j=1}^{n}$ derived from the speech script $X$
- The LLM generates a discrete speech token sequence $\{\mathbf{y}'_{k}\}_{k=1}^{m}$ until the special end-of-sequence token Ⓔ is predicted
  - These tokens are then decoded into the speech waveform $Y'$ using a flow matching-based vocoder
- To encode the conditioning information, the input sequence is constructed together with special tokens:
  (Eq. 1) $\left[Ⓢ,\mathbf{s},\{\mathbf{q}_{i}\}_{i=1}^{u},Ⓟ,\{\mathbf{x}_{j}\}_{j=1}^{n},Ⓣ,\{\mathbf{y}_{k}\}_{k=1}^{m},Ⓔ\right]$
  - Ⓢ, Ⓟ, Ⓣ, Ⓔ : start-of-sequence, end-of-prompt, turn-of-speech, and end-of-sequence tokens, respectively
- During training, the ground-truth tokens $\{\mathbf{y}_{k}\}$ are fed in via teacher forcing; at inference, generation proceeds autoregressively after Ⓣ without any ground-truth context:
  (Eq. 2) $P(\mathbf{y}'_{1:m})=\prod_{k=1}^{m}p(\mathbf{y}'_{k}\mid\mathbf{s},\{\mathbf{q}_{i}\},\{\mathbf{x}_{j}\},\mathbf{y}'_{<k})$
- The model is trained to minimize the negative log-likelihood of the ground-truth tokens:
  (Eq. 3) $\mathcal{L}=-\sum_{k=1}^{m}\log p(\mathbf{y}_{k}\mid\mathbf{s},\{\mathbf{q}_{i}\},\{\mathbf{x}_{j}\},\mathbf{y}_{<k})$
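The sequence layout of Eq. 1 and the teacher-forced loss of Eq. 3 can be sketched in a few lines of dependency-free Python. This is only an illustration under assumptions: the marker strings, the function names `build_input_sequence` and `teacher_forced_nll`, and the toy probabilities are hypothetical and not taken from the paper, and a real model would produce the per-step distributions with a Transformer rather than a hand-written table.

```python
import math

# Hypothetical marker strings standing in for the special tokens Ⓢ, Ⓟ, Ⓣ, Ⓔ.
SOS, EOP, TOS, EOS = "<S>", "<P>", "<T>", "<E>"


def build_input_sequence(speaker, prompt, text, speech_tokens):
    """Lay out conditioning items and target tokens in the order of Eq. 1."""
    return [SOS, speaker, *prompt, EOP, *text, TOS, *speech_tokens, EOS]


def teacher_forced_nll(step_probs, targets):
    """Negative log-likelihood in the spirit of Eq. 3.

    step_probs[k] maps each candidate token to the probability assigned at
    step k, assumed to be computed with the ground-truth prefix as context.
    """
    return -sum(math.log(probs[y]) for probs, y in zip(step_probs, targets))


# Toy example: placeholder strings for embeddings, integers for speech tokens.
seq = build_input_sequence("s", ["q1", "q2"], ["x1", "x2", "x3"], [7, 3])

# Made-up per-step distributions over a two-token vocabulary {7, 3}.
step_probs = [{7: 0.5, 3: 0.5}, {7: 0.25, 3: 0.75}]
loss = teacher_forced_nll(step_probs, [7, 3])  # -(log 0.5 + log 0.75)
```

At inference the same layout is used up to Ⓣ, and each sampled token is appended to the context instead of the ground-truth one.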

Steering Vector Operation

- EmoShift: Emotion-Specific Activation Steering

Emotion-aware LLM-based TTS supports multi-condition generation, but it does not explicitly construct an emotion expression space in its model parameters
- To address this, EmoShift projects the hidden state $\mathbf{h}$ originally generated by the emotion-aware LLM-based TTS into an emotion-dependent offset, forming a manipulable emotion representation space
- Given the target emotion $e$ from the prompt $Q$, EmoShift uses a learnable projection matrix $\mathbf{W}_{e}\in\mathbb{R}^{d\times d}$ to compute a steering vector $\mathbf{v}_{e}=\mathbf{h}\mathbf{W}_{e}$ for each hidden state $\mathbf{h}\in\mathbb{R}^{d}$
  1. The hidden state is then updated as:
    (Eq. 4) $\mathbf{h}'=\mathbf{h}+\epsilon\cdot \mathbf{v}_{e}$
    - $\epsilon$ : scaling factor
  2. $\mathbf{W}_{e}$ captures the emotion-specific activation shift pattern, and $\mathbf{v}_{e}$ encodes the expressive deviation from neutral prosody for emotion $e$
  3. At inference, a gain factor $\alpha\geq 1$ is introduced for fine-grained control:
    (Eq. 5) $\mathbf{h}'=\mathbf{h}+\alpha\epsilon\cdot \mathbf{v}_{e}$
    - Adjusting $\alpha$ smoothly modulates the emotional expression while preserving the target emotion identity, and all steering directions are stored in the compact projection matrix $\mathbf{W}_{e}$
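The update in Eq. 4 and Eq. 5 can be sketched for a single hidden state as follows. This is a minimal illustration in plain Python, not the authors' implementation: the function name `steer`, the matrix `W_happy`, and all numeric values are hypothetical, and a real system would apply the same operation to batched tensors inside the LLM.

```python
def steer(h, W_e, eps, alpha=1.0):
    """Return h' = h + alpha * eps * (h W_e) for one hidden state.

    h     : list of d floats, treated as a row vector
    W_e   : d x d nested list, the projection matrix for emotion e
    eps   : scaling factor (Eq. 4)
    alpha : inference-time gain; alpha = 1 reduces Eq. 5 to Eq. 4
    """
    d = len(h)
    # Steering vector v_e = h W_e (row vector times matrix).
    v_e = [sum(h[i] * W_e[i][j] for i in range(d)) for j in range(d)]
    # Add the scaled offset to the original hidden state.
    return [h[j] + alpha * eps * v_e[j] for j in range(d)]


# Toy 2-dimensional example with made-up values.
h = [1.0, 2.0]
W_happy = [[0.0, 1.0],
           [1.0, 0.0]]

base = steer(h, W_happy, eps=0.5)                 # Eq. 4 -> [2.0, 2.5]
stronger = steer(h, W_happy, eps=0.5, alpha=2.0)  # Eq. 5 -> [3.0, 3.0]
```

Because $\mathbf{v}_{e}$ is itself a linear function of $\mathbf{h}$, the update is equivalent to $\mathbf{h}'=\mathbf{h}(\mathbf{I}+\alpha\epsilon\mathbf{W}_{e})$, so increasing $\alpha$ scales the offset linearly without changing its direction.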

Overview

3. Experiments

- Settings

Dataset : ESD
Comparisons : CosyVoice

- Results

Overall, EmoShift achieves the best performance

Model performance comparison

EmoShift is also the most effective in terms of MOS

Subjective Evaluation

Impact of Steering Coefficient
- Using the steering coefficient yields higher emotion recognition accuracy

Effect of the Steering Coefficient

Using the steering coefficient is also preferred in the AB test

AB Test
