[Paper Review] CoCoEmo: Composable and Controllable Human-Like Emotional TTS via Activation Steering

feVeRin 2026. 6. 9. 13:06

CoCoEmo: Composable and Controllable Human-Like Emotional TTS via Activation Steering

Most text-to-speech systems enforce a single utterance-level emotion
CoCoEmo
- Introduces a multi-rater evaluation protocol for activation steering
- Applies a lightweight steering approach for human-like emotional speech
Paper (ICML 2026) : Paper Link

1. Introduction

Natural speech is inherently complex and often combines multiple concurrent, conflicting affective signals
- Most expressive Text-to-Speech (TTS) models, however, treat emotion as a single, globally coherent state
  - As a result, mixed emotions are averaged into a single dominant tone
- Increasing label granularity or retraining with richer emotion annotations is possible, but neither addresses the root cause
  1. Steering vectors, in contrast, can apply a controlled directional bias in the latent representation space of a pre-trained TTS system
    - Mixed emotion is represented by multiple emotion-specific steering directions, and text-emotion misalignment is expressed by modulating acoustic features independently of the textual content
  2. BUT, applying steering vectors effectively to a Speech Language Model (SLM) requires closing gaps in where to steer, how to steer, and how to evaluate steering

-> The paper therefore proposes CoCoEmo, which analyzes how steering vectors behave in SLMs to improve controllability

CoCoEmo
- Performs an in-depth analysis of modular emotional TTS architectures and introduces a multi-rater protocol for evaluation
- Additionally injects steering vectors into the optimal SLM layers to support reliable mixed-emotion synthesis

< Overview of CoCoEmo >

A steering vector mechanism bridging hybrid TTS systems such as SLM-based ones
As a result, it achieves better performance than existing methods

2. Disentangling Emotion in SLM and Flow-Matching

- Model Overview

Hybrid TTS systems typically use a 2-stage architecture
- Let $\mathbf{x}_{i}$ be the $i$-th input text sequence and $\mathbf{c}_{ref}$ the reference signal for the target emotion
- In the first stage, the TTS language model $f_{SLM}$ maps these inputs to a discrete speech token sequence $\mathbf{z}_{i}$:
  (Eq. 1) $ \mathbf{z}_{i}=f_{SLM}(\mathbf{x}_{i},\mathbf{c}_{ref})$
  - $\mathbf{z}_{i}=(z_{i}^{1},...,z_{i}^{T})$ : token sequence of length $T$
- In the second stage, the flow-matching acoustic model $f_{Flow}$ transforms the speech token sequence into a mel-spectrogram $\mathbf{m}_{i}$, which a pre-trained vocoder $g_{voc}$ converts into the waveform $\mathbf{y}_{i}$:
  (Eq. 2) $\mathbf{m}_{i}=f_{Flow}(\mathbf{z}_{i},\mathbf{c}_{ref}),\,\,\, \mathbf{y}_{i}=g_{voc}(\mathbf{m}_{i})$

- Where to Steer 1: Modular Analysis

Cross-Conditioning Diagnostic
- To disentangle the contributions of the SLM and the Flow-Matching module to emotional expression, the paper introduces a Cross-Conditioning Diagnostic
- Let $\mathbf{c}^{e},\mathbf{c}^{n}$ be the emotional and neutral conditioning signals, respectively
  1. SLM-Driven
    - The emotion reference is applied only to the SLM, where it modifies the speech tokens $\mathbf{z}_{i}$, while the flow-matching module runs under the neutral condition:
    (Eq. 3) $\mathbf{z}_{i}^{e}=f_{SLM}(\mathbf{x}_{i},\mathbf{c}^{e}),\,\,\, \mathbf{m}_{SLM}=f_{Flow}(\mathbf{z}_{i}^{e},\mathbf{c}^{n})$
  2. Flow-Driven
    - The SLM is neutral and the emotion reference is introduced only through flow-matching:
    (Eq. 4) $\mathbf{z}_{i}^{n}=f_{SLM}(\mathbf{x}_{i},\mathbf{c}^{n}),\,\,\, \mathbf{m}_{Flow}=f_{Flow}(\mathbf{z}_{i}^{n},\mathbf{c}^{e})$
  3. If emotion is encoded in the SLM, SLM-Driven should produce stronger emotional expressiveness
    - Otherwise, Flow-Driven dominates
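The wiring of the two diagnostic conditions in (Eq. 3) and (Eq. 4) can be sketched as below. This is only an illustration of which stage receives which conditioning signal: `cross_condition`, `toy_slm`, and `toy_flow` are hypothetical names, and the toy stages use strings and short lists in place of real reference audio, speech tokens, and mel-spectrograms.

```python
from typing import Callable, Dict, List

Tokens = List[int]
Mel = List[float]


def cross_condition(
    f_slm: Callable[[str, str], Tokens],
    f_flow: Callable[[Tokens, str], Mel],
    text: str,
    c_emotional: str,
    c_neutral: str,
) -> Dict[str, Mel]:
    """Run both diagnostic conditions and return their mel outputs."""
    # SLM-Driven (Eq. 3): emotion enters only through the SLM.
    z_e = f_slm(text, c_emotional)
    m_slm = f_flow(z_e, c_neutral)
    # Flow-Driven (Eq. 4): emotion enters only through the flow stage.
    z_n = f_slm(text, c_neutral)
    m_flow = f_flow(z_n, c_emotional)
    return {"slm_driven": m_slm, "flow_driven": m_flow}


# Hypothetical stand-ins: they only make visible which conditioning
# signal reached which stage, nothing more.
def toy_slm(text: str, cond: str) -> Tokens:
    return [len(text), len(cond)]


def toy_flow(tokens: Tokens, cond: str) -> Mel:
    return [float(t) for t in tokens] + [float(len(cond))]


out = cross_condition(toy_slm, toy_flow, "hello", "angry_ref", "neutral")
print(out["slm_driven"])   # [5.0, 9.0, 7.0]
print(out["flow_driven"])  # [5.0, 7.0, 9.0]
```

In a real setup the two outputs would be vocoded and compared with prosodic and emotion metrics. The toy output only confirms that the emotional signal was routed to a different stage in each condition.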

Findings and Design Implications
- In terms of energy contour, the SLM-Driven condition shows distinct prosodic patterns per emotion, whereas the Flow-Driven condition shows largely overlapping contours
  - That is, the flow-matching module does not alter prosody and contributes only to acoustic rendering
- In the paper's cross-conditioning diagnostic table, SLM-Driven has lower CCC and higher SR STD
  - That is, the SLM governs the variability of the synthesized emotional features, while flow-matching refines local rendering
- Consequently, the SLM is the primary driver of emotional prosody, so emotion steering should be applied to the SLM

- Where to Steer 2: Layer and Operator Selection

Why Linear Separability
- In mixed emotion, steering vectors pointing in different directions are combined to produce complex expressions, so linear separability can serve as a proxy for steerability
- That is, the higher the separability, the more reliably steering vectors can be extracted and combined
Layer- and Operation-Level Probing for SLM Steering
- Let $\mathcal{D}=\{(\mathbf{x}_{i},\mathbf{a}_{i},y_{i})\}_{i=1}^{N}$ be a dataset of $N$ samples
  - $\mathbf{x}_{i}$ : input text, $\mathbf{a}_{i}$ : reference emotional speech, $y_{i}\in\{0,1,...,E\}$ : emotion label
- The SLM has $L$ Transformer layers, each with multiple operations $\mathcal{O}^{(l)}$, and the layer-/operation-wise activation is:
  (Eq. 5) $\mathbf{h}_{i}^{(l,o)}=\left\{\begin{matrix} \text{Op}^{(l,o)}(\mathbf{x}_{i},\mathbf{a}_{i}), & l=1 \\ \text{Op}^{(l,o)}(\mathbf{h}_{i}^{(l-1)}), & l=2,...,L \\ \end{matrix}\right.,\,\,\, o\in\mathcal{O}^{(l)}$
  - $\text{Op}^{(l,o)}$ : an operation such as attention or the feed-forward network
- To identify where emotion is most distinctly represented, the paper trains a linear probe on $\mathbf{h}_{i}^{(l,o)}$ to predict $y_{i}$ and measures linear separability by the probe's accuracy
  - The Top-$K$ layers and operations with the highest discriminability are used to extract and inject steering vectors
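A rough sketch of the probing step, using only the Python standard library: a softmax-regression probe is fit per (layer, operation) site and the sites are ranked by held-out accuracy. Everything here is an assumption made for illustration. The activations are synthetic Gaussians, the site names other than `attn_output` are invented, and the paper's actual probe and training setup may differ.

```python
import math
import random


def logits_for(W, b, x):
    return [sum(w * v for w, v in zip(W[c], x)) + b[c] for c in range(len(W))]


def train_linear_probe(X, y, n_classes, epochs=200, lr=0.5):
    """Softmax regression (a linear probe) fit with batch gradient descent."""
    dim, n = len(X[0]), len(X)
    W = [[0.0] * dim for _ in range(n_classes)]
    b = [0.0] * n_classes
    for _ in range(epochs):
        gW = [[0.0] * dim for _ in range(n_classes)]
        gb = [0.0] * n_classes
        for x, label in zip(X, y):
            z = logits_for(W, b, x)
            m = max(z)
            exps = [math.exp(v - m) for v in z]
            total = sum(exps)
            for c in range(n_classes):
                err = exps[c] / total - (1.0 if c == label else 0.0)
                gb[c] += err
                for k in range(dim):
                    gW[c][k] += err * x[k]
        for c in range(n_classes):
            b[c] -= lr * gb[c] / n
            for k in range(dim):
                W[c][k] -= lr * gW[c][k] / n
    return W, b


def probe_accuracy(W, b, X, y):
    hits = 0
    for x, label in zip(X, y):
        z = logits_for(W, b, x)
        hits += int(max(range(len(z)), key=z.__getitem__) == label)
    return hits / len(X)


def make_site(rng, n_per_class, n_classes, dim, sep):
    """Synthetic activations: class c is shifted by `sep` along axis c."""
    X, y = [], []
    for c in range(n_classes):
        for _ in range(n_per_class):
            x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
            x[c] += sep
            X.append(x)
            y.append(c)
    return X, y


rng = random.Random(0)
# Hypothetical sites: separation 0.0 means emotion is not linearly readable there.
sites = {("layer_3", "mlp_output"): 0.0, ("layer_12", "attn_output"): 4.0}
scores = {}
for site, sep in sites.items():
    X_train, y_train = make_site(rng, 40, 3, 4, sep)
    X_test, y_test = make_site(rng, 40, 3, 4, sep)
    W, b = train_linear_probe(X_train, y_train, n_classes=3)
    scores[site] = probe_accuracy(W, b, X_test, y_test)

top_k = sorted(scores, key=scores.get, reverse=True)[:1]
print(scores, top_k)
```

Ranking by held-out accuracy rather than training accuracy keeps a high-dimensional site from looking separable merely because the probe memorized its training samples.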
Findings and Design Implications
- As shown in the paper's figure, layers 10-17 of CosyVoice2 show strong linear separability, and among operations $\texttt{attn\_output}$ shows the highest discriminability
  - For IndexTTS2, layers 5-10
- Consequently, mid-to-late layers and the attention output have the highest linear separability for emotion representation

3. Method

Based on the results above, the paper extracts a steering vector for each individual emotion at the identified model layers
- A mixed-emotion vector is built as a weighted combination of single-emotion vectors, which supports quantitative control over emotion proportions
- Steering vectors are extracted from emotional acoustic variation independently of the linguistic representation, which lets them handle text-emotion mismatch

- Steering Vector Construction

Single Emotion Steering
- The paper computes the emotion steering vector with a mean-difference approach: the vector points from the mean neutral representation to the mean target-emotion representation
  - To isolate acoustic emotion information, only samples with the same speaker and transcript are compared
- Let a dataset $\mathcal{D}=\{(\mathbf{x}_{i},\mathbf{a}_{i},y_{i})\}_{i=1}^{N}$ be given, with emotion labels $y_{i}\in\{0,...,E\}$ where $y_{i}=0$ denotes neutral speech
  1. First, speaker-matched neutral-emotion pairs are constructed to control for speaker and linguistic content
  2. Specifically, for each target emotion $e\in\{1,...,E\}$, emotion-$e$ utterances are paired with neutral utterances from the same speaker, giving two subsets $\mathcal{D}^{(e)}, \mathcal{D}_{0}^{(e)}$
  3. With $\mathbf{h}_{i}^{(l,o)}$ denoting the last-token activation of sample $i$ at the selected layer $l$ and operation $o$, the steering vector for emotion $e$ is the difference between the mean representation of the emotion-$e$ samples and that of their paired neutral counterparts:
    (Eq. 6) $\mathbf{v}_{e}^{(l,o)}=\frac{1}{|\mathcal{D}^{(e)}|}\sum_{i\in\mathcal{D}^{(e)}} \mathbf{h}_{i}^{(l,o)}-\frac{1}{|\mathcal{D}_{0}^{(e)}|}\sum_{j\in\mathcal{D}_{0}^{(e)}} \mathbf{h}_{j}^{(l,o)}$
- Consequently, the vector $\mathbf{v}_{e}^{(l,o)}$ captures the direction for emotion $e$ in the latent space and is injected at inference time to induce the target emotion expression
  - In mismatch scenarios, the steering vector acts as an internal bias that overrides the text-implied emotion
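As a minimal sketch of (Eq. 6), the mean-difference vector can be computed from two lists of last-token activations. Plain Python lists stand in for activation tensors, and the numbers are made up for illustration. `mean_vector` and `steering_vector` are hypothetical helper names, not the paper's code.

```python
def mean_vector(rows):
    """Element-wise mean of equally sized activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def steering_vector(emotion_acts, neutral_acts):
    """Eq. 6: mean emotion-e activation minus mean paired-neutral activation."""
    mu_e = mean_vector(emotion_acts)
    mu_0 = mean_vector(neutral_acts)
    return [a - b for a, b in zip(mu_e, mu_0)]


# Made-up last-token activations at one (layer, operation) site,
# assumed to come from speaker- and transcript-matched pairs.
happy_acts = [[1.0, 2.0, 0.0], [3.0, 2.0, 2.0]]
neutral_acts = [[0.0, 1.0, 1.0], [2.0, 1.0, 1.0]]

v_happy = steering_vector(happy_acts, neutral_acts)
print(v_happy)  # [1.0, 1.0, 0.0]
```

Because the pairs share speaker and transcript, components common to both sets cancel in the subtraction, which is why the third coordinate comes out as zero in this toy case.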
Mixed Emotion Steering
- For mixed emotion, the steering vector is computed by combining the single-emotion vectors $\mathbf{v}_{e}^{(l,o)}$
- With weights $\{p_{e}\}^{E}_{e=1}$ over the target emotions satisfying $\sum_{e=1}^{E}p_{e}=1$, the mixed emotion steering vector is:
  (Eq. 7) $\mathbf{v}_{mix}^{(l,o)}=\sum_{e=1}^{E}p_{e}\mathbf{v}_{e}^{(l,o)}$
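The weighted combination in (Eq. 7) is a convex mix of the single-emotion vectors. The sketch below assumes the vectors are kept in a dictionary keyed by emotion name. `mix_steering_vectors` is a hypothetical helper and the two-dimensional vectors are toy values.

```python
def mix_steering_vectors(vectors, weights, tol=1e-9):
    """Eq. 7: sum over emotions of p_e * v_e, with the weights summing to 1."""
    missing = set(weights) - set(vectors)
    if missing:
        raise ValueError(f"no steering vector for: {sorted(missing)}")
    if abs(sum(weights.values()) - 1.0) > tol:
        raise ValueError("mixing weights must sum to 1")
    dim = len(next(iter(vectors.values())))
    return [sum(p * vectors[e][k] for e, p in weights.items()) for k in range(dim)]


# Toy single-emotion vectors at one (layer, operation) site.
vectors = {"happy": [1.0, 0.0], "sad": [0.0, 2.0]}
v_mix = mix_steering_vectors(vectors, {"happy": 0.75, "sad": 0.25})
print(v_mix)  # [0.75, 0.5]
```

Setting one weight to 1 and leaving the others out recovers the corresponding single-emotion vector, so single-emotion steering is a special case of the same function.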

- Inference-Time Steering

At inference time, the single emotion steering vector $\mathbf{v}_{e}^{(l,o)}$ or the mixed emotion vector $\mathbf{v}_{mix}^{(l,o)}$ is injected at the selected Top-$K$ layers and operations
- At each selected layer and operation, the activation $\mathbf{h}$ is modulated by steering:
  (Eq. 8) $\tilde{\mathbf{h}}_{i}^{(l,o)}=\mathbf{h}_{i}^{(l,o)}+\alpha\cdot \mathbf{v}^{(l,o)}$
  - $\alpha$ : steering intensity, $\mathbf{v}^{(l,o)}$ : single emotion $\mathbf{v}^{(l,o)}_{e}$ or mixed emotion $\mathbf{v}_{mix}^{(l,o)}$
- Additionally, to preserve the original activation scale and maintain semantic coherence, the paper renormalizes as $\tilde{\mathbf{h}}_{i}^{(l,o)}\leftarrow\frac{|| \mathbf{h}_{i}^{(l,o)}||}{||\tilde{\mathbf{h}}_{i}^{(l,o)}||}\cdot \tilde{\mathbf{h}}_{i}^{(l,o)}$
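The additive shift of (Eq. 8) followed by the norm-preserving rescale can be sketched on a single activation vector as follows. This assumes the L2 norm and operates on a plain list. In a real model the same arithmetic would run inside a forward hook at each selected site, which is not shown here. `steer` and `l2_norm` are hypothetical names.

```python
import math


def l2_norm(v):
    return math.sqrt(sum(x * x for x in v))


def steer(h, v, alpha):
    """Shift h by alpha * v (Eq. 8), then rescale to the original norm of h."""
    shifted = [h_k + alpha * v_k for h_k, v_k in zip(h, v)]
    shifted_norm = l2_norm(shifted)
    if shifted_norm == 0.0:
        return shifted  # zero vector: nothing to rescale
    scale = l2_norm(h) / shifted_norm
    return [scale * x for x in shifted]


h = [3.0, 4.0]   # toy activation with norm 5
v = [1.0, 0.0]   # toy steering direction
h_steered = steer(h, v, alpha=2.0)

print(l2_norm(h_steered))           # approximately 5.0, the norm of h
print(h_steered[0] / h_steered[1])  # approximately 1.25, direction of [5, 4]
```

After rescaling, $\alpha$ only changes the direction of the activation, not its magnitude, so large values of $\alpha$ rotate the activation toward $\mathbf{v}$ instead of inflating it.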

- Mixed-Emotion Evaluation

Evaluating mixed-emotion synthesis requires a soft ground truth
- For this, the paper uses multi-rater annotation
  1. Each speech recording $\mathbf{a}_{i}$ is labeled by $M$ raters, each giving a one-hot vector $y_{i,m}\in\{0,1\}^{|E|}$
  2. The consensus distribution is then:
    (Eq. 9) $\mathbf{p}_{i}=\frac{1}{M}\sum_{m=1}^{M}y_{i,m}$
    - e.g., for $E=\{\texttt{happy, sad, angry}\}$, if two raters label $\texttt{happy}$ and one rater labels $\texttt{sad}$, then $\mathbf{p}_{i}=[\frac{2}{3},\frac{1}{3},0]$
- This consensus distribution is used as the mixing weights of the steering vector $\mathbf{v}_{mix}^{(l,o)}$ in (Eq. 7), and the synthesized speech is compared against the ground-truth target speech $\mathbf{a}_{i}$ from which $\mathbf{p}_{i}$ was derived
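The consensus distribution of (Eq. 9) is the fraction of raters who chose each emotion. The sketch below reproduces the three-rater example using exact fractions. `consensus_distribution` is a hypothetical helper, and each rater's one-hot vector is represented by the name of the chosen emotion.

```python
from collections import Counter
from fractions import Fraction


def consensus_distribution(rater_labels, emotions):
    """Eq. 9: average of the raters' one-hot votes, in the order of `emotions`."""
    counts = Counter(rater_labels)
    unknown = set(counts) - set(emotions)
    if unknown:
        raise ValueError(f"labels outside the emotion set: {sorted(unknown)}")
    m = len(rater_labels)
    return [Fraction(counts[e], m) for e in emotions]


emotions = ["happy", "sad", "angry"]
p = consensus_distribution(["happy", "happy", "sad"], emotions)
print([str(x) for x in p])  # ['2/3', '1/3', '0']
```

Since the entries sum to 1 by construction, the result can be converted to floats and used directly as the weights $p_{e}$ of (Eq. 7).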

4. Experiments

- Settings

Dataset : ESD, RAVDESS, CREMA-D
Comparisons : CosyVoice2, IndexTTS2

- Results

Applying CoCoEmo enables better mixed-emotion synthesis

It also performs well in terms of Emotion2Vec Similarity, Target Emotion Probability (TEP), and Spearman Correlation

Text-Emotion Mismatch Speech Synthesis
- Robust performance is also achieved on the mismatched set

Activation steering consistently improves E-SIM

Effect of Steering Strength $\alpha$ in Mismatch Synthesis

Single Emotion Steering
- Compared with $\alpha=0$ (no steering), TEP increases as $\alpha$ grows
- That is, the steering vector applies the correct directional bias

Layer-wise Steering Analysis
- In CosyVoice2, layers 17 and 14 have the highest separability
