[Paper Review] EmoVoice: LLM-based Emotional Text-to-Speech Model with Freestyle Text Prompting

feVeRin 2025. 10. 29. 12:45

EmoVoice: LLM-based Emotional Text-to-Speech Model with Freestyle Text Prompting

Text-to-Speech models are still limited in emotional expression
EmoVoice
- Uses a Large Language Model to support fine-grained, freestyle natural language emotion control
- Outputs phoneme tokens and audio tokens in parallel to improve content consistency
Paper (MM 2025) : Paper Link

1. Introduction

Emotion-controllable Text-to-Speech (TTS) models are limited in emotional richness and expressiveness
- Existing TTS models such as EmoDiff, ZET-Speech, and EmoMix rely on coarse emotion category labels, so they struggle to comprehensively capture nuanced emotions
  - In particular, high-quality emotion datasets for emotional TTS are extremely scarce
- Meanwhile, natural language prompts as in PromptTTS and InstructTTS enable style-controllable TTS, but most such models do not support control over emotion characteristics
- On the emotion evaluation side, there is no suitable metric other than time-consuming subjective evaluation
  - Computing embedding similarity with Emotion2Vec is one option, but its reliability for fine-grained emotion evaluation is limited

-> The paper therefore proposes EmoVoice, which improves emotion control through freestyle text prompting

EmoVoice
- Uses a Large Language Model (LLM) to take the emotion description directly as input, without a prompt encoder
- Outputs phoneme and audio tokens in parallel, following Chain-of-Thought (CoT) and Chain-of-Modality (CoM)

< Overview of EmoVoice >

An emotion-controllable LLM-based TTS model using natural language prompting
As a result, it achieves better performance than existing models

2. Method

- Model

EmoVoice adopts Qwen2.5-0.5B, a causal pre-trained LLM, as its backbone
- The input is pure text containing a fine-grained description of the emotion and the text to synthesize
  - That is, the input text follows the format $\text{<SYSTEM>: Say this sentence with emotion of <Description>. \n <Text>.}$ and is tokenized with the Qwen2.5-0.5B tokenizer
- EmoVoice autoregressively predicts 50Hz CosyVoice semantic tokens as the speech output, which are then converted to an audio waveform by a flow matching module and a HiFi-GAN vocoder
  1. The paper adds a new codebook $V_{a}$ for audio tokens to the original LLM vocabulary $V_{t}$ and its embedding space, forming the expanded vocabulary $V_{j}=V_{t}\cup V_{a}$
    - The original LLM's vocabulary embedding matrix is left unchanged, and the audio token embeddings are randomly initialized
  2. At each prediction step, the audio part of the output logits is then extracted to obtain the predicted distribution over audio tokens:
    (Eq. 1) $x_{a}=\text{logits}[...,|V_{t}|:]$
- In addition, the paper uses semantic group modeling to compress the generated sequence length
  1. To do so, each prediction step predicts $G$ semantic tokens, where $G$ is the group size
    - A linear layer projects the audio logits $L_{a}$ to group-sized logits $L_{g}$
    - $L_{g}\in\mathbb{R}^{|V_{a}|\times G}$
  2. Consequently, the model input at each prediction step is the average of the embeddings of the semantic tokens in the group
    - The cross-entropy loss can then be computed on the output semantic tokens
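The vocabulary expansion, the logit slicing of Eq. 1, and the semantic group modeling step can be sketched roughly as follows. This is a toy illustration in plain Python, not the paper's implementation: all sizes are made-up small values, the weights are random, and the way the linear layer's output is reshaped into a $|V_{a}|\times G$ matrix is an assumption, as are all function names.

```python
import random

# All sizes are toy values chosen for illustration, not the paper's settings.
TEXT_VOCAB = 8    # |V_t|: original LLM vocabulary size (hypothetical)
AUDIO_VOCAB = 5   # |V_a|: audio codebook size (hypothetical)
GROUP = 3         # G: semantic tokens predicted per step (hypothetical)
DIM = 4           # embedding width (hypothetical)

rng = random.Random(0)


def rand_matrix(rows, cols):
    """Random matrix standing in for learned weights."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


# Joint embedding table over V_j = V_t U V_a. The first TEXT_VOCAB rows stand
# in for the pre-trained text embeddings; the audio rows are randomly initialized.
embedding = rand_matrix(TEXT_VOCAB + AUDIO_VOCAB, DIM)


def audio_logits(logits):
    """Eq. 1: keep only the audio part of the joint logits."""
    return logits[TEXT_VOCAB:]


def project_to_group(x_a, weight):
    """Linear layer mapping |V_a| audio logits to a |V_a| x G matrix.

    weight has (|V_a| * G) rows and |V_a| columns; the row-major reshape
    below is an assumed layout.
    """
    flat = [sum(w * x for w, x in zip(row, x_a)) for row in weight]
    return [flat[v * GROUP:(v + 1) * GROUP] for v in range(AUDIO_VOCAB)]


def greedy_group(l_g):
    """Pick one audio token per group slot by argmax over the vocabulary axis."""
    return [max(range(AUDIO_VOCAB), key=lambda v: l_g[v][g]) for g in range(GROUP)]


def next_input(tokens):
    """Average the embeddings of the G predicted tokens to form the next input."""
    rows = [embedding[TEXT_VOCAB + t] for t in tokens]
    return [sum(col) / len(rows) for col in zip(*rows)]


logits = [rng.uniform(-1.0, 1.0) for _ in range(TEXT_VOCAB + AUDIO_VOCAB)]
weight = rand_matrix(AUDIO_VOCAB * GROUP, AUDIO_VOCAB)

x_a = audio_logits(logits)          # audio part of the logits
l_g = project_to_group(x_a, weight) # group-sized logits
tokens = greedy_group(l_g)          # G semantic tokens for this step
step_input = next_input(tokens)     # averaged embedding fed to the next step
```

Because one step emits $G$ tokens, the generated sequence is roughly $G$ times shorter than predicting the semantic tokens one at a time.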
EmoVoice-PP
- EmoVoice-PP is a phoneme-boost variant of EmoVoice that simultaneously predicts semantic and phoneme tokens through parallel audio-phoneme modeling at the output
- At inference time, the phoneme token rate (~11Hz) is lower than the audio token rate (~17Hz), so phoneme tokens are predicted ahead and act as an intermediate supervision signal that guides the final generation of audio tokens
  1. The Qwen2.5-0.5B tokenizer vocabulary does not contain phonemes, so each phoneme is added to the vocabulary as a new token, yielding the modified vocabulary $V'_{t}$
    - The embeddings for the phoneme tokens are randomly initialized
  2. At each prediction step, the audio and phoneme parts are extracted separately from the output logits:
    (Eq. 2) $x_{a}=\text{logits}[...,|V'_{t}|:], x_{p}=\text{logits}[...,:|V'_{t}|]$
    - These are the predicted distributions over audio and phoneme tokens, respectively
  3. The model input at each prediction step uses the average embedding of all semantic tokens in the group together with the phoneme token
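A minimal sketch of the Eq. 2 split and of the token-rate arithmetic, using toy numbers. The function names are hypothetical, and the group size of 3 is an assumption: it is simply a value for which 50Hz semantic tokens would give roughly the ~17Hz audio token rate quoted above.

```python
def split_logits(logits, text_vocab_size):
    """Eq. 2: slice joint logits into the audio part and the phoneme part.

    text_vocab_size plays the role of |V'_t|, the text vocabulary after the
    phoneme tokens have been added.
    """
    x_p = logits[:text_vocab_size]   # text-side logits, including phonemes
    x_a = logits[text_vocab_size:]   # audio-side logits
    return x_a, x_p


def grouped_rate(token_rate_hz, group_size):
    """Prediction steps per second when group_size tokens are emitted per step."""
    return token_rate_hz / group_size


# Toy joint logits: the first 3 entries stand in for V'_t, the last 2 for V_a.
joint = [0.1, 0.7, 0.2, 0.9, 0.3]
x_a, x_p = split_logits(joint, 3)

# 50Hz semantic tokens with an assumed group size of 3.
steps_per_second = grouped_rate(50, 3)
```

Since the phoneme stream needs fewer steps per second of speech than the audio stream, the phoneme sequence finishes earlier in a parallel decode, which is what lets it act as guidance for the audio tokens still to come.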

- Training Pipeline

EmoVoice training consists of 2 phases
- In the first phase, the model is pre-trained on standard TTS training data
  - The input text is formatted as $\text{<SYSTEM>: Say this sentence. \n <Text>.}$
- In the second phase, the model is fine-tuned on instruction data consisting of text, a natural language emotion description, and emotionally expressive speech
  - The input text is $\text{<SYSTEM>: Say this sentence with emotion of <Description>. \n <Text>.}$
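The two input formats differ only in whether an emotion description is present, which can be sketched with a single helper. The function name `build_prompt` is hypothetical, the exact whitespace around the line break is an assumption read off the templates above, and the example sentence and description are made up.

```python
from typing import Optional


def build_prompt(text: str, description: Optional[str] = None) -> str:
    """Build the text input for either training phase.

    Without a description this gives the phase-1 (pre-training) format;
    with one it gives the phase-2 (emotion fine-tuning) format.
    """
    if description is None:
        return f"<SYSTEM>: Say this sentence. \n {text}."
    return f"<SYSTEM>: Say this sentence with emotion of {description}. \n {text}."


# Made-up example inputs, passed without a trailing period because the
# template appends one.
phase1 = build_prompt("I can't believe we won")
phase2 = build_prompt("I can't believe we won", "overwhelming joy and disbelief")
```

Keeping the phase-1 prompt as a strict prefix-style subset of the phase-2 prompt means the fine-tuning stage only has to learn the effect of the added description.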

Model Variants (a) Output Audio Token Only (b) Sequential Output (c) Parallel Output (d) Interleaved Output

3. Experiments

- Settings

Dataset : EmoVoice-DB
Comparisons : PromptStyle, PromptTTS, CosyVoice, CosyVoice2

- Results

Overall, EmoVoice achieves the best performance

It also achieves strong results in terms of MOS

It also performs well on another language (Chinese)

Ablation Study
- EmoVoice-PP and EmoVoice-PT perform somewhat better than the other variants

On the English hard cases, EmoVoice-PP performs best

Data augmentation is also effective for improving performance

Scaling LLM Size
- Scaling the LLM to 1.5B yields better performance

LLM Initialization
- Applying LLM initialization yields better results

Emotion Evaluation Metrics
- Additionally, in terms of Spearman's $\rho$, each evaluation metric diverges considerably from human perception
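For reference, Spearman's $\rho$ between an automatic metric and human ratings is the Pearson correlation of their ranks. The sketch below is a plain-Python version with average ranks for ties; the score lists at the bottom are made-up toy values, not results from the paper.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical per-utterance scores, for illustration only.
human_scores = [4.5, 3.0, 2.0, 4.0, 1.5]
metric_scores = [0.80, 0.75, 0.78, 0.60, 0.70]
rho = spearman_rho(human_scores, metric_scores)
```

A value of $\rho$ near 1 would mean the metric orders utterances the same way listeners do, so a low value is what signals a gap from human perception.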
