[Paper Review] ClapFM-EVC: High-Fidelity and Flexible Emotional Voice Conversion with Dual Control from Natural Language and Speech

feVeRin 2025. 7. 9. 17:01

ClapFM-EVC: High-Fidelity and Flexible Emotional Voice Conversion with Dual Control from Natural Language and Speech

Flexible, interpretable control remains limited in Emotional Voice Conversion
ClapFM-EVC
- Introduces an emotional contrastive language-audio pre-training model guided by natural language prompts and categorical labels
- Applies a FuEncoder that seamlessly fuses the Phonetic PosteriorGram (PPG) of a pre-trained Automatic Speech Recognition (ASR) model
- Additionally reconstructs the mel-spectrogram with a flow matching model conditioned on the captured features
Paper (INTERSPEECH 2025) : Paper Link

1. Introduction

Emotional Voice Conversion (EVC) aims to convert speech to the emotional state of a target category while preserving the original content and speaker identity
- In particular, EVC must effectively decouple speech attributes such as emotion, content, and timbre
  - Generative Adversarial Networks (GANs), AutoEncoders, etc. can be used for this
- BUT, EVC models still lack emotional diversity, and intensity control remains difficult
  - In addition, most models use reference speech or categorical text labels as the condition, which limits the interpretability of the conveyed emotion

-> Hence the paper proposes ClapFM-EVC, which supports intuitive control over emotion

ClapFM-EVC
- Introduces EVC-CLAP, guided by natural language prompts and categorical emotion labels, to extract and align emotion features across the speech and text modalities
- Uses AdaFM-VC, an end-to-end VC model composed of a pre-trained Automatic Speech Recognition (ASR) model, a FuEncoder, and Conditional Flow Matching (CFM)
  - In particular, an Adaptive Intensity Gate (AIG) is applied in the FuEncoder to obtain the emotion representation, which is integrated with the Phonetic PosteriorGram (PPG) of the ASR model to control the emotional intensity of the converted waveform
- At inference, the target emotion embedding is generated from the given natural language prompt

< Overview of ClapFM-EVC >

An emotion-controllable VC model that uses natural language prompts and CFM
As a result, it achieves better performance than existing EVC methods

2. Method

- System Overview

ClapFM-EVC consists of EVC-CLAP, a FuEncoder, a CFM-based decoder, a pre-trained ASR model, and a vocoder
- First, EVC-CLAP is trained with a Kullback-Leibler divergence based contrastive loss (symKL-loss), using soft labels derived from the natural language prompts and their corresponding categorical emotion labels
  - This lets EVC-CLAP extract and align emotional representations across the audio and text modalities, and lets ClapFM-EVC capture the fine-grained emotional information in a language prompt
- The paper then trains AdaFM-VC using the emotional elements and content representations extracted by EVC-CLAP and the pre-trained ASR model, respectively
  1. The FuEncoder inside AdaFM-VC seamlessly integrates the emotional and content characteristics, and the AIG module explicitly controls the emotional intensity
  2. The CFM model in AdaFM-VC samples from random Gaussian noise using the FuEncoder output, and generates the target mel-spectrogram conditioned on the target emotional vector produced by EVC-CLAP
- Finally, the generated mel-spectrogram features are passed to a pre-trained BigVGAN vocoder to synthesize the converted speech
- At inference, ClapFM-EVC can obtain the target emotional embedding in 3 modes:
  1. Using a given reference speech
  2. Using a given natural language emotional prompt
  3. Retrieving relevant data from a pre-constructed high-quality reference speech corpus through EVC-CLAP, then extracting the target emotion elements from the retrieved speech
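
The third inference mode can be pictured as a nearest-neighbor lookup in the shared audio-text embedding space. Below is a minimal sketch, assuming cosine similarity as the retrieval score. The function name `retrieve_reference`, the corpus layout, and the toy 3-dimensional embeddings are hypothetical and not taken from the paper:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def retrieve_reference(prompt_emb, corpus):
    """Return the id of the utterance whose audio embedding is closest to the prompt.

    corpus: dict mapping utterance id -> audio embedding (hypothetical layout).
    """
    return max(corpus, key=lambda utt_id: cosine(prompt_emb, corpus[utt_id]))

# Made-up embeddings standing in for EVC-CLAP outputs.
corpus = {
    "happy_001": [0.9, 0.1, 0.0],
    "sad_001": [0.0, 0.2, 0.9],
    "angry_001": [0.1, 0.9, 0.1],
}
prompt_emb = [0.8, 0.2, 0.1]  # stand-in for the text embedding of a cheerful prompt
best = retrieve_reference(prompt_emb, corpus)
```

In a real system the retrieved utterance would then be passed through the audio encoder to obtain the target emotion embedding. The sketch only covers the lookup step.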

- Soft-Labels-Guided EVC-CLAP

EVC-CLAP training aims to minimize the distance between data pairs of the same class while maximizing the distance between data pairs of different categories
- First, for source speech $X_{i}^{a}$, emotion label $X_{i}^{y}$, and natural language prompt $X_{i}^{p}$, let the input data pair be $\{X_{i}^{a},X_{i}^{y}, X_{i}^{p}\}$ ($i\in [0,N]$, $N$ : batch size)
  1. EVC-CLAP uses a pre-trained HuBERT-based audio encoder and a pre-trained XLM-RoBERTa-based text encoder to compress $X_{i}^{a},X_{i}^{p}$ into two latent variables $Z_{a}\in\mathbb{R}^{N\times D}, Z_{p}\in\mathbb{R}^{N\times D}$
    - $D=512$ : hidden state dimension
  2. The similarity matrices $S_{pred}^{a}, S_{pred}^{p}$ are then computed:
    (Eq. 1) $ S_{pred}^{a}=\epsilon_{a}\times (Z_{a}\cdot Z_{p}^{\top}),\,\, S_{pred}^{p}=\epsilon_{t}\times (Z_{p}\cdot Z_{a}^{\top})$
    - $\epsilon_{a},\epsilon_{t}$ : learnable scaling parameters, empirically initialized to $2.3$
  3. Next, the symKL-loss is applied to train EVC-CLAP under the guidance of the soft label $M_{GT}^{s}\in\mathbb{R}^{N\times N}$ derived from $X_{i}^{y}, X_{i}^{p}$:
    (Eq. 2) $M_{GT}^{s}=\alpha_{e}M_{GT}^{y}+(1-\alpha_{e})M_{GT}^{p}$
    - $\alpha_{e}=0.2$ : hyperparameter that balances $M_{GT}^{y}, M_{GT}^{p}$
- Here, if the categorical emotion labels or the natural language prompt labels of different data pairs within the same batch are identical, the corresponding ground-truth entry is assigned $1$, and otherwise $0$
  - In addition, to ensure label distribution consistency over the batch, the class similarity matrices $M_{GT}^{y}, M_{GT}^{p}$ are normalized so that each row sums to $1$
- As a result, the training loss of EVC-CLAP is:
  (Eq. 3) $\mathcal{L}_{symKL}=\frac{1}{4}\left( \text{KL}\left(S_{pred}^{a}||M_{GT}^{s}\right)+ \text{KL}\left(\tilde{M}_{GT}^{s}||S_{pred}^{a}\right) + \text{KL}\left(S_{pred}^{p}|| M_{GT}^{s}\right) +\text{KL}\left( \tilde{M}_{GT}^{s}||S_{pred}^{p}\right)\right)$
  (Eq. 4) $\tilde{M}_{GT}^{s}=(1-\alpha)\cdot M_{GT}^{s}+\frac{\alpha}{N}$
  (Eq. 5) $\text{KL}(S||M)=\sum_{i,j}S(i,j)\log \frac{S(i,j)}{M(i,j)}$
  - $\alpha=1\times 10^{-8}$ : hyperparameter that smooths $M_{GT}^{s}$ in (Eq. 4)
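
To make (Eq. 1) to (Eq. 5) concrete, here is a minimal pure-Python sketch on a toy batch. Two things in it are assumptions rather than statements from the paper: the similarity matrices are passed through a row-wise softmax so that the KL terms compare probability distributions, and the smoothed $\tilde{M}_{GT}^{s}$ is used in all four KL terms because the unsmoothed soft label contains exact zeros that would make $\log\frac{S}{M}$ undefined. All function names and toy values are hypothetical:

```python
import math

def softmax_rows(mat):
    """Row-wise softmax (assumption: turns similarity scores into distributions)."""
    out = []
    for row in mat:
        m = max(row)
        exps = [math.exp(v - m) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def matmul_t(a, b):
    """Return a @ b^T for two lists of row vectors."""
    return [[sum(x * y for x, y in zip(ra, rb)) for rb in b] for ra in a]

def match_matrix(labels):
    """Entry (i, j) is 1 if labels i and j are identical, else 0; rows normalized to sum to 1."""
    rows = []
    for li in labels:
        row = [1.0 if li == lj else 0.0 for lj in labels]
        total = sum(row)  # at least 1, since every item matches itself
        rows.append([v / total for v in row])
    return rows

def soft_labels(emotions, prompts, alpha_e=0.2):
    """Eq. 2: blend the emotion-label and prompt-label match matrices."""
    m_y, m_p = match_matrix(emotions), match_matrix(prompts)
    return [[alpha_e * y + (1 - alpha_e) * p for y, p in zip(ry, rp)]
            for ry, rp in zip(m_y, m_p)]

def kl(s, m):
    """Eq. 5; terms with S(i, j) = 0 contribute 0 by convention."""
    return sum(sv * math.log(sv / mv)
               for rs, rm in zip(s, m) for sv, mv in zip(rs, rm) if sv > 0)

def sym_kl_loss(z_a, z_p, m_s, eps_a=2.3, eps_t=2.3, alpha=1e-8):
    """Eq. 3 with Eq. 1 and Eq. 4 (smoothed target used in every term, see lead-in)."""
    n = len(m_s)
    s_a = softmax_rows([[eps_a * v for v in row] for row in matmul_t(z_a, z_p)])
    s_p = softmax_rows([[eps_t * v for v in row] for row in matmul_t(z_p, z_a)])
    m_tilde = [[(1 - alpha) * v + alpha / n for v in row] for row in m_s]
    return 0.25 * (kl(s_a, m_tilde) + kl(m_tilde, s_a)
                   + kl(s_p, m_tilde) + kl(m_tilde, s_p))

# Toy batch of 3 pairs: items 0 and 2 share an emotion label, all prompts differ.
m_s = soft_labels(["happy", "sad", "happy"], ["p0", "p1", "p2"])
z_a = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
loss_aligned = sym_kl_loss(z_a, z_a, m_s)                         # text matches audio
loss_shuffled = sym_kl_loss(z_a, [z_a[1], z_a[2], z_a[0]], m_s)   # text permuted
```

With these toy inputs, row 0 of `m_s` places most of its mass on the item itself and a small share on item 2, which has the same emotion label but a different prompt. This is the sense in which the labels are soft.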

- AdaFM-VC

FuEncoder with AIG
- The FuEncoder aims to seamlessly integrate the content features extracted by the pre-trained ASR model with the emotional embedding derived from EVC-CLAP
  - In addition, the Adaptive Intensity Gate (AIG) provides flexible control over emotion intensity
- The FuEncoder consists of a preprocessing network (PreNet), a positional encoding module, an AIG module, an adaptive fusion module, and a linear mapping layer
  1. First, the PreNet compresses the source content feature $Z_{c}$ into a latent space and uses a dropout mechanism to prevent overfitting
  2. The positional encoding module uses sinusoidal positional encoding to extract the positional characteristics of $Z_{c}$
    - Specifically, it is added element-wise to $Z_{c}$ so that the FuEncoder learns sequential and structural information
  3. The AIG module is then applied, which multiplies the EVC-CLAP emotional feature by a learnable parameter to flexibly adjust the emotional intensity
- Meanwhile, the adaptive fusion module of the FuEncoder is composed of multiple fusion blocks
  1. Each block consists of multi-head self-attention, 2 emotion adaptive layer norms, and a position-wise feed-forward network layer, and fuses the content and emotion information
  2. This yields a rich embedding representation that contains both linguistic and emotional characteristics
- Finally, the fused feature is mapped to a specific dimension $f\in \mathbb{R}^{B\times T\times D}$ through a fully-connected layer
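
The positional encoding, intensity gate, and emotion adaptive layer norm described above can be sketched in a few lines of pure Python. This is only an illustration under stated assumptions: the gate is exposed as a plain scalar argument instead of a trained parameter, and the adaptive layer norm uses the hypothetical form `scale = 1 + e`, `shift = e` with the gated emotion vector `e`, since the source does not give the exact parameterization. Self-attention, the feed-forward layer, and the final linear mapping are omitted:

```python
import math

def sinusoidal_pe(length, dim):
    """Standard sinusoidal positional encoding, shape [length][dim] (dim assumed even)."""
    pe = []
    for pos in range(length):
        row = []
        for i in range(0, dim, 2):
            angle = pos / (10000 ** (i / dim))
            row += [math.sin(angle), math.cos(angle)]
        pe.append(row)
    return pe

def intensity_gate(emotion_emb, gate):
    """AIG sketch: scale the emotion embedding by a single scalar."""
    return [gate * e for e in emotion_emb]

def emotion_adaptive_layer_norm(frame, emotion, eps=1e-5):
    """Layer norm over one frame, modulated by the emotion vector (assumed form)."""
    mean = sum(frame) / len(frame)
    var = sum((v - mean) ** 2 for v in frame) / len(frame)
    normed = [(v - mean) / math.sqrt(var + eps) for v in frame]
    return [(1.0 + e) * n + e for n, e in zip(normed, emotion)]

# Toy content features: T=3 frames, D=4 channels (made-up values).
T, D = 3, 4
content = [[0.1 * (t + d) for d in range(D)] for t in range(T)]
pe = sinusoidal_pe(T, D)
content_pos = [[c + p for c, p in zip(cr, pr)] for cr, pr in zip(content, pe)]

emotion = [0.5, -0.5, 0.25, 0.0]       # stand-in for an EVC-CLAP embedding
weak = intensity_gate(emotion, 0.0)    # gate 0: emotion switched off
strong = intensity_gate(emotion, 2.0)  # gate 2: emotion amplified
fused_weak = [emotion_adaptive_layer_norm(f, weak) for f in content_pos]
fused_strong = [emotion_adaptive_layer_norm(f, strong) for f in content_pos]
```

With the gate at 0 the modulation reduces to a plain layer norm, and larger gate values push the fused frames further from it. That is one simple way a single scalar can act as an intensity control.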
Conditional Flow Matching-based Decoder
- To improve naturalness, the paper adopts an Optimal Transport (OT)-based CFM model that reconstructs the target mel-spectrogram $x_{1}\sim p_{1}(x)$ from standard Gaussian noise $x_{0}\sim p_{0}(x)=\mathcal{N}(x;0,I)$
  1. Here, an OT flow $\psi_{t}:[0,1]\times \mathbb{R}^{d}\rightarrow \mathbb{R}^{d}$ conditioned on the captured EVC-CLAP emotional embedding is used to train the CFM-based decoder
  2. The decoder consists of 6 CFM blocks with timestep fusion, and each CFM block has a ResNet module, multi-head self-attention, and a FiLM layer
- To model the learnable, time-dependent vector field $v_{t}:[0,1]\times \mathbb{R}^{d}\rightarrow \mathbb{R}^{d}$, the paper uses an Ordinary Differential Equation (ODE)
  1. This allows the flow to approximate the optimal transport path from $p_{0}(x)$ to the target distribution $p_{1}(x)$:
    (Eq. 6) $ \frac{d}{dt}\psi_{t}(x)=v_{t}\left(\psi_{t}(x),t\right)$
    - $\psi_{0}(x)=x, t\in[0,1]$
  2. Here, the optimal transport path can be simplified as:
    (Eq. 7) $\psi_{t,z}(x)=\mu_{t}(z)+\sigma_{t}(z)x$
    - $\mu_{t}(z)=tz, \sigma_{t}(z)=(1-(1-\sigma_{\min})t)$, $z$ : random conditioned input
    - $\sigma_{\min}=0.0001$ : minimum standard deviation of the white noise that perturbs individual samples
- The resulting training loss of AdaFM-VC is:
  (Eq. 8) $\mathcal{L}=\mathbb{E}_{t,p(x_{0}), q(x_{1})}\left|\left| \left(x_{1}-(1-\sigma_{\min})x_{0}\right) - v_{t}\left(\psi_{t,x_{1}}(x_{0})|h\right)\right|\right|^{2}$
  - $x_{0}\sim p(x_{0}),x_{1}\sim q(x_{1}), t\sim \mathcal{U}[0,1]$
  - $q(x_{1})$ : the potentially non-Gaussian distribution of the data, $h$ : conditional emotion embedding extracted by EVC-CLAP
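
Setting $z=x_{1}$ in (Eq. 7) and differentiating with respect to $t$ gives $\frac{d}{dt}\psi_{t,x_{1}}(x_{0})=x_{1}-(1-\sigma_{\min})x_{0}$, which is the regression target in (Eq. 8). The sketch below spells this out for a single sample and adds a fixed-step Euler solver for (Eq. 6). The solver choice, the step count, and the oracle vector field standing in for the trained decoder are assumptions for illustration. All names are hypothetical:

```python
import random

SIGMA_MIN = 1e-4

def ot_path(x0, x1, t, sigma_min=SIGMA_MIN):
    """Eq. 7 with z = x1: psi_t(x0) = t * x1 + (1 - (1 - sigma_min) * t) * x0."""
    return [t * b + (1.0 - (1.0 - sigma_min) * t) * a for a, b in zip(x0, x1)]

def target_velocity(x0, x1, sigma_min=SIGMA_MIN):
    """Time derivative of the path: x1 - (1 - sigma_min) * x0."""
    return [b - (1.0 - sigma_min) * a for a, b in zip(x0, x1)]

def cfm_loss(v_theta, x0, x1, t, h):
    """Single-sample version of Eq. 8: squared error at the path point psi_t(x0)."""
    xt = ot_path(x0, x1, t)
    pred = v_theta(xt, t, h)
    return sum((u - p) ** 2 for u, p in zip(target_velocity(x0, x1), pred))

def euler_sample(v_theta, x0, h, steps=10):
    """Integrate dx/dt = v_theta(x, t, h) from t = 0 to t = 1 with fixed-step Euler."""
    x, dt = list(x0), 1.0 / steps
    for k in range(steps):
        v = v_theta(x, k * dt, h)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

def make_oracle_field(x1, sigma_min=SIGMA_MIN):
    """Closed-form conditional vector field toward a known x1 (stand-in for the decoder)."""
    def v(x, t, h):
        denom = 1.0 - (1.0 - sigma_min) * t
        return [(b - (1.0 - sigma_min) * a) / denom for a, b in zip(x, x1)]
    return v

random.seed(0)
x0 = [random.gauss(0.0, 1.0) for _ in range(3)]  # Gaussian noise sample
x1 = [0.5, -1.0, 2.0]                            # toy "mel-spectrogram" target
h = None                                         # emotion condition, unused by the oracle
oracle = make_oracle_field(x1)
loss = cfm_loss(oracle, x0, x1, t=0.3, h=h)
generated = euler_sample(oracle, x0, h)
```

Because the OT path is a straight line in $t$, the velocity along each trajectory is constant, so even a coarse Euler solver follows the oracle field exactly and ends at $x_{1}+\sigma_{\min}x_{0}$. A trained network only approximates this field, so the number of steps matters more in practice.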

3. Experiments

- Settings

Dataset : Emotional VC dataset (internal)
Comparisons : StarGAN-EVC, Seq2Seq-EVC, MixEmo

- Results

Overall, ClapFM-EVC achieves the best performance

EVC by Natural Language Prompt
- In the ABX test, conversion driven by a prompt is preferred

Ablation Study
- Removing any of the components degrades performance
