[Paper Review] CycleFlow: Leveraging Cycle Consistency in Flow Matching for Speaker Style Adaptation

feVeRin 2025. 4. 16. 17:53

CycleFlow: Leveraging Cycle Consistency in Flow Matching for Speaker Style Adaptation

Existing voice conversion systems suffer from inaccurate pitch and low speaker adaptation quality
CycleFlow
- Introduces cycle consistency into Conditional Flow Matching for speaker timbre adaptation training
- Uses a Dual-CFM built on VoiceCFM and PitchCFM to improve speaker pitch adaptation quality
Paper (ICASSP 2025) : Paper Link

1. Introduction

Voice Conversion (VC) aims to transfer the speaking style of a source speaker to that of a target speaker while maintaining the original linguistic content
- Existing methods such as AVQVC and NANSY extract disentangled linguistic content from the source speech using Self-Supervised Learning (SSL) models such as Wav2Vec 2.0 and HuBERT, or Automatic Speech Recognition (ASR)
  1. BUT, timbre leakage still remains in the content representation, which limits speaker similarity
  2. In addition, in cross-domain VC tasks the pitch of the source-domain speaker can exceed the vocal range of the target-domain speaker
- Alternatively, when no well-decoupled content encoder is available, the mapping between domains can be learned from unparallel training samples
  1. Representative examples are CycleGAN-VC and MaskCycleGAN-VC, which introduce cycle consistency into a Generative Adversarial Network (GAN) to learn this mapping
    - In particular, cycle consistency regularizes sample pairs and thereby enforces transitivity between forward and inverse conversion
  2. BUT, GAN-based models are complicated to train and prone to gradient explosion/vanishing

-> Therefore, the paper proposes CycleFlow, a VC model that combines cycle consistency with conditional flow matching

CycleFlow
- Uses Cycle Consistency Regularization to improve speaker timbre adaptation without a well-decoupled content representation
- Introduces Dual-Conditional Flow Matching (Dual-CFM) for speech generation and pitch correction to handle speaker pitch adaptation

< Overview of CycleFlow >

A VC model that combines Dual-CFM with cycle consistency regularization
As a result, it achieves better conversion performance than existing methods

2. Method

- Speech Disentanglement

The paper disentangles speech into content, pitch, and timbre representations
1. Content
  - The speech tokenizer of CosyVoice compresses speech into supervised semantic tokens
    - These tokens are obtained from an ASR model with vector quantization (VQ) inserted into the initial 6 layers of the encoder
  - Because the speech tokenizer is trained end-to-end to minimize the recognition error of rich text, the extracted tokens have a strong semantic relationship to the linguistic information
2. Pitch
  - $F0$ is extracted with a pre-trained RMVPE model
3. Timbre
  - A pre-trained speaker encoder extracts a speaker embedding vector from the reference audio
  - This speaker embedding is used to guide the Dual-CFM decoder

- Cycle Consistency Regularization

The paper trains the CFM model to convert the source speech $x_{1}$ using the content condition $c_{x}$, and then to denoise the noisy latent $y_{t}$ into $y_{1}$ using the target speaker style $s_{y}$
- To train this VC model $v_{\theta}(y_{t},s_{y},c_{x})$, it introduces a cycle consistency regularization that preserves the source semantics and the target speaker style in the converted speech
- That is, dropping the time-dependent variance in the cyclic conversion yields the following set of consistency losses:
  (Eq. 1) $\mathcal{L}_{x\rightarrow x}=\mathbb{E}_{x_{0},\varepsilon}|| v_{\theta}(x_{t},s_{x},c_{x})-(x_{1}-(1-\sigma)x_{0})||_{2}^{2}$
          $\mathcal{L}_{y\rightarrow y}=\mathbb{E}_{x_{0},\varepsilon }||v_{\theta}(y_{t},s_{y},c_{\bar{y}_{0}})-(y_{1}-(1-\sigma)y_{0})||_{2}^{2}$
          $\mathcal{L}_{x\rightarrow y\rightarrow x}=\mathbb{E}_{x_{0},\varepsilon }||v_{\theta}(y_{t},s_{x},c_{x})-(y_{1}-(1-\sigma)y_{0})+ v_{\theta}(x_{t},s_{y},c_{x})-(x_{1}-(1-\sigma)x_{0})||_{2}^{2}$
          $\mathcal{L}_{x\rightarrow y\rightarrow y}=\mathbb{E}_{x_{0},\varepsilon }|| v_{\theta}(x_{t},s_{y},c_{x})-v_{\theta}(x_{t},s_{y},c_{\bar{y}_{0}})||_{2}^{2}$
  1. Reconstruction Loss
    - $\mathcal{L}_{x\rightarrow x}, \mathcal{L}_{y\rightarrow y}$ make CycleFlow operate as a conditional flow matching model that reverses (reconstructs) speech
  2. Cycle Consistency Loss
    - $\mathcal{L}_{x\rightarrow y\rightarrow x}$ acts as a transitivity regularization, ensuring that the forward/inverse conversion can reconstruct the original speech $x_{1}$
  3. Invariance Loss
    - $\mathcal{L}_{x\rightarrow y\rightarrow y}$ makes the target speaker style domain invariant under forward conversion
    - That is, given the forward conversion from $x_{t}$ to $\bar{y}_{1}$ conditioned on $c_{x},s_{y}$, repeating the conversion conditioned on the speaker style $s_{y}$ and content $c_{y}$ of $\bar{y}_{1}$ should reproduce $\bar{y}_{1}$
- The resulting training objective is:
  (Eq. 2) $\mathcal{L}_{x}=\lambda_{1}\mathcal{L}_{x\rightarrow x}+\lambda_{2}\mathcal{L}_{x\rightarrow y\rightarrow x}+\lambda_{3}\mathcal{L}_{x\rightarrow y\rightarrow y}$
- Considering every conversion cycle of $x\leftrightarrow y$, the complete training objective of CycleFlow is:
  (Eq. 3) $\mathcal{L}_{Cycle}=\mathcal{L}_{x}+\mathcal{L}_{y}$
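
The four terms of Eq. 1 and their weighting in Eq. 2 can be sketched in plain Python as follows. This is a toy illustration under stated assumptions, not the paper's implementation: `toy_vector_field` is a hypothetical stand-in for $v_{\theta}$ that works on lists of floats, styles and contents are plain floats, `SIGMA = 1e-4` is an assumed value, and the weights in `lam` are placeholders because the review does not list the $\lambda$ values. The expectation over $x_{0},\varepsilon$ is replaced by a single fixed sample.

```python
SIGMA = 1e-4  # assumed value; the review only states that sigma is a constant


def ot_path(x1, x0, t, sigma=SIGMA):
    """Noisy latent x_t = (1 - (1 - sigma) t) x0 + t x1, elementwise."""
    return [(1.0 - (1.0 - sigma) * t) * n + t * d for d, n in zip(x1, x0)]


def ot_target(x1, x0, sigma=SIGMA):
    """Regression target x1 - (1 - sigma) x0 that appears in the terms of Eq. 1."""
    return [d - (1.0 - sigma) * n for d, n in zip(x1, x0)]


def sub(a, b):
    return [p - q for p, q in zip(a, b)]


def sq_norm(v):
    return sum(e * e for e in v)


def toy_vector_field(latent, style, content):
    """Hypothetical stand-in for v_theta(latent, style, content)."""
    return [0.5 * z + style + content for z in latent]


def cycle_losses(v, x1, y1, x0, y0, t, s_x, s_y, c_x, c_ybar):
    """Single-sample version of the four losses in Eq. 1."""
    x_t, y_t = ot_path(x1, x0, t), ot_path(y1, y0, t)
    u_x, u_y = ot_target(x1, x0), ot_target(y1, y0)
    # Reconstruction losses
    l_xx = sq_norm(sub(v(x_t, s_x, c_x), u_x))
    l_yy = sq_norm(sub(v(y_t, s_y, c_ybar), u_y))
    # Cycle consistency loss: the two residuals are summed before the norm
    r_inverse = sub(v(y_t, s_x, c_x), u_y)
    r_forward = sub(v(x_t, s_y, c_x), u_x)
    l_xyx = sq_norm([p + q for p, q in zip(r_inverse, r_forward)])
    # Invariance loss: same latent and style, two different content conditions
    l_xyy = sq_norm(sub(v(x_t, s_y, c_x), v(x_t, s_y, c_ybar)))
    return {"xx": l_xx, "yy": l_yy, "xyx": l_xyx, "xyy": l_xyy}


def loss_x(terms, lam=(1.0, 1.0, 1.0)):
    """Eq. 2 with placeholder weights."""
    return lam[0] * terms["xx"] + lam[1] * terms["xyx"] + lam[2] * terms["xyy"]


terms = cycle_losses(
    toy_vector_field,
    x1=[1.0, 2.0], y1=[0.5, -1.0], x0=[0.1, 0.2], y0=[0.3, -0.4],
    t=0.5, s_x=0.1, s_y=0.2, c_x=0.3, c_ybar=0.3,
)
total_x = loss_x(terms)
```

In this toy call the two content conditions are equal, so the invariance term is exactly zero: both evaluations of the field receive identical inputs. $\mathcal{L}_{y}$ in Eq. 3 is presumably the mirror image with the roles of $x$ and $y$ swapped, which is where the `"yy"` term would enter.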

- Dual-CFM Decoder

CosyVoice uses Optimal-Transport Conditional Flow Matching (OT-CFM), which offers faster generation and easier training than DiffVC and Diff-HierVC
- Accordingly, the paper builds a Dual-CFM decoder consisting of PitchCFM and VoiceCFM
- PitchCFM converts the source $F0$ to the target speaker style, and this $F0$ is passed to VoiceCFM, which hierarchically converts the speech to the target speaker style
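
The hierarchical data flow described above can be sketched as a two-stage function. The names `dual_cfm_convert`, `pitch_cfm`, and `voice_cfm` are hypothetical; the toy stand-ins only shift and add numbers so that the order of the stages is visible, and they do not model any acoustics.

```python
from typing import Callable, List, Sequence


def dual_cfm_convert(
    content: Sequence[int],
    source_f0: Sequence[float],
    target_style: float,
    pitch_cfm: Callable[[Sequence[float], Sequence[int], float], List[float]],
    voice_cfm: Callable[[Sequence[int], List[float], float], List[float]],
) -> List[float]:
    """Stage 1 refines F0 toward the target speaker; stage 2 generates speech from it."""
    refined_f0 = pitch_cfm(source_f0, content, target_style)
    return voice_cfm(content, refined_f0, target_style)


# Toy stand-ins (assumptions for illustration only).
def toy_pitch_cfm(f0, content, style):
    return [f + style for f in f0]  # shift F0 by a speaker-dependent offset


def toy_voice_cfm(content, f0, style):
    return [c + f for c, f in zip(content, f0)]  # one pretend "mel" value per frame


mel = dual_cfm_convert([1, 2, 3], [100.0, 110.0, 120.0], 20.0, toy_pitch_cfm, toy_voice_cfm)
```

The point of the structure is that VoiceCFM never sees the raw source $F0$, only the output of PitchCFM.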
PitchCFM
- PitchCFM is an OT-CFM-based pitch generator that learns the transformation $Z_{t}$ from a standard Gaussian distribution $Z_{0}$ to an $F0$ sample $Z_{1}$:
  (Eq. 4) $Z_{t}=(1-(1-\sigma)t)Z_{0}+tZ_{1}$
  - $Z_{t}$ : follows an Ordinary Differential Equation (ODE); in the reverse process, denoising is performed to recover the original pitch contour
- To this end, the paper defines an Optimal-Transport (OT) flow and has a neural network $\text{NN}_{pitch}$ estimate the vector field $v_{t}$:
  (Eq. 5) $v_{t}(Z_{t},t)=\frac{d}{dt}Z_{t}=Z_{1}-(1-\sigma)Z_{0}$
  - $\sigma$ : constant
- The speech token $c$ is encoded by a Conformer and then upsampled to match the length of the $F0$ embedding
  1. The encoded feature $c$, target speaker embedding $s$, and source $F0$ embedding $f$ are then passed to the pitch encoder to produce the pitch embedding $Z_{1}$
  2. The sample $Z_{t}$ generated from $Z_{1}$ is passed to a U-Net-based pitch score estimator for vector field prediction:
    (Eq. 6) $v_{t}^{pred}=\text{NN}_{pitch}(Z_{t},t;s)$
- The pitch encoder converts the normalized $F0$ of the source speaker into the pitch representation $Z_{1}$
  - To use $Z_{1}$ as the prior distribution of PitchCFM, the pitch representation is regularized with a pitch reconstruction loss:
    (Eq. 7) $\mathcal{L}_{pitch}=||v_{t}^{pred}-v_{t}||^{2}$
  - Meanwhile, since speech tokens contain only limited timbre information, cross-attention is used to extract the target speaker style from the reference audio
    - At inference, the pitch embedding $Z_{1}$ from the pitch encoder is used as the prior of PitchCFM to generate $F0$ refined toward the target speaker style
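
Eq. 4, Eq. 5, and Eq. 7 fit together as follows: draw noise $Z_{0}$ and a time $t$, build $Z_{t}$, and regress the network output onto the constant target $Z_{1}-(1-\sigma)Z_{0}$. Below is a minimal sketch in plain Python. `SIGMA = 1e-4` and the `predict` callable are assumptions, since the review gives neither the value of $\sigma$ nor the network details. The final lines check numerically that Eq. 5 is the time derivative of Eq. 4.

```python
import random

SIGMA = 1e-4  # assumed value; the review only states that sigma is a constant


def interpolate(z0, z1, t, sigma=SIGMA):
    """Eq. 4: Z_t = (1 - (1 - sigma) t) Z_0 + t Z_1, elementwise."""
    return [(1.0 - (1.0 - sigma) * t) * a + t * b for a, b in zip(z0, z1)]


def target_field(z0, z1, sigma=SIGMA):
    """Eq. 5: v_t = Z_1 - (1 - sigma) Z_0, constant in t."""
    return [b - (1.0 - sigma) * a for a, b in zip(z0, z1)]


def pitch_loss(predict, z1, style, rng):
    """Eq. 7 for one sample: squared error between predicted and target field."""
    z0 = [rng.gauss(0.0, 1.0) for _ in z1]  # standard Gaussian noise
    t = rng.random()                         # time drawn uniformly from [0, 1)
    z_t = interpolate(z0, z1, t)
    v_pred = predict(z_t, t, style)          # hypothetical stand-in for NN_pitch
    v_true = target_field(z0, z1)
    return sum((p - q) ** 2 for p, q in zip(v_pred, v_true))


# Central finite difference of Eq. 4 around t = 0.4, compared with Eq. 5.
z0, z1 = [0.3, -1.2, 0.8], [5.1, 5.4, 4.9]
h = 1e-6
before, after = interpolate(z0, z1, 0.4 - h), interpolate(z0, z1, 0.4 + h)
numeric = [(b - a) / (2 * h) for a, b in zip(before, after)]
analytic = target_field(z0, z1)
max_gap = max(abs(n - a) for n, a in zip(numeric, analytic))
```

Because the path is linear in $t$, the target does not depend on $t$ or on $Z_{t}$; the network still receives both so that it can infer the target from a single noisy observation.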
VoiceCFM
- VoiceCFM is an OT-CFM-based mel-spectrogram generator that produces high-quality speech from the content, target $F0$, and target speaker style
  - It is conditioned on the refined $F0$ from PitchCFM and the target speaker style $s$ to maximize speaker adaptation capacity
- First, it learns the transformation $x_{t}$ from a standard Gaussian distribution $x_{0}$ to the mel-spectrogram distribution $x_{1}$ via (Eq. 8):
  (Eq. 8) $x_{t}=(1-(1-\sigma)t)x_{0}+tx_{1}$
  - $x_{t}$ : follows an ODE
- The OT flow is then defined, and $\text{NN}_{mel}$ is trained to estimate the vector field $v_{t}$:
  (Eq. 9) $v_{t}(x_{t},t)=\frac{d}{dt}x_{t}=x_{1}-(1-\sigma)x_{0}$
- As in PitchCFM, the speaker embedding $s$ and the refined $F0$ embedding $f_{r}$ are passed to a U-Net-based mel-spectrogram score estimator for vector field prediction:
  (Eq. 10) $v_{t}^{pred}=\text{NN}_{mel}(x_{t},t;s,f_{r})$
- The resulting training objective is:
  (Eq. 11) $\mathcal{L}_{mel}=||v_{t}^{pred}-v_{t}||^{2}$
- At inference, the speech tokens of the source speaker are used as the prior, and VoiceCFM generates converted speech conditioned on the target speaker style and the refined $F0$ from PitchCFM
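
Generating a sample from a trained flow amounts to integrating $\frac{d}{dt}x_{t}=v_{t}$ from $t=0$ to $t=1$. The sketch below uses a fixed-step Euler solver, which is an assumption: the review does not say which ODE solver or how many steps CycleFlow uses. An oracle field built from Eq. 9 takes the place of $\text{NN}_{mel}$, so the trajectory should end at $x_{1}+\sigma x_{0}$, which is what Eq. 8 gives at $t=1$.

```python
SIGMA = 1e-4  # assumed value; the review only states that sigma is a constant


def euler_sample(field, x_start, steps=10):
    """Integrate dx/dt = field(x, t) from t = 0 to t = 1 with fixed Euler steps."""
    x = list(x_start)
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        v = field(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Oracle field from Eq. 9; a trained network would only approximate this.
x0 = [0.5, -0.25]  # initial noise sample
x1 = [2.0, 1.0]    # toy "mel" target


def oracle_field(x, t):
    return [b - (1.0 - SIGMA) * a for a, b in zip(x0, x1)]


out = euler_sample(oracle_field, x0)
expected = [b + SIGMA * a for a, b in zip(x0, x1)]  # Eq. 8 evaluated at t = 1
max_err = max(abs(o - e) for o, e in zip(out, expected))
```

With the oracle field the velocity is constant, so Euler integration is exact up to floating-point rounding regardless of the step count. With a learned field the step count trades speed against accuracy.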

3. Experiments

- Settings

Dataset : LibriTTS, VCTK
Comparisons : UUVC, DiffVC, Diff-HierVC, CosyVoice

- Results

Overall, CycleFlow achieves the best performance among the compared models

Analysis on $F0$ Prediction
- In terms of average pitch, CycleFlow is close to the target speech

Evaluation of Speaker Similarity
- CycleFlow also achieves the best speaker similarity

Ablation Study
- Removing any component degrades performance
