[Paper Review] Discl-VC: Disentangled Discrete Tokens and In-Context Learning for Controllable Zero-Shot Voice Conversion

feVeRin 2025. 9. 13. 07:50

Discl-VC: Disentangled Discrete Tokens and In-Context Learning for Controllable Zero-Shot Voice Conversion

Zero-shot Voice Conversion struggles to accurately replicate the source speaker's speaking style
Discl-VC
- Disentangles content and prosody information from self-supervised speech representations
- Synthesizes the target speaker's voice with a Flow Matching Transformer and in-context learning
Paper (INTERSPEECH 2025) : Paper Link

1. Introduction

Voice Conversion (VC) aims to transform a source speaker's voice into that of a target speaker
- This requires decoupling linguistic content information from speaker information in speech
  1. Notably, AutoVC achieves attribute disentanglement with a bottleneck
  2. Alternatively, VC can be performed with self-supervised representations such as ContentVec
- Beyond content/speaker information, a VC model should also control speaking style, as in Diff-HierVC, StableVC, and Vevo
  - BUT, converted speech still falls short in naturalness and prosody similarity

-> The paper therefore proposes Discl-VC, which enables better speech disentangling and attribute control

Discl-VC
- Disentangles content and prosody using self-supervised representations
- Introduces a Flow Matching Transformer, built on Flow Matching and in-context learning, to improve speech modeling
- Additionally uses a Prosody Mask Transformer that follows the mask-and-predict paradigm to predict the prosody tokens of the generated speech

< Overview of Discl-VC >

A zero-shot VC model built on a Flow Matching Transformer and a Prosody Mask Transformer
As a result, it achieves better conversion performance than prior methods

2. Method

- Speech Disentanglement

The Content-Prosody Extractor and Content Encoder of Discl-VC are pre-trained self-supervised models
- The Content Encoder uses HuBERT-large, applying $k$-means clustering with 1024 clusters to the continuous representations of the 24th layer
  1. The content tokens obtained from the Content Encoder can be regarded as carrying only semantic information, with timbre and prosody information filtered out
  2. In addition, a deduplication process that removes adjacent duplicate tokens further eliminates duration-related prosody information
    - This allows speech token durations to be re-predicted, improving the prosody of the generated speech
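
The deduplication step can be illustrated with a minimal sketch. It assumes frame-level content tokens are plain integer cluster IDs; the function names `deduplicate` and `expand` and the example IDs are hypothetical and do not come from the paper. The run length of each collapsed token is kept as its duration, which is the quantity a duration predictor would later have to re-predict.

```python
from itertools import groupby


def deduplicate(tokens):
    """Collapse runs of identical adjacent tokens.

    Returns the collapsed tokens and the run length (duration in frames)
    of each kept token.
    """
    units, durations = [], []
    for token, run in groupby(tokens):
        units.append(token)
        durations.append(sum(1 for _ in run))
    return units, durations


def expand(units, durations):
    """Inverse operation: repeat each token by its duration."""
    expanded = []
    for token, duration in zip(units, durations):
        expanded.extend([token] * duration)
    return expanded


# Hypothetical k-means cluster IDs, one per frame.
frame_tokens = [17, 17, 17, 402, 402, 9, 17, 17]
units, durations = deduplicate(frame_tokens)
print(units)      # [17, 402, 9, 17]
print(durations)  # [3, 2, 1, 2]
```

Only adjacent repeats are merged, so the second run of token 17 survives as its own unit; `expand(units, durations)` restores the original frame-level sequence.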
- The Content-Prosody Extractor uses ContentVec
  - Its output already has speaker information disentangled while retaining nearly all content and prosody information, which reduces the complexity of the disentangling process
- To extract prosody information, the paper introduces a Vector Quantization (VQ) Prosody Encoder
  1. Structurally, it consists of two convolution stacks, an Inversed Length Regulator (Inversed LR), and a VQ layer
    - The Inversed LR adjusts the sequence length based on token durations
  2. The VQ layer serves as a bottleneck that filters out content information
    - SimVQ is adopted to prevent codebook collapse
- Here the codebook vectors are randomly initialized and are not updated during training
  - Instead, a linear layer $W$ is applied to the codebook vectors $q$ to produce the quantized result $z_{q}$
- The resulting vector quantization loss is:
  (Eq. 1) $\mathcal{L}_{SimVQ}=\lambda ||qW-\text{sg}[z]||^{2}+||z-\text{sg}[qW]||^{2}$
  - $\lambda$ : hyperparameter, $\text{sg}[\cdot]$ : stop-gradient operator
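
A rough sketch of the quantization described above, for a single encoder output $z$: the codebook $q$ stays frozen, the linear layer $W$ projects it, and the nearest row of $qW$ becomes the quantized result. The codebook size, the dimension, the identity initialization of $W$, and the value of `lam` are assumptions made for illustration only. Because stop-gradient only changes how gradients flow, both terms of Eq. 1 have the same numeric value in the forward pass.

```python
import random


def project_codebook(q, W):
    """Compute qW: every frozen codebook row passed through the linear layer W."""
    return [[sum(row[i] * W[i][j] for i in range(len(W)))
             for j in range(len(W[0]))] for row in q]


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def simvq_quantize(z, q, W, lam=1.0):
    """Return (index, z_q, loss value) for one encoder output z."""
    projected = project_codebook(q, W)  # effective codebook qW
    index = min(range(len(projected)), key=lambda k: sq_dist(z, projected[k]))
    z_q = projected[index]
    d = sq_dist(z_q, z)
    # lam * ||qW - sg[z]||^2 + ||z - sg[qW]||^2; sg[.] leaves forward values unchanged.
    loss = lam * d + d
    return index, z_q, loss


rng = random.Random(0)
K, D = 8, 4                                                          # assumed sizes
q = [[rng.gauss(0.0, 1.0) for _ in range(D)] for _ in range(K)]      # frozen, random init
W = [[1.0 if i == j else 0.0 for j in range(D)] for i in range(D)]   # assumed identity init
z = [0.1, -0.3, 0.5, 0.0]
index, z_q, loss = simvq_quantize(z, q, W, lam=0.25)
```

In an actual training setup only $W$ (and the encoder producing $z$) would receive gradients, so all codebook entries move together through $W$ rather than one entry at a time.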
- In addition, since $F0$ is a crucial component of prosody, $\mathcal{L}_{F0}$ is introduced so that prosody information is extracted effectively
  1. First, the mean and variance of each speaker are computed, and $Z$-score normalization is applied to the ground-truth $F0$ to obtain speaker-independent prosody
  2. Then the Smooth $L1$ loss between the predicted value and the ground truth is computed
  3. A Duration Predictor and Length Regulator (DP&LR), as in FastSpeech2, are applied to expand the token length based on the predicted duration
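
The $F0$ normalization and loss can be sketched as follows. The $F0$ values are made up, every frame is assumed to be voiced, and `beta` is the usual Smooth $L1$ transition point; none of these specifics come from the paper. The two toy speakers share the same pitch contour at different registers, so their normalized contours coincide.

```python
import math


def speaker_stats(f0_values):
    """Mean and standard deviation of one speaker's F0 (all frames assumed voiced)."""
    mean = sum(f0_values) / len(f0_values)
    var = sum((f - mean) ** 2 for f in f0_values) / len(f0_values)
    return mean, math.sqrt(var)


def z_score(f0_values, mean, std):
    """Speaker-independent F0 contour."""
    return [(f - mean) / std for f in f0_values]


def smooth_l1(pred, target, beta=1.0):
    """Mean Smooth L1 loss: quadratic below beta, linear above."""
    total = 0.0
    for p, t in zip(pred, target):
        diff = abs(p - t)
        total += 0.5 * diff * diff / beta if diff < beta else diff - 0.5 * beta
    return total / len(pred)


# Hypothetical F0 tracks in Hz: same contour shape, different pitch range.
f0_a = [100.0, 120.0, 140.0]
f0_b = [200.0, 240.0, 280.0]
norm_a = z_score(f0_a, *speaker_stats(f0_a))
norm_b = z_score(f0_b, *speaker_stats(f0_b))

predicted = [0.0, 0.0, 0.0]  # stand-in for the model's F0 prediction
loss = smooth_l1(predicted, norm_a)
```

After normalization the register difference between the two speakers disappears, which is what makes the target usable as a speaker-independent prosody signal.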

- In-Context Learning Modeling

Flow Matching Transformer
- Flow Matching learns an Ordinary Differential Equation (ODE) that maps a simple distribution to the target data distribution
  - A time-dependent vector field $v_{t}(x;\theta)$ that gradually transforms the initial distribution into the data distribution is defined, and the discrepancy between the true vector field $u_{t}(x)$ and the learned vector field $v_{t}(x;\theta)$ is minimized
- The paper adopts the Optimal Transport Flow Matching objective so that the flow between the initial and target distributions is modeled as a straight line:
  (Eq. 2) $\mathcal{L}_{OT\text{-}CFM}(\theta)=\mathbb{E}_{t,q(x_{1}),p(x_{0})}\left[ ||v_{t}\left( (1-t)x_{0}+tx_{1};\theta\right)-(x_{1}-x_{0})||^{2}\right]$
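
As a short worked step, the regression target $x_{1}-x_{0}$ in Eq. 2 follows directly from the straight-line path between a noise sample $x_{0}$ and a data sample $x_{1}$. The symbol $x_{t}$ for the interpolated point is introduced here for illustration:

```latex
x_{t} = (1-t)\,x_{0} + t\,x_{1}, \qquad t \in [0,1]
\\
\frac{d x_{t}}{dt} = x_{1} - x_{0}
\\
x_{t}\big|_{t=0} = x_{0}, \qquad x_{t}\big|_{t=1} = x_{1}
```

The velocity along the path is constant in $t$, so the model only has to regress a fixed direction for each $(x_{0}, x_{1})$ pair.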
- VoiceBox integrates a Transformer with flow matching for the text-guided speech-infilling task
  1. Following this, the paper adopts a mask-and-predict approach and uses in-context learning to generate the masked acoustic features
  2. During training, a time step $t$ is first sampled, and a certain level of Gaussian noise is added to the real mel-spectrogram to obtain a noisy version
    - The noisy mel-spectrogram is then passed to the Flow Matching Transformer together with the input prosody and content features and the masked mel-spectrogram
  3. The Flow Matching Transformer predicts the masked mel-spectrogram based on the complete prosody and content information and the surrounding mel-spectrogram
- Structurally, the Flow Matching Transformer uses DiT blocks with AdaLN-zero
  - In addition, Classifier-Free Guidance, which removes the prosody and content tokens and the masked mel-spectrogram with probability $0.2$, is applied to improve speech sample quality
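
The training-side data flow described above can be sketched without any model. The sketch assumes a mel-spectrogram stored as a list of frames, a single contiguous masked span, zero-filling for masked frames, and `None` as the marker for dropped conditions; the function name `make_fm_example` and all of these representation choices are hypothetical. Only the $0.2$ drop probability and the interpolation of Eq. 2 come from the text.

```python
import random


def make_fm_example(mel, content, prosody, mask_span, rng, p_drop=0.2):
    """Build one flow matching training example for masked mel infilling."""
    t = rng.random()  # time step t sampled uniformly from [0, 1)
    x0 = [[rng.gauss(0.0, 1.0) for _ in frame] for frame in mel]  # Gaussian noise
    # Noisy mel on the straight path between noise x0 and data x1 (= mel).
    x_t = [[(1 - t) * n + t * m for n, m in zip(nf, mf)]
           for nf, mf in zip(x0, mel)]
    # Regression target of Eq. 2: the constant velocity x1 - x0.
    target = [[m - n for n, m in zip(nf, mf)] for nf, mf in zip(x0, mel)]
    start, end = mask_span
    # In-context prompt: frames inside the span are zeroed, the rest are kept.
    masked_mel = [[0.0] * len(f) if start <= i < end else list(f)
                  for i, f in enumerate(mel)]
    # Classifier-free guidance: drop all conditions with probability p_drop.
    if rng.random() < p_drop:
        content = prosody = masked_mel = None
    return {"t": t, "x_t": x_t, "target": target,
            "content": content, "prosody": prosody, "masked_mel": masked_mel}


rng = random.Random(0)
mel = [[0.1 * i + 0.01 * j for j in range(4)] for i in range(6)]  # toy 6-frame, 4-bin mel
example = make_fm_example(mel, content=[3, 3, 7, 7, 7, 1], prosody=[5, 5, 5, 2, 2, 2],
                          mask_span=(2, 5), rng=rng, p_drop=0.0)
```

A real Flow Matching Transformer would take `x_t`, `t`, and the three conditions as input and be trained to output `target`; here `p_drop=0.0` is passed only so the toy example keeps its conditions.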
Prosody Mask Transformer
- A controllable VC task requires learning the speaking style of the reference audio
  - To this end, the paper introduces a non-autoregressive Prosody Mask Transformer that predicts the prosody tokens of the source speech based on the reference audio
- This masked token modeling predicts all tokens in parallel and iteratively refines low-confidence outputs
  1. Given a prosody token sequence, a special token is introduced and the tokens selected according to a sine schedule are replaced with it
  2. The module then predicts the masked tokens based on the full content token sequence and the unmasked prosody tokens
    - The module is optimized with a cross-entropy loss on the masked portion
- At inference, the prompt tokens of the full reference audio and the masked prosody tokens are used to iteratively unmask the prosody tokens
  - At each step, low-confidence tokens are re-masked according to the time step
- Structurally, the Prosody Mask Transformer is identical to the Flow Matching Transformer and uses Classifier-Free Guidance that removes the condition with probability $0.2$
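
The mask-and-predict loop can be sketched as below. The exact schedule form $\sin(\pi t/2)$, the mask token ID, the number of steps, and the `predict` callback are assumptions: `predict` stands in for the Prosody Mask Transformer conditioned on the content tokens and the reference prompt, and here it is replaced by a dummy function so the loop can be followed end to end.

```python
import math
import random

MASK = -1  # hypothetical ID of the special mask token


def mask_ratio(t):
    """Sine schedule (assumed form): fraction of tokens masked at t in [0, 1]."""
    return math.sin(0.5 * math.pi * t)


def mask_for_training(tokens, t, rng):
    """Replace a schedule-dependent number of randomly chosen tokens with MASK."""
    n_mask = max(1, int(len(tokens) * mask_ratio(t)))
    positions = set(rng.sample(range(len(tokens)), n_mask))
    return [MASK if i in positions else tok for i, tok in enumerate(tokens)]


def iterative_unmask(length, predict, steps=4):
    """Start fully masked; commit confident predictions and re-mask the rest.

    predict(tokens) returns one (token_id, confidence) pair per position.
    """
    tokens = [MASK] * length
    for step in range(steps):
        masked = [i for i, tok in enumerate(tokens) if tok == MASK]
        if not masked:
            break
        proposals = predict(tokens)
        # Number of positions that remain masked after this step.
        n_remask = int(length * mask_ratio(1.0 - (step + 1) / steps))
        n_commit = max(0, len(masked) - n_remask)
        by_confidence = sorted(masked, key=lambda i: proposals[i][1], reverse=True)
        for i in by_confidence[:n_commit]:
            tokens[i] = proposals[i][0]
    return tokens


def dummy_predict(tokens):
    """Stand-in model: fixed token per position, confidence decreasing with index."""
    return [(10 * i, 1.0 / (1 + i)) for i in range(len(tokens))]


result = iterative_unmask(8, dummy_predict, steps=4)
training_input = mask_for_training(list(range(10)), 0.5, random.Random(0))
```

With 8 tokens and 4 steps, this schedule leaves 7, 5, 3, and then 0 positions masked, so few tokens are committed early and the bulk is committed once more context is available.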

- Model Training

Discl-VC uses a 2-stage training process
- First, the VQ Prosody Encoder, DP&LR, and Flow Matching Transformer are trained jointly to learn speech decoupling and high-quality zero-shot VC:
  (Eq. 3) $\mathcal{L}_{stage1}=\mathcal{L}_{Dur}+\mathcal{L}_{SimVQ}+\mathcal{L}_{FMT}+\mathcal{L}_{F0}$
  - $\mathcal{L}_{Dur}$ : MSE on duration, $\mathcal{L}_{FMT}$ : flow matching loss
- Then the trained VQ Prosody Encoder extracts ground-truth prosody tokens, which are used together with content tokens to train the Prosody Mask Transformer:
  (Eq. 4) $\mathcal{L}_{stage2}=\mathcal{L}_{PMT}$
  - $\mathcal{L}_{PMT}$ : cross-entropy loss of the Prosody Mask Transformer
  - After training, this module is used to predict prosody tokens based on the reference audio

3. Experiments

- Settings

Dataset : LibriLight
Comparisons : FACodec, Vevo

- Results

Overall, Discl-VC achieves the best performance

Discl-VC supports both prosody-preserving and prosody-converting zero-shot VC

Discl-VC is also the best in subjective evaluation

It also shows better disentanglement in terms of mel-spectrograms

Ablation Study
- Removing each component degrades performance
