[Paper Review] AdaptVC: High Quality Voice Conversion with Adaptive Learning

feVeRin 2025. 5. 9. 15:58

AdaptVC: High Quality Voice Conversion with Adaptive Learning

Voice conversion requires extracting disentangled linguistic content from the source and voice style from the reference
AdaptVC
- Tunes self-supervised speech features with adapters to effectively disentangle content and speaker
- Improves synthesis quality with cross-attention speaker conditioning and conditional flow matching
Paper (ICASSP 2025) : Paper Link

1. Introduction

Voice Conversion (VC) aims to convert the source speaker's voice into the target speaker's voice while preserving the original linguistic content
- This requires effectively disentangling the linguistic and speaker attributes
  1. For example, AutoVC uses an autoencoder architecture with an information bottleneck layer, and DiffVC introduces maximum likelihood sampling on top of a diffusion mechanism
  2. BUT, existing methods still suffer from content information loss and limited speaker resemblance
- Meanwhile, Self-Supervised Learning (SSL) methods such as HuBERT, Wav2Vec 2.0, WavLM, and XLS-R can effectively extract acoustic and linguistic information
  - For example, NANSY, DDDM-VC, and kNN-VC use these SSL representations to achieve human-like conversion quality
- BUT, these approaches require heuristic selection of, and parameter search over, the intermediate layers of the SSL model

-> The paper therefore proposes AdaptVC, which can effectively tune diverse SSL representations

AdaptVC
- Tunes the rich representations of SSL models with adapters
  - In particular, the intermediate layer outputs of the SSL model are combined by weighted summation, and the adapter is automatically guided to produce richer representations
- Additionally, an Optimal-Transport Conditional Flow Matching (OT-CFM) decoder and cross-attention speaker conditioning are used to model detailed speaker characteristics

< Overview of AdaptVC >

A zero-shot VC model that combines adapters on SSL representations with OT-CFM
As a result, it achieves better conversion performance than existing methods

2. Method

AdaptVC follows an encoder-decoder architecture
- The source and reference utterances are passed to separate encoders, each built on HuBERT, a pre-trained speech SSL model
  1. In addition, an adapter that combines all intermediate layer outputs is used
  2. Each encoder adapter contains learnable weights that act as the coefficients of the weighted summation, and they are updated to maximize the extraction of content-only or speaker-only information
- The encoded content feature is then passed to a U-Net-based CFM decoder conditioned on the encoded speaker feature, which generates the mel-spectrogram of the converted speech
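
The weighted summation performed by the adapter can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the toy shapes, and in particular the softmax normalization of the layer coefficients are assumptions, since the post only states that learnable weights act as coefficients of the sum.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def adapter_weighted_sum(layer_outputs, layer_logits):
    """Combine per-layer SSL features into a single feature sequence.

    layer_outputs: list of L layers, each a list of T frames of D floats.
    layer_logits:  list of L learnable scalars (hypothetical parameterization).
    """
    weights = softmax(layer_logits)  # assumption: coefficients are softmax-normalized
    num_frames = len(layer_outputs[0])
    dim = len(layer_outputs[0][0])
    combined = [[0.0] * dim for _ in range(num_frames)]
    for w, layer in zip(weights, layer_outputs):
        for t in range(num_frames):
            for d in range(dim):
                combined[t][d] += w * layer[t][d]
    return combined

# Toy example: 3 layers, 2 frames, 2 feature dimensions.
layers = [
    [[1.0, 0.0], [1.0, 0.0]],
    [[0.0, 1.0], [0.0, 1.0]],
    [[2.0, 2.0], [2.0, 2.0]],
]
content_logits = [0.0, 0.0, 0.0]  # equal logits give a plain average over layers
content_feature = adapter_weighted_sum(layers, content_logits)
```

In this sketch the content and speaker encoders would each hold their own `layer_logits`, so training can push the two adapters toward different layers.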

- Content Encoder

The content encoder aims to extract linguistic features while minimizing the influence of speaker-specific attributes
- The paper therefore adds a Vector Quantization (VQ) layer after the adapter to further guide the disentangling of speaker aspects
  - Quantizing the latent feature yields a discrete, compact representation
- In terms of speech encoding, the adapter output is guided to map similar content information from different speakers to the closest embedding
  - As a result, accurate linguistic information can be produced independently of the speaker
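
A minimal sketch of the nearest-embedding lookup behind a VQ layer is shown below. The codebook values, the frame values, and the function names are made up for illustration; the post does not give the codebook size or the distance metric, so squared Euclidean distance is an assumption.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def quantize(frames, codebook):
    """Map each frame to its nearest codebook vector.

    Returns (indices, quantized_frames).
    """
    indices = []
    quantized = []
    for frame in frames:
        idx = min(range(len(codebook)),
                  key=lambda k: squared_distance(frame, codebook[k]))
        indices.append(idx)
        quantized.append(list(codebook[idx]))
    return indices, quantized

# Hypothetical 3-entry codebook and two speakers saying "the same content":
# the frames differ slightly but fall into the same codebook cells.
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]]
speaker_a = [[0.9, 1.1], [0.1, -0.1]]
speaker_b = [[1.2, 0.8], [-0.1, 0.1]]
idx_a, quant_a = quantize(speaker_a, codebook)
idx_b, quant_b = quantize(speaker_b, codebook)
```

Both toy utterances map to the same index sequence, which is the effect the VQ layer is meant to encourage: small speaker-dependent variation is discarded by the lookup.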

- Speaker Encoder

The speaker encoder aims to produce rich speaker features that are independent of the linguistic content
- Unlike existing methods that use a single vector of speaker information, the paper uses frame-wise speaker features to capture the time-varying timbre of different utterances
- The reference speech utterance is passed to HuBERT, and the final representation produced by the adapter is passed to the decoder, where it is used to transform the content-only feature into a rich acoustic feature

- CFM Decoder

The decoder receives the content and speaker features and generates the mel-spectrogram of the converted speech
- The paper uses the OT-CFM objective, which regresses a vector field whose flow maps the prior distribution to the target data distribution
- Architecturally, it is a Transformer-based U-Net that receives the speaker condition
  - In particular, the self-attention layers of the Transformer blocks are replaced with cross-attention layers, with the encoded speaker feature used as key and value
- As a result, conditioning at multiple points through cross-attention lets the decoder model the acoustic details of various speakers
  - The adapter of the speaker encoder is optimized to produce rich speaker information by combining the multiple outputs of HuBERT
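
The cross-attention conditioning can be illustrated with a single-head scaled dot-product sketch in which decoder frames are the queries and frame-wise speaker features are the keys and values. The learned query/key/value projections, the multi-head split, and all toy values are omitted or assumed here; this is not the paper's decoder block.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    """Single-head scaled dot-product cross-attention.

    queries: T_q decoder frames (content side).
    keys, values: T_k frames derived from the speaker feature.
    Learned projections are left out for brevity (assumption).
    """
    scale = 1.0 / math.sqrt(len(keys[0]))
    dim_v = len(values[0])
    outputs = []
    for q in queries:
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        attn = softmax(scores)  # one weight per reference frame
        out = [sum(a * v[d] for a, v in zip(attn, values)) for d in range(dim_v)]
        outputs.append(out)
    return outputs

decoder_hidden = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 3 decoder frames
speaker_feature = [[10.0, 0.0], [0.0, 10.0]]           # 2 reference frames
attended = cross_attention(decoder_hidden, speaker_feature, speaker_feature)
```

The output keeps the length of the decoder sequence, while each output frame is a mixture of reference frames. This is why a variable-length, frame-wise speaker feature can be used without pooling it into a single vector.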

- Training Objective

AdaptVC is trained with a commitment loss, a prior loss, and an OT-CFM loss
- First, the commitment loss makes the VQ layer input commit to the codebook vectors:
  (Eq. 1) $\mathcal{L}_{commit}=\text{MSE}\left(\mathbf{h}_{cont},\text{sg}[\mathbf{e}]\right)$
  - $\text{MSE}$ : Mean Squared Error, $\text{sg}[\cdot]$ : stop-gradient operator
  - $\mathbf{h}_{cont}$ : adapter output in the content encoder, $\mathbf{e}$ : codebook vector of the VQ layer
- The prior loss minimizes the negative log-likelihood of the mel-spectrogram under the prior distribution:
  (Eq. 2) $\mathcal{L}_{prior}=-\sum_{i=1}^{T}\log \varphi(\mathbf{x}_{i};\mu_{i},I)$
  - $\mathbf{x}$ : target mel-spectrogram, $\varphi(\cdot;\mu_{i},I)$ : probability density function of $\mathcal{N}(\mu_{i},I)$, $T$ : temporal length
  - This loss lets the codebook vectors of the VQ layer represent discrete yet nuanced information
- Additionally, a vector field with linear trajectories is estimated via Optimal Transport (OT):
  (Eq. 3) $\mathcal{L}_{dec}=\mathbb{E}_{t,q(\mathbf{x}_{1}),p_{0}(\mathbf{x}_{0})}\left\| u_{t}^{OT}\left(\phi_{t}^{OT}(\mathbf{x}_{0})|\mathbf{x}_{1}\right)-v_{t}\left(\phi_{t}^{OT}(\mathbf{x}_{0})|\mu,\mathbf{h}_{spk};\theta\right)\right\|^{2}$
  - $\theta$ : network parameters, $\phi_{t}^{OT}(\mathbf{x}_{0})=(1-(1-\sigma_{\min})t)\mathbf{x}_{0}+t\mathbf{x}_{1}$ : flow that maps the source distribution to the target distribution
  - $u_{t}^{OT}$ : known vector field that generates an approximate path from the prior distribution $p_{0}$ to the target data distribution $p_{1}$
  - $\mu$ : prior mean from Eq. 2, $\mathbf{h}_{spk}$ : continuous speaker feature obtained from the speaker encoder
- The resulting total objective is:
  (Eq. 4) $\mathcal{L}_{total}=\mathcal{L}_{commit}+\mathcal{L}_{prior}+\mathcal{L}_{dec}$
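
To make Eq. 3 concrete, the sketch below evaluates the OT flow, its regression target, and one Monte Carlo sample of the squared error. Differentiating the flow with respect to $t$ gives the constant target $\mathbf{x}_{1}-(1-\sigma_{\min})\mathbf{x}_{0}$, which is what `ot_target_field` returns. The value of `SIGMA_MIN`, the standard normal prior for $\mathbf{x}_{0}$, and the `zero_field` stand-in for the network are assumptions for illustration; the actual decoder is additionally conditioned on $\mu$ and $\mathbf{h}_{spk}$.

```python
import random

SIGMA_MIN = 1e-4  # hypothetical value; the post does not state it

def ot_flow(x0, x1, t, sigma_min=SIGMA_MIN):
    """phi_t(x0) = (1 - (1 - sigma_min) * t) * x0 + t * x1."""
    return [(1.0 - (1.0 - sigma_min) * t) * a + t * b for a, b in zip(x0, x1)]

def ot_target_field(x0, x1, sigma_min=SIGMA_MIN):
    """u_t = d/dt phi_t(x0) = x1 - (1 - sigma_min) * x0, constant in t."""
    return [b - (1.0 - sigma_min) * a for a, b in zip(x0, x1)]

def cfm_loss_sample(vector_field, x1, rng, sigma_min=SIGMA_MIN):
    """One Monte Carlo sample of the squared error inside Eq. 3."""
    t = rng.random()                          # t ~ U[0, 1)
    x0 = [rng.gauss(0.0, 1.0) for _ in x1]    # assumption: p_0 = N(0, I)
    xt = ot_flow(x0, x1, t, sigma_min)
    target = ot_target_field(x0, x1, sigma_min)
    pred = vector_field(xt, t)
    return sum((u - v) ** 2 for u, v in zip(target, pred))

def zero_field(xt, t):
    """Placeholder for the network v_t; always predicts zero."""
    return [0.0 for _ in xt]

rng = random.Random(0)
x1 = [0.5, -1.0, 2.0]  # toy stand-in for one mel-spectrogram frame
loss = cfm_loss_sample(zero_field, x1, rng)
```

Because the target does not depend on $t$, each sample travels from $\mathbf{x}_{0}$ to (almost) $\mathbf{x}_{1}$ along a straight line, which is the "linear trajectory" property mentioned above.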

3. Experiments

- Settings

Dataset : LibriTTS, VCTK
Comparisons : kNN-VC, DDDM-VC, DiffVC

- Results

Overall, AdaptVC achieves the best performance

Analysis on Adapter Weights
- The content encoder's adapter uses the second and last outputs of HuBERT
- The speaker encoder's adapter assigns the largest weight to the first layer

Ablation Study
- Removing any of the components degrades performance
