[Paper Review] Vec-Tok-VC+: Residual-enhanced Robust Zero-shot Voice Conversion with Progressive Constraints in a Dual-mode Training Strategy

feVeRin 2024. 9. 16. 09:55

Vec-Tok-VC+: Residual-enhanced Robust Zero-shot Voice Conversion with Progressive Constraints in a Dual-mode Training Strategy

Voice conversion quality is limited by the semantic loss of the decoupling process and by the training-inference mismatch
Vec-Tok-VC+
- Introduces a residual-enhanced $K$-means decoupler that improves semantic content extraction through a two-layer clustering process
- Designs a dual-mode training strategy that uses teacher-guided refinement to mitigate the training-inference mismatch
- Additionally introduces a multi-codebook progressive loss that constrains layer-wise outputs to improve speaker similarity and content accuracy
Paper (INTERSPEECH 2024) : Paper Link

1. Introduction

Voice Conversion (VC) aims to convert a source speaker's voice into a target speaker's voice without changing the linguistic content
- In particular, zero-shot VC performs the conversion from a single utterance, without large-scale data collection
  1. Effective zero-shot VC requires modeling the unseen speaker timbre and decoupling the source semantic content
    - Representative methods such as VQVC+ and AutoVC use an information bottleneck to separate speaker timbre from the content representation
    - Alternatively, a mutual information constraint can be used as in VQMIVC, or adversarial training as in CycleGAN-VC
  2. BUT, these disentanglement approaches suffer from a trade-off between speaker similarity and quality
- Meanwhile, FragmentVC captures speaker timbre from multi-reference speech in a finer-grained way, obtaining multi-level/time-varying representations to model speaker timbre
  1. In particular, instead of explicit disentanglement during VC training, a Speaker Verification (SV) model or an Automatic Speech Recognition (ASR) model can be used before training to extract speaker and content representations, respectively
    - BUT, limited model capacity degrades VC performance on unseen speakers
  2. Self-Supervised Learning (SSL) models such as HuBERT and WavLM, on the other hand, have the advantage of capturing local general structure
    - BUT, decoupling processes based on $K$-Nearest Neighbor or $K$-means clustering damage the linguistic content and speaking variation, which degrades synthesis quality
    - In addition, the training-inference mismatch means the decoupling between speaker timbre and content information cannot be guaranteed

-> Therefore, the paper proposes Vec-Tok-VC+, which builds on Vec-Tok-Codec and introduces residual-enhanced $K$-means clustering

Vec-Tok-VC+
- A residual-enhanced $K$-means decoupler yields enhanced semantic features, conditioned on a 3-second target speaker prompt
  - $K$-means quantization based on Residual Vector Quantization (RVQ) encodes the residual information of the linguistic content and rich speaking variation, enhancing the semantic content
- For better decoupling and to reduce the training-inference mismatch, a dual-mode training strategy with a teacher-guided refinement process is introduced
- Additionally, a multi-codebook loss is adopted to prevent information dispersion during multi-layer modeling and to fit the target speech in a coarse-to-fine manner

< Overview of Vec-Tok-VC+ >

A prompt-based zero-shot VC model that combines Vec-Tok-Codec with residual-enhanced $K$-means clustering
As a result, it achieves better conversion performance than existing methods

2. Method

- System Overview

Vec-Tok-Codec extracts continuous acoustic features with WavLM and decouples semantic features from the acoustic features through 300-category $K$-means clustering
- Vec-Tok-VC builds on Vec-Tok-Codec: it concatenates the target speaker's acoustic feature prompt and the source speech's semantic features along the temporal axis and passes them to a Conformer-based converter to perform zero-shot VC
- Vec-Tok-VC+ improves on Vec-Tok-VC and consists of a Residual-enhanced $K$-means Decoupler, a Prompt-based Conformer Converter, and a Teacher Module
  1. The feature extractor extracts continuous SSL features that contain both semantic content and speaker timbre information
    - Vec-Tok-VC+ is then built on top of these SSL features
  2. The decoupler uses residual-enhanced $K$-means quantization to separate speaker timbre from the content information and obtain an enhanced content representation
  3. The Conformer-based converter uses a short speaker utterance as the speaker prompt and predicts the target SSL features from the content representation of the source speech
    - A teacher module is used here to mitigate the training-inference mismatch
  4. Finally, a HiFi-GAN vocoder reconstructs the waveform from the SSL features
- Feature Extraction
  1. Instead of spectrograms or speech codecs, the paper represents speech with the continuous SSL features of XLSR, a multilingual variant of Wav2Vec 2.0
  2. Because these SSL features are rich in semantic and speaker information and thus allow high-quality reconstruction, an XLSR-based vocoder is adopted for reconstruction
- Training Stage
  1. As shown in (a) of the figure below, Vec-Tok-VC+ is trained to convert source SSL features into target SSL features, conditioned on an explicitly given speaker prompt
  2. During training, the source and target SSL features share the same content, but when teacher guidance is activated the target SSL features are generated by the teacher module
    - The speaker prompt is randomly selected from the target feature sequence
- Zero-Shot Inference
  - As shown in (b) of the figure below, when SSL features from a target speaker utterance are given as the speaker prompt, Vec-Tok-VC+ outputs converted speech that carries the source semantic content and the target speaker timbre

- Residual-enhanced $K$-Means Decoupler

Decoupling speech components is essential for zero-shot VC
- Continuous SSL features generally contain rich semantic and speaker timbre information, so $K$-means quantization is used to form an information bottleneck that removes speaker timbre from the content information
  - BUT, this process damages linguistic information and loses speaking variation
- The paper therefore applies residual-enhanced clustering with two $K$-means processes based on the Residual Vector Quantization (RVQ) mechanism, as shown in (a) of the figure below, to obtain an enhanced content representation
  1. Specifically, the first 1024-category $K$-means quantizes the source SSL features into a content representation
  2. The second 256-category $K$-means then takes as input the residual between the raw continuous SSL features and the quantized features, compensating for the lost linguistic information and speaking variation
    - The resulting enhanced content representation is used in the subsequent conversion, and the quantized features are represented by centroid vectors rather than discrete indices
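
The two-stage quantization above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names are hypothetical, the codebooks are assumed to be already fitted (1024 and 256 centroids in the paper, tiny toy codebooks here), and summing the two centroid vectors is the usual RVQ convention, assumed here because the review does not state how the two stages are combined. Plain Python lists keep the sketch dependency-free.

```python
import math


def nearest_centroid(vec, codebook):
    """Return the centroid in `codebook` closest to `vec` (Euclidean distance)."""
    return min(codebook, key=lambda c: math.dist(vec, c))


def residual_enhanced_quantize(feature, codebook1, codebook2):
    """Two-stage quantization of a single SSL frame.

    Stage 1 quantizes the frame itself; stage 2 quantizes the residual that
    stage 1 left behind. The result is a centroid vector, not a discrete index.
    Assumption: the two centroids are summed, as in standard RVQ.
    """
    q1 = nearest_centroid(feature, codebook1)
    residual = [f - q for f, q in zip(feature, q1)]
    q2 = nearest_centroid(residual, codebook2)
    return [a + b for a, b in zip(q1, q2)]


# Toy 2-dimensional codebooks (hypothetical values, for illustration only).
first_codebook = [[0.0, 0.0], [1.0, 1.0]]
second_codebook = [[0.0, 0.0], [0.2, -0.1]]
frame = [1.2, 0.9]

coarse = nearest_centroid(frame, first_codebook)
enhanced = residual_enhanced_quantize(frame, first_codebook, second_codebook)
print("stage-1 error :", math.dist(frame, coarse))
print("enhanced error:", math.dist(frame, enhanced))
```

In the toy example the second stage recovers detail that the first codebook discards, which is the intuition behind compensating for lost linguistic information and speaking variation.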

- Dual-mode Training with Teacher-Guided Refinement

Most VC methods train by reconstruction, where the reference and the source semantic content come from the same utterance, whereas at conversion time they come from different utterances
- This training-inference mismatch makes it hard to guarantee the decoupling between speaker timbre and content information, which degrades performance
  - Here, VC performance can be improved by replacing source SSL representations with target SSL representations according to feature similarity, using $K$-nearest neighbors matching as in kNN-VC
- Vec-Tok-VC+ therefore introduces a dual-mode teacher-guided refinement module, based on kNN-VC, that simulates conversion during training, as shown in (b) of the figure above
  1. First, a matching pool is built by collecting 7 minutes of speech utterances for 490 speakers
    - Each utterance is represented by XLSR-based SSL features
  2. In conversion mode, one target speaker is randomly selected from the matching pool, and the source features are converted by kNN matching into pseudo-target features that carry the target speaker's timbre
  3. In reconstruction mode, the source and target features come from the same speech utterance
- In this dual-mode training process, conversion mode and reconstruction mode are each randomly activated with probability 0.5
  - A 3-second slice is randomly selected from the teacher module output as the target speaker prompt
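
A rough sketch of how the dual-mode target selection might look is given below. All function names are hypothetical, and several details are assumptions rather than facts from the paper: Euclidean distance, `k=4`, and averaging the neighbours are illustrative choices, and the matching pool is modelled as a plain dict from speaker id to that speaker's SSL frames.

```python
import math
import random


def knn_convert(source_frames, pool_frames, k=4):
    """Replace each source frame with the mean of its k nearest pool frames."""
    converted = []
    for frame in source_frames:
        neighbours = sorted(pool_frames, key=lambda p: math.dist(frame, p))[:k]
        converted.append(
            [sum(n[d] for n in neighbours) / len(neighbours) for d in range(len(frame))]
        )
    return converted


def make_training_target(source_frames, matching_pool, rng, p_conversion=0.5):
    """Choose the training target for one utterance.

    Conversion mode: a random pool speaker acts as the teacher's target and
    kNN matching produces pseudo-target frames.
    Reconstruction mode: the target is simply the source utterance itself.
    """
    if rng.random() < p_conversion:
        speaker = rng.choice(sorted(matching_pool))
        return "conversion", knn_convert(source_frames, matching_pool[speaker])
    return "reconstruction", [list(f) for f in source_frames]


def sample_prompt(target_frames, prompt_len, rng):
    """Randomly crop a contiguous speaker prompt of `prompt_len` frames."""
    if len(target_frames) <= prompt_len:
        return list(target_frames)
    start = rng.randrange(len(target_frames) - prompt_len + 1)
    return target_frames[start:start + prompt_len]


rng = random.Random(0)
pool = {"spk_a": [[0.0, 0.0], [0.0, 1.0]], "spk_b": [[5.0, 5.0], [6.0, 6.0]]}
source = [[0.1, 0.2], [5.5, 5.4], [0.3, 0.9]]
mode, target = make_training_target(source, pool, rng)
prompt = sample_prompt(target, prompt_len=2, rng=rng)
print(mode, len(target), len(prompt))
```

Because the prompt is cropped from the teacher output, the converter sees a prompt whose timbre matches the target in both modes; with a 20 ms frame hop (an assumption about the XLSR features), a 3-second prompt would correspond to roughly 150 frames.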

- Prompt-based Conformer Converter

The converter captures the target speaker timbre from the target speaker utterance and fuses it with the decoupled content representation of the source to perform the final conversion
- As shown in (c) of the figure above, the converter is a multi-layer Conformer capable of prompt-based speaker modeling
- Specifically, the converter is based on a non-autoregressive architecture consisting of several Conformer layers and a convolution-based postnet
  1. First, before being fed to the converter, the 3-second speaker prompt is concatenated in front of the content embedding along the temporal axis
  2. The converter can then exploit the in-context learning ability of the Conformer to capture fine-grained speaker information and fuse it with the source content
- A Mean Squared Error (MSE) loss $\mathcal{L}_{mse}$ measures the distance between the target and predicted features
  - In addition, a structural similarity loss $\mathcal{L}_{ssim}$ is introduced to improve generation quality, and a multi-codebook progressive constraint is applied to prevent information dispersion during multi-layer modeling
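
The input construction and the MSE term can be illustrated with a small sketch. The helper names are hypothetical and the Conformer itself is omitted; the snippet only shows the prompt being prepended along the time axis and how a frame-wise MSE would be computed on feature sequences represented as lists of frames.

```python
def prepend_prompt(prompt_frames, content_frames):
    """Concatenate the speaker prompt in front of the content embedding (time axis)."""
    return list(prompt_frames) + list(content_frames)


def mse_loss(predicted, target):
    """Mean squared error over all frames and feature dimensions."""
    if len(predicted) != len(target):
        raise ValueError("predicted and target must have the same number of frames")
    squared = [
        (p - t) ** 2
        for p_frame, t_frame in zip(predicted, target)
        for p, t in zip(p_frame, t_frame)
    ]
    return sum(squared) / len(squared)


# Toy shapes: a 2-frame prompt and a 3-frame content sequence, 2 dims each.
prompt = [[0.5, 0.5], [0.4, 0.6]]
content = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
converter_input = prepend_prompt(prompt, content)
print(len(converter_input))  # number of frames = prompt frames + content frames

print(mse_loss([[1.0, 2.0]], [[1.0, 4.0]]))
```

Placing the prompt in the same sequence as the content is what lets self-attention inside the Conformer layers read speaker cues directly from the prompt frames.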

Multi-Codebook Progressive Constraint
- From the bottom layer to the top layer, the hidden outputs of the converter layers must carry rich information so that the complex SSL features can be transferred
- To supervise this process and ensure content accuracy, the paper applies a multi-codebook progressive constraint to the hidden layers of the converter
  1. First, three $K$-means clusterings are run on the target features, with small, medium, and large codebook sizes of 2048, 4096, and 8192, respectively
    - Quantization at different granularities, from small to large, can encode different kinds of information about the speech
  2. The quantized features from the small codebook are used to constrain the hidden output of the bottom layer
  3. Finally, the progressive loss $\mathcal{L}_{pro}$ is the sum of the cross-entropy losses between the hidden outputs and the quantization results:
    (Eq. 1) $\mathcal{L}_{pro}=\mathcal{L}_{small}+\mathcal{L}_{medium}+\mathcal{L}_{large}$
- As a result, the total loss function of Vec-Tok-VC+ is:
  (Eq. 2) $\mathcal{L}_{total}=\mathcal{L}_{mse}+\mathcal{L}_{ssim}+\mathcal{L}_{pro}$
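
Eq. 1 can be made concrete with a small sketch. It assumes that each constrained hidden layer is projected to per-frame logits over the entries of its codebook, and that the label for a frame is the index of the nearest centroid to the target feature; the review does not spell out either detail, so both are assumptions, as are all function names. Toy codebooks of size 2, 3, and 4 stand in for the 2048/4096/8192 codebooks.

```python
import math


def nearest_index(vec, codebook):
    """Index of the centroid closest to `vec` (the frame's cluster label)."""
    return min(range(len(codebook)), key=lambda i: math.dist(vec, codebook[i]))


def cross_entropy(logits, target_index):
    """-log softmax(logits)[target_index], computed stably via log-sum-exp."""
    peak = max(logits)
    log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_sum_exp - logits[target_index]


def level_loss(frame_logits, target_frames, codebook):
    """Mean cross-entropy between one layer's logits and the target cluster labels."""
    losses = [
        cross_entropy(logits, nearest_index(frame, codebook))
        for logits, frame in zip(frame_logits, target_frames)
    ]
    return sum(losses) / len(losses)


def progressive_loss(logits_by_level, target_frames, codebooks):
    """L_pro = L_small + L_medium + L_large (Eq. 1)."""
    return sum(
        level_loss(logits_by_level[level], target_frames, codebooks[level])
        for level in ("small", "medium", "large")
    )


codebooks = {
    "small": [[0.0], [1.0]],
    "medium": [[0.0], [0.5], [1.0]],
    "large": [[0.0], [0.3], [0.6], [1.0]],
}
target = [[0.9], [0.1]]
# Uninformative (all-zero) logits for every frame at every level.
logits = {name: [[0.0] * len(cb) for _ in target] for name, cb in codebooks.items()}
print(progressive_loss(logits, target, codebooks))
```

With uninformative logits each level contributes the log of its codebook size, so larger codebooks impose a harder classification target, which matches the coarse-to-fine intent of the constraint.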

3. Experiments

- Settings

Dataset : LibriTTS, GigaSpeech, Chinese audiobook dataset
Comparisons : LM-VC, SEF-VC

- Results

Overall, Vec-Tok-VC+ achieves the best performance

In the ablation study, removing any of the components degrades performance
