[Paper Review] VQVC: One-Shot Voice Conversion by Vector Quantization

feVeRin 2024. 8. 10. 08:28

VQVC: One-Shot Voice Conversion by Vector Quantization

Voice conversion can be performed without any supervision from speaker labels
VQVC
- Models the content embedding as discrete codes and treats the difference between the quantize-before and quantize-after vectors as the speaker embedding
- Achieves strong disentanglement of content/speaker information using only the reconstruction loss on top of vector quantization
Paper (ICASSP 2020) : Paper Link

1. Introduction

Voice Conversion (VC) aims to convert a source speaker's voice into a target speaker's voice without changing the linguistic information
- To imitate the target speaker, a VC system must be able to change the source speaker's tone, accent, pronunciation, etc.
- VC can be broadly divided into supervised/unsupervised methods
  1. Supervised methods use models such as GMM to learn a mapping to a specific target distribution
    - BUT, they require parallel training data with frame-level alignment and lack flexibility
  2. Unsupervised methods perform VC using GAN, VAE, etc.
    - BUT, synthesis for unseen data is difficult
- Meanwhile, one-shot learning can support high-quality conversion for unseen speakers using a single utterance each from the source and target speaker
  1. For this, the VC system must learn how to disentangle content and speaker information
  2. To this end, AdaIN-VC uses one-shot learning based on Instance Normalization (IN)
    - BUT, the discrete codes learned by a Vector Quantization (VQ)-based AutoEncoder can provide better disentanglement than IN

-> Hence the paper proposes VQVC, a one-shot VC system based on VQ

VQVC
- Uses VQ to disentangle content and speaker information, and extracts the latent representations for both with a single encoder
  - The encoder decomposes the latent representation into discrete/continuous parts, and the decoder adds the two parts to reconstruct the input audio
- Using only the reconstruction loss, the discrete part automatically captures phonetic information while the continuous part captures speaker information

< Overview of VQVC >

Introduces a VQ method that disentangles content and speaker information
As a result, achieves effective one-shot VC performance without supervision

2. Method

In VQVC, the VQ model learns to represent speaker information as the difference between the continuous space and the discrete codes
- In general, a VQ-based AutoEncoder has a codebook $\mathcal{Q}$ consisting of a set of vectors (codes)
  1. In a well-trained VQ-based model, these codes can be interpreted as phoneme-related descriptors
    - i.e., two speakers uttering the same sentence are projected onto similar codes
  2. Since the quantize-after point is highly related to content information, the information discarded in the quantization step can be treated as speaker information
    - Therefore, the VQ model learns speaker information as the difference between the continuous space and the discrete codes
- Let $x$ be an acoustic feature segment sampled from the collection of all segments in the training data $\mathcal{X}$, and let $\mathcal{Q}$ be a trainable codebook consisting of a set of vectors
  1. The paper then uses an encoder-decoder architecture in which a single encoder extracts both the content embedding and the speaker embedding
  2. With $\text{enc}$ as the encoder and $\text{dec}$ as the decoder, the quantization function $\text{Quantize}$ is:
    (Eq. 1) $\text{Quantize}(\hat{x})=q_{j},\,\,\, q_{j}=\arg\min_{q\in\mathcal{Q}}(|| \hat{x}-q||_{2}^{2})$
    - The $\text{Quantize}$ function takes $\hat{x}$ as input and outputs the vector $q_{j}$ closest to $\hat{x}$
  3. The content embedding $C_{x}$ and speaker embedding $S_{x}$ are:
    (Eq. 2) $C_{x}=\text{Quantize}(\text{enc}(x)),\,\,\, S_{x}=\mathbb{E}_{t}[\text{enc}(x)-C_{x}]$
    - The expectation $\mathbb{E}_{t}$ is taken over the segment length in the latent space, so $S_{x}$ represents global information of the segment $x$
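Eq. 1 and Eq. 2 can be sketched in a few lines. This is only a minimal illustration on plain Python lists, not the paper's implementation: the function names (`squared_l2`, `quantize`, `content_and_speaker`), the toy codebook, and the toy encoder output are all assumptions made for the example, and a real model would run this on batched tensors.

```python
# Minimal sketch of Eq. 1 and Eq. 2 (names and toy values are illustrative).

def squared_l2(a, b):
    """Squared Euclidean distance ||a - b||_2^2 between two vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def quantize(x_hat, codebook):
    """Eq. 1: return the code q_j in the codebook that is closest to x_hat."""
    return min(codebook, key=lambda q: squared_l2(x_hat, q))


def content_and_speaker(enc_x, codebook):
    """Eq. 2: C_x is the frame-wise quantized sequence, S_x the time-averaged residual.

    enc_x stands in for enc(x): a list of T frames, each a D-dimensional vector.
    """
    c_x = [quantize(frame, codebook) for frame in enc_x]
    num_frames, dim = len(enc_x), len(enc_x[0])
    s_x = [
        sum(enc_x[t][d] - c_x[t][d] for t in range(num_frames)) / num_frames
        for d in range(dim)
    ]
    return c_x, s_x


# Toy data: 2 codes, 3 frames, 2 dimensions.
codebook = [[0.0, 0.0], [1.0, 1.0]]
enc_x = [[0.25, 0.25], [1.25, 1.25], [1.25, 0.75]]
c_x, s_x = content_and_speaker(enc_x, codebook)
print("C_x:", c_x)
print("S_x:", s_x)
```

Note that $C_{x}$ keeps one code per frame, while $S_{x}$ collapses the time axis into a single vector, which is what makes it a global (utterance-level) descriptor.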
- Since $S_{x}$ represents speaker information, the reconstruction loss is obtained as:
  (Eq. 3) $\mathcal{L}_{rec}(\mathcal{Q},\theta_{enc},\theta_{dec})=\mathbb{E}_{x\in \mathcal{X}}[|| \text{dec}(C_{x}+S_{x})-x ||_{1}^{1}]$
  - Because $S_{x}$ is obtained as the mean of the difference between $\text{enc}(x)$ and $C_{x}$, adding $S_{x}$ to $C_{x}$ is the more reasonable way to combine them
- In addition, a latent loss $\mathcal{L}_{latent}$ that minimizes the distance between the discrete codes and the continuous space is added so that the discrete codes can represent the projected points:
  (Eq. 4) $\mathcal {L}_{latent}(\theta_{enc})=\mathbb {E}_{t}[|| \text{enc}(x)-C_{x}||_{2}^{2}]$
  - $\mathbb{E}_{t}$ : expectation over the segment length
  - Updating only $\text{enc}$, and not the codebook $\mathcal{Q}$, to minimize $\mathcal{L}_{latent}$ keeps $C_{x}$ from learning speaker information
- As a result, the total loss of VQVC is:
  (Eq. 5) $\mathcal{L}=\mathcal{L}_{rec}+\lambda\mathcal{L}_{latent}$
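To make Eq. 3 to Eq. 5 concrete, the sketch below evaluates the three loss values for a single toy segment. Several things here are assumptions for illustration only: the decoder is replaced by an identity function (`identity_dec`), the encoder output is taken to be the input itself, and `lam = 0.5` is an arbitrary stand-in for $\lambda$. The sketch only computes numbers; it does not model the gradient routing in which $\mathcal{L}_{latent}$ updates the encoder alone.

```python
# Sketch of Eq. 3-5 for one segment (identity decoder and lam are assumptions).

def decoder_input(c_x, s_x):
    """Add the global speaker vector S_x to every frame of C_x."""
    return [[c + s for c, s in zip(frame, s_x)] for frame in c_x]


def l1_reconstruction_loss(x, x_rec):
    """Eq. 3 for a single segment: ||dec(C_x + S_x) - x||_1."""
    return sum(
        abs(r - v)
        for x_frame, r_frame in zip(x, x_rec)
        for v, r in zip(x_frame, r_frame)
    )


def latent_loss(enc_x, c_x):
    """Eq. 4: mean over frames of ||enc(x) - C_x||_2^2."""
    per_frame = [
        sum((e - c) ** 2 for e, c in zip(e_frame, c_frame))
        for e_frame, c_frame in zip(enc_x, c_x)
    ]
    return sum(per_frame) / len(per_frame)


def identity_dec(z):
    """Hypothetical stand-in for the trained decoder."""
    return z


x = [[0.5, 0.0], [1.0, 1.5]]        # toy input segment (2 frames, 2 dims)
enc_x = x                           # pretend the encoder is the identity
c_x = [[0.0, 0.0], [1.0, 1.0]]      # nearest codes from codebook {(0,0), (1,1)}
s_x = [0.25, 0.25]                  # time-averaged residual enc(x) - C_x
lam = 0.5                           # assumed weight, not taken from the paper

rec = l1_reconstruction_loss(x, identity_dec(decoder_input(c_x, s_x)))
lat = latent_loss(enc_x, c_x)
total = rec + lam * lat             # Eq. 5
print(rec, lat, total)
```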
- For stability, the paper normalizes the $L2$-norm of every element of $C_{x},S_{x},\mathcal{Q}$ to 1
  1. Additionally, IN is applied before VQ so that more information about the speaker can be provided
  2. Conversion from source $x$ to target $y$ is performed by replacing $S_{x}$ with $S_{y}$, which is added to $C_{x}$ before being passed to the decoder
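The conversion rule above amounts to $\text{dec}(C_{x}+S_{y})$, and the two normalization steps are simple per-vector and per-channel operations. The following is a rough sketch under stated assumptions: `l2_normalize`, `instance_norm`, `convert`, and `identity_dec` are hypothetical helper names, the `eps` values are arbitrary, and the exact point in the pipeline where each normalization is applied is not specified here beyond "IN before VQ".

```python
import math

# Sketch of the normalization steps and the conversion rule (all names are illustrative).

def l2_normalize(v, eps=1e-12):
    """Scale a vector to unit L2 norm (eps guards against division by zero)."""
    norm = math.sqrt(sum(c * c for c in v))
    return [c / max(norm, eps) for c in v]


def instance_norm(frames, eps=1e-5):
    """Normalize each channel to zero mean and unit variance across time."""
    num_frames, dim = len(frames), len(frames[0])
    out = [[0.0] * dim for _ in range(num_frames)]
    for d in range(dim):
        channel = [frames[t][d] for t in range(num_frames)]
        mean = sum(channel) / num_frames
        var = sum((c - mean) ** 2 for c in channel) / num_frames
        for t in range(num_frames):
            out[t][d] = (frames[t][d] - mean) / math.sqrt(var + eps)
    return out


def convert(c_x, s_y, dec):
    """Conversion: keep the source content C_x, swap in the target speaker S_y."""
    return dec([[c + s for c, s in zip(frame, s_y)] for frame in c_x])


def identity_dec(z):
    """Hypothetical stand-in for the trained decoder."""
    return z


c_x = [[0.0, 1.0], [1.0, 0.0]]   # content codes from the source utterance
s_y = [0.5, -0.5]                # speaker embedding from the target utterance
converted = convert(c_x, s_y, identity_dec)
print(converted)
print(l2_normalize([3.0, 4.0]))
print(instance_norm([[1.0, 2.0], [3.0, 2.0]]))
```

Because $S_{y}$ is a single vector per utterance, one target utterance is enough to supply it, which is what makes the scheme one-shot.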

3. Experiments

- Settings

Dataset : VCTK
Comparisons : AutoVC

- Results

Ability of Disentanglement
- VQVC with IN shows better disentanglement than using VQ or IN alone
- In particular, VQ with IN cuts the speaker identification accuracy in half at every codebook size $Q_{x}$
  - i.e., even without any constraint, the discrete codes learn to discard speaker information

Speaker Embedding Visualization
- $S_{x}$ is generated with a segment length of 120 for seen/unseen speakers, and the $Q32$ results are visualized with t-SNE
- The speaker embedding is learned effectively even though no explicit objective or constraint is added to the encoder

Ablation Study on Architecture
- An ablation study is performed to examine the effect of codebook normalization and IN
- Normalizing the codebook $\mathcal{Q}$ and placing an IN layer before the quantization layer greatly improves training speed and performance

Subjective Evaluation
- VQVC also shows strong conversion performance in the subjective evaluation
