[Paper Review] FreeVC: Towards High-Quality Text-Free One-Shot Voice Conversion

feVeRin 2024. 8. 28. 09:18

FreeVC: Towards High-Quality Text-Free One-Shot Voice Conversion

Existing voice conversion methods either leak speaker information or require a large amount of annotated data
FreeVC
- Adopts the end-to-end framework of VITS and extracts clean content information without text annotation
  - In particular, it imposes an information bottleneck on WavLM features to disentangle content information
- Applies spectrogram-resize based data augmentation to improve the purity of the extracted content information
Paper (ICASSP 2023) : Paper Link

1. Introduction

Voice Conversion (VC) aims to change the source speaker into a target speaker while preserving the linguistic content
- A typical approach disentangles content and speaker information from the source and target speech, then reconstructs the converted speech
  - Good conversion performance therefore requires both disentanglement ability and reconstruction ability
- Depending on how content information is disentangled, VC can be classified into text-based VC and text-free VC
  1. Text-based VC generally uses an Automatic Speech Recognition (ASR) model to extract a Phonetic PosteriorGram (PPG) as the content representation
    - BUT, training the ASR model requires a considerable amount of annotated data
  2. Text-free VC can use an information bottleneck as in AutoVC, vector quantization as in VQVC+, or instance normalization as in AGAIN-VC
    - BUT, source speaker information leaks easily, so performance is lower than that of text-based VC
- Meanwhile, most VC systems follow a 2-stage pipeline consisting of a conversion model and a vocoder
  1. BUT, the acoustic features predicted by the conversion model may differ from the distribution the vocoder was trained on, which can degrade reconstruction quality
  2. VITS addresses this feature mismatch by building an end-to-end pipeline through the latent variables of a conditional Variational AutoEncoder (cVAE)

-> The paper therefore proposes FreeVC, a text-free VC model built on the end-to-end framework of VITS

FreeVC
- Leverages the reconstruction ability of VITS and learns to disentangle content information without text annotation
  - It obtains a self-supervised learning (SSL) representation from WavLM and introduces a bottleneck extractor to extract content information from that SSL representation
- Applies Spectrogram-Resize (SR) based data augmentation, which distorts only the speaker information while leaving the content information unchanged, to improve disentanglement ability

< Overview of FreeVC >

A one-shot VC model based on VITS
Adopts spectrogram-resize data augmentation to improve disentanglement ability
As a result, it achieves better conversion performance than the existing methods it is compared against

2. Method

The backbone of FreeVC is inherited from VITS, a cVAE augmented with GAN training
- Unlike the original VITS, the prior encoder takes a raw waveform as input instead of text annotation
- The speaker embedding is extracted by a speaker encoder, which enables one-shot VC

- Model Architecture

FreeVC consists of a prior encoder, posterior encoder, decoder, discriminator, and speaker encoder
- The posterior encoder, decoder, and discriminator follow the VITS architecture, while the prior encoder and speaker encoder are modified as follows
- Prior Encoder
  1. The prior encoder consists of WavLM, a bottleneck extractor, and a normalizing flow
  2. WavLM and the bottleneck extractor extract content information in the form of the modeling distribution $\mathcal{N}(z';\mu_{\theta},\sigma^{2}_{\theta})$
    - WavLM takes the raw waveform as input and produces a 1024-dimensional SSL feature $x_{ssl}$ that contains both content and speaker information
    - To remove the unwanted speaker information in $x_{ssl}$, the bottleneck extractor maps the 1024-dimensional $x_{ssl}$ to a $d$-dimensional representation, with $d$ smaller than 1024
  3. This dimension gap creates an information bottleneck, which forces the resulting low-dimensional representation to discard content-irrelevant information such as noise and speaker information
    - The $d$-dim hidden representation is then projected to a $2d$-dim hidden representation and split into a $d$-dim $\mu_{\theta}$ and a $d$-dim $\sigma_{\theta}$
  4. The normalizing flow, conditioned on the speaker embedding $g$, is used to increase the complexity of the prior distribution
    - Following VITS, it uses multiple affine coupling layers and is made volume-preserving, so that the Jacobian determinant $\left| \det \frac{\partial z'}{\partial z}\right|$ is 1
- Speaker Encoder
  1. The paper uses two kinds of speaker encoder: pretrained and non-pretrained
  2. The pretrained speaker encoder is a speaker verification model trained on a dataset with a large number of speakers
    - This is the same kind of speaker encoder commonly used in existing VC work
  3. The non-pretrained speaker encoder is trained jointly with the other modules from scratch
    - It uses a simple LSTM-based architecture; if the extracted content representation is clean enough, the speaker encoder learns to model the missing speaker information
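
The prior-encoder steps above (1024-dim SSL frame -> $d$-dim bottleneck -> $2d$-dim projection -> split into $\mu_{\theta}$, $\sigma_{\theta}$, followed by a volume-preserving flow) can be sketched per frame in plain Python. This is only a toy illustration under several assumptions that are not stated in the post: plain linear layers stand in for the bottleneck extractor, the second half of the projection is read as $\log\sigma_{\theta}$ so that $\sigma_{\theta}>0$, the flow is a single shift-only coupling layer, and `D`, `g`, and `toy_shift` are hypothetical values chosen for the example.

```python
import math
import random


def init_linear(n_in, n_out, rng):
    """Random weights for a toy linear layer (illustrative only)."""
    scale = 1.0 / math.sqrt(n_in)
    weight = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    bias = [0.0] * n_out
    return weight, bias


def linear(x, layer):
    """Apply y = W x + b to a single vector."""
    weight, bias = layer
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def bottleneck_stats(x_ssl, down, proj, d):
    """Map one SSL frame to (mu, sigma): 1024 -> d -> 2d -> split."""
    hidden = linear(x_ssl, down)          # 1024-dim -> d-dim (the bottleneck)
    stats = linear(hidden, proj)          # d-dim -> 2d-dim
    mu, log_sigma = stats[:d], stats[d:]  # split into two d-dim halves
    # Assumption: the second half is interpreted as log(sigma) so sigma > 0.
    sigma = [math.exp(s) for s in log_sigma]
    return mu, sigma


def shift_only_coupling(z, shift_fn, g, reverse=False):
    """Volume-preserving coupling: the second half is shifted, never scaled."""
    half = len(z) // 2
    z_a, z_b = z[:half], z[half:]
    shift = shift_fn(z_a, g)              # depends on z_a and speaker embedding g
    sign = -1.0 if reverse else 1.0
    return z_a + [b + sign * s for b, s in zip(z_b, shift)]


rng = random.Random(0)
SSL_DIM, D = 1024, 4                      # D is a toy value, not the paper's d
down = init_linear(SSL_DIM, D, rng)
proj = init_linear(D, 2 * D, rng)

x_ssl = [rng.gauss(0.0, 1.0) for _ in range(SSL_DIM)]   # stand-in for a WavLM frame
mu, sigma = bottleneck_stats(x_ssl, down, proj, D)
z_prime = [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]  # z' ~ N(mu, sigma^2)

g = [0.5, -0.25]                          # hypothetical speaker embedding
toy_shift = lambda z_a, g: [sum(z_a) * g[0] + g[1]] * (D - D // 2)
z = shift_only_coupling(z_prime, toy_shift, g, reverse=True)   # z' -> z
z_back = shift_only_coupling(z, toy_shift, g)                  # z -> z'
```

Because the coupling only adds a shift to one half and copies the other half, its Jacobian is triangular with ones on the diagonal, which is one simple way to obtain a determinant of 1; the layer is also trivially invertible, as `z_back` recovering `z_prime` shows.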

- Training Strategy

SR-based Data Augmentation
- A narrow bottleneck can lose content information, while a wide bottleneck can let speaker information through
  - The paper therefore uses SR-based data augmentation to distort the speaker information in the source waveform, so that the model learns to extract clean content information
- SR-based data augmentation is performed in 3 steps:
  1. Obtain the mel-spectrogram $x_{mel}$ from the waveform $y$
  2. Apply the vertical SR operation to $x_{mel}$ to obtain the modified mel-spectrogram $x'_{mel}$
  3. Reconstruct the waveform $y'$ from $x'_{mel}$ with a neural vocoder
- The vertical SR operation treats the mel-spectrogram as an image with a horizontal time axis and a vertical frequency-bin axis, as illustrated in the accompanying figure
  1. It vertically resizes the mel-spectrogram by a ratio $r$ using bilinear interpolation, then pads or cuts the resized mel-spectrogram back to the original shape
  2. If the ratio $r$ is less than 1, the squeezed mel-spectrogram is padded at the top with the highest frequency bin value plus Gaussian noise, which produces speech with lower pitch and closer formant distance
  3. If the ratio $r$ is greater than 1, the redundant frequency bins at the top of the stretched mel-spectrogram are cut, which produces speech with higher pitch and farther formant distance
- From such augmented speech, FreeVC can learn to extract the unchanged content information shared across the different ratios $r$
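
The vertical SR operation described above can be sketched in plain Python as follows. This is a minimal illustration, not the paper's implementation: since only the frequency axis is resized, bilinear interpolation reduces to linear interpolation along that axis here, and the function name `vertical_resize`, the bin-centre mapping, and the `noise_std` parameter are assumptions made for the example.

```python
import random


def vertical_resize(mel, r, noise_std=0.0, seed=0):
    """Resize a mel-spectrogram along the frequency axis by ratio r,
    then pad or cut it back to the original number of bins.

    mel: list of frequency bins (low to high), each a list of time frames.
    """
    n_bins, n_frames = len(mel), len(mel[0])
    new_bins = max(1, round(n_bins * r))
    rng = random.Random(seed)

    resized = []
    for i in range(new_bins):
        # Map the centre of target bin i back onto the source frequency axis.
        pos = (i + 0.5) / r - 0.5
        pos = min(max(pos, 0.0), n_bins - 1)
        lo = int(pos)
        hi = min(lo + 1, n_bins - 1)
        w = pos - lo
        resized.append([(1 - w) * mel[lo][t] + w * mel[hi][t] for t in range(n_frames)])

    if new_bins >= n_bins:
        # r >= 1: cut the redundant bins at the top (highest frequencies).
        return resized[:n_bins]

    # r < 1: pad the top with the highest bin value plus Gaussian noise.
    top = resized[-1]
    pad = [[top[t] + rng.gauss(0.0, noise_std) for t in range(n_frames)]
           for _ in range(n_bins - new_bins)]
    return resized + pad


# Toy spectrogram: 8 bins x 3 frames, each value equals its bin index.
mel = [[float(b)] * 3 for b in range(8)]
same = vertical_resize(mel, 1.0)
squeezed = vertical_resize(mel, 0.5)    # content moves to lower bins
stretched = vertical_resize(mel, 1.5)   # content moves to higher bins
```

In the toy spectrogram each value is its own bin index, so the output shows directly where energy has moved: with $r<1$ the pattern is compressed toward the low bins, and with $r>1$ it is spread upward and the top is cut off.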
Training Loss
- The training loss consists of a cVAE-related loss and a GAN-related loss
  1. The cVAE-related loss consists of the reconstruction loss $\mathcal{L}_{rec}$, which is the $L1$ distance between the target and predicted mel-spectrograms, and $\mathcal{L}_{kl}$, which is the KL-divergence between the posterior distribution $q_{\phi}(z|x_{lin})$ and the prior distribution $p_{\theta}(z|c)$:
    (Eq. 1) $q_{\phi}(z|x_{lin})=\mathcal{N}(z;\mu_{\phi},\sigma_{\phi}^{2})$
    (Eq. 2) $p_{\theta}(z|c)=\mathcal{N}(z';\mu_{\theta},\sigma_{\theta}^{2})\left| \det\frac{\partial z'}{\partial z}\right|$
    - $x_{lin}$ is the linear spectrogram fed to the posterior encoder, the condition $c$ is the content information contained in the waveform $y/y'$, and minimizing $\mathcal{L}_{kl}$ mitigates the feature mismatch problem
  2. The GAN-related loss consists of the adversarial losses $\mathcal{L}_{adv}(D), \mathcal{L}_{adv}(G)$ for the discriminator $D$ and generator $G$, and the feature matching loss $\mathcal{L}_{fm}(G)$ for the generator $G$
- The resulting training loss of FreeVC is:
  (Eq. 3) $\mathcal{L}(G)=\mathcal{L}_{rec}+\mathcal{L}_{kl}+\mathcal{L}_{adv}(G)+\mathcal{L}_{fm}(G)$
  (Eq. 4) $\mathcal{L}(D)=\mathcal{L}_{adv}(D)$
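
As a rough sketch of how the cVAE-related terms and Eq. 3 fit together, the snippet below computes an $L1$ reconstruction loss and a single-sample estimate of $\mathcal{L}_{kl}$ as $\log q_{\phi}(z|x_{lin})-\log p_{\theta}(z|c)$, using Eq. 1 and Eq. 2 with a Jacobian determinant of 1. The single-sample estimator, the function names, and the unweighted sum are assumptions for illustration; any per-term weighting used in practice is not covered by this post.

```python
import math


def gaussian_log_prob(z, mu, sigma):
    """Log-density of a diagonal Gaussian N(z; mu, sigma^2), summed over dims."""
    return sum(
        -0.5 * math.log(2 * math.pi) - math.log(s) - 0.5 * ((x - m) / s) ** 2
        for x, m, s in zip(z, mu, sigma)
    )


def l1_loss(pred, target):
    """Mean absolute error, used here as the reconstruction loss L_rec."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


def kl_estimate(z, mu_q, sigma_q, z_flow, mu_p, sigma_p):
    """Single-sample estimate of KL(q || p) for z ~ q.

    z_flow is z after the normalizing flow (z'). The flow is volume-preserving,
    so log|det dz'/dz| = 0 and no extra Jacobian term appears.
    """
    log_q = gaussian_log_prob(z, mu_q, sigma_q)          # Eq. 1
    log_p = gaussian_log_prob(z_flow, mu_p, sigma_p)     # Eq. 2 with |det| = 1
    return log_q - log_p


def generator_loss(l_rec, l_kl, l_adv_g, l_fm):
    """Eq. 3 as written in this post: an unweighted sum of the four terms."""
    return l_rec + l_kl + l_adv_g + l_fm


# Toy values (hypothetical, for illustration only).
z = [0.3, -1.2]
mu_q, sigma_q = [0.0, -1.0], [1.0, 0.5]
mu_p, sigma_p = [0.1, -0.8], [1.0, 1.0]
l_kl = kl_estimate(z, mu_q, sigma_q, z, mu_p, sigma_p)   # identity flow for the demo
l_rec = l1_loss([1.0, 2.0], [2.0, 4.0])
total = generator_loss(l_rec, l_kl, l_adv_g=0.7, l_fm=0.2)
```

When the posterior and prior coincide and the flow is the identity, `kl_estimate` returns 0, which matches the intuition that $\mathcal{L}_{kl}$ pulls the two distributions together.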

- Inference Procedure

For VC inference, VITS extracts content information through the posterior encoder and applies the normalizing flow of the prior encoder
- In contrast, FreeVC extracts content information in the prior encoder through WavLM and the bottleneck extractor during inference
- As a result, the extracted content representation is not affected by the quality of the source speaker embedding
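
The data flow of this inference path can be sketched as a small function. Everything here is a stand-in: the callables passed to `convert` are hypothetical toy lambdas rather than real WavLM, bottleneck, flow, or decoder modules, and the steps after content extraction (inverting the flow with the target speaker embedding, then decoding) assume the usual VITS-style conversion path rather than anything spelled out in this post.

```python
def convert(source_wav, target_wav, ssl_model, bottleneck, speaker_encoder,
            flow_inverse, decoder, sample):
    """Sketch of FreeVC-style inference; all modules are passed in as callables."""
    x_ssl = ssl_model(source_wav)        # SSL features from the source waveform
    mu, sigma = bottleneck(x_ssl)        # content distribution N(z'; mu, sigma^2)
    z_prime = sample(mu, sigma)          # content latent z'
    g = speaker_encoder(target_wav)      # embedding of the TARGET speaker only
    z = flow_inverse(z_prime, g)         # assumed: invert the flow, conditioned on g
    return decoder(z, g)                 # waveform in the target speaker's voice


# Toy stand-ins so the sketch runs end to end (not real models).
toy = dict(
    ssl_model=lambda wav: [2.0 * v for v in wav],
    bottleneck=lambda x: (x[:2], [0.0, 0.0]),
    speaker_encoder=lambda wav: sum(wav) / len(wav),
    flow_inverse=lambda z, g: [v - g for v in z],
    decoder=lambda z, g: [v + g for v in z],
    sample=lambda mu, sigma: list(mu),   # deterministic: take the mean
)
out = convert([0.1, 0.2, 0.3], [1.0, 3.0], **toy)
```

Note that no source speaker embedding is computed anywhere in `convert`, which is why its quality cannot affect the extracted content representation.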

3. Experiments

- Settings

Dataset : VCTK, LibriTTS
Comparisons : VQMIVC, BNE-PPG-VC, YourTTS

- Results

In terms of MOS, FreeVC achieves the best performance among the compared models

FreeVC also achieves the best results in terms of WER and CER
