[Paper Review] AGAIN-VC: A One-Shot Voice Conversion Using Activation Guidance and Adaptive Instance Normalization

feVeRin 2024. 8. 7. 09:16

AGAIN-VC: A One-Shot Voice Conversion Using Activation Guidance and Adaptive Instance Normalization

Voice conversion typically relies on disentanglement-based learning: speaker and linguistic content are separated, and the speaker information is then replaced with that of the target speaker
AGAIN-VC
- Introduces Activation Guidance and Adaptive Instance Normalization to prevent speaker information from leaking
- Uses a single encoder-decoder, improving synthesis quality and speaker similarity
Paper (ICASSP 2021) : Paper Link

1. Introduction

Voice Conversion (VC) aims to convert source speech into target speech while preserving the linguistic content
- Generative Adversarial Networks (GAN), Flow-based models, and Variational AutoEncoders (VAE) can be used for this
- In particular, one-shot VC, which assumes that the source and target speakers are unseen at inference time, performs conversion by separating speaker and content
  1. AutoVC combines a speaker encoder trained with the GE2E loss and a bottleneck on the content encoder
  2. AdaIN-VC separates speaker and content using Adaptive Instance Normalization (AdaIN)
  3. VQVC uses Vector Quantization (VQ)
- BUT, one-shot VC still has the limitation that the speaker encoder is redundant
  - VQVC in particular can lose content information due to the discrete nature of VQ

-> AGAIN-VC is therefore proposed to improve the trade-off between synthesis quality and disentangling ability in VC

AGAIN-VC
- Uses a single encoder that disentangles speaker and content information, improving synthesis quality while substantially reducing model size
- Introduces Activation Guidance, which uses an activation function as an information bottleneck that guides the content embedding in continuous space, to improve the trade-off

< Overview of AGAIN-VC >

A voice conversion model built on a single encoder and activation guidance
As a result, it outperforms the existing 2-encoder approach

2. Method

- System Overview

Let $\mathbf{X}\in \mathbb{R}^{K\times T}$ be the input mel-spectrogram of an audio segment, where $K$ is the number of frequency bins per frame and $T$ is the segment duration
- AdaIN-VC treats speaker information as a time-invariant global style and content information as a time-varying local style, and encodes them separately
  - Instance Normalization (IN) is used there for feature disentanglement
- In contrast, AGAIN-VC uses a single encoder to extract both the speaker and content representations
  1. Instead of adding a separate speaker encoder, it reuses the channel-wise mean $\mu$ and channel-wise standard deviation $\sigma$ computed in the IN layers as the speaker embedding
  2. Activation Guidance (AG) is then applied to improve VC performance
- AGAIN-VC is trained with the following self-reconstruction loss:
  (Eq. 1) $\mathcal{L}=||\mathbf{X}-\hat{\mathbf{X}}||_{1}^{1}$
  - $\hat{\mathbf{X}}$ : autoencoder output
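
As a rough illustration of (Eq. 1), the sketch below computes the L1 distance between a mel-spectrogram and its reconstruction. It uses plain Python lists to stay dependency-free; the function name `l1_reconstruction_loss` and the toy values are assumptions made for this example, and practical implementations often average over elements instead of summing.

```python
def l1_reconstruction_loss(x, x_hat):
    """Sum of absolute differences between two K x T mel-spectrograms (Eq. 1)."""
    same_shape = len(x) == len(x_hat) and all(
        len(row) == len(row_hat) for row, row_hat in zip(x, x_hat)
    )
    if not same_shape:
        raise ValueError("x and x_hat must have the same K x T shape")
    return sum(
        abs(a - b)
        for row, row_hat in zip(x, x_hat)
        for a, b in zip(row, row_hat)
    )


# Toy 2 x 3 "mel-spectrogram" (K=2 frequency bins, T=3 frames); values are made up.
x = [[1.0, 2.0, 3.0], [0.0, -1.0, 4.0]]
x_hat = [[1.5, 2.0, 2.0], [0.0, 1.0, 4.0]]
loss = l1_reconstruction_loss(x, x_hat)
```

Because the loss only compares the output with the input itself, no parallel source-target pairs are needed during training.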

- Style Transfer using AdaIN

Given an input $\mathbf{Z}\in\mathbb{R}^{K\times T}$, IN computes the channel-wise mean $\mu$ and standard deviation $\sigma$
- The resulting normalized representation is:
  (Eq. 2) $\text{IN}(\mathbf{Z})=\frac{\mathbf{Z}-\mu(\mathbf{Z})}{\sigma(\mathbf{Z})}$
  - Since $\mu, \sigma$ are time-invariant, they can be viewed as a global (speaker) representation
- As shown in the paper's architecture figure, an IN layer is added to each encoder block to detach global information from the hidden encoded representation
  - The detached features $\mu, \sigma$ are reused in the AdaIN layers during the decoding phase
- Consequently, with $\mathbf{H},\mathbf{Z}$ denoting the source and target representations, style transfer proceeds in two steps:
  1. Extract the global features $\mu, \sigma$ from $\mathbf{Z}$
  2. Pass $\mathbf{H}, \mu, \sigma$ to the AdaIN layer:
    (Eq. 3) $\text{AdaIN}(\mathbf{H},\mu(\mathbf{Z}), \sigma(\mathbf{Z})) =\sigma(\mathbf{Z})\text{IN}(\mathbf{H})+\mu(\mathbf{Z})$
    - This transfers the style of $\mathbf{Z}$ onto $\mathbf{H}$ while keeping the content of $\mathbf{H}$ intact
- Structurally, a U-Net architecture is used for this
  1. The input mel-spectrogram $\mathbf{X}$ passes through multiple IN layers, which remove the global (speaker) information
  2. Skip connections pass the speaker embedding $\mu, \sigma$ of each layer to the AdaIN layer of the corresponding decoder block
  3. The generated mel-spectrogram $\hat{\mathbf{X}}$ is finally used to compute (Eq. 1)
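
The two-step transfer in (Eq. 2) and (Eq. 3) can be sketched on small K x T feature maps as below. This is only a minimal illustration in plain Python, not the paper's implementation: the helper names, the toy values, and the stabilizing constant `EPS` are all assumptions introduced here.

```python
import math

EPS = 1e-5  # assumed small constant for numerical stability; not given in the text


def channel_stats(z):
    """Per-channel mean and standard deviation over time for a K x T feature map."""
    mu = [sum(row) / len(row) for row in z]
    sigma = [
        math.sqrt(sum((v - m) ** 2 for v in row) / len(row) + EPS)
        for row, m in zip(z, mu)
    ]
    return mu, sigma


def instance_norm(z):
    """Eq. 2: remove the time-invariant (speaker) statistics from each channel."""
    mu, sigma = channel_stats(z)
    return [[(v - m) / s for v in row] for row, m, s in zip(z, mu, sigma)]


def adain(h, mu_z, sigma_z):
    """Eq. 3: re-style the normalized source h with the target statistics."""
    normalized = instance_norm(h)
    return [[s * v + m for v in row] for row, m, s in zip(normalized, mu_z, sigma_z)]


# Toy features with K=2 channels and T=4 frames; values are made up.
h = [[1.0, 2.0, 3.0, 4.0], [10.0, 10.0, 14.0, 14.0]]  # source
z = [[0.0, 0.0, 2.0, 2.0], [-1.0, 1.0, -1.0, 1.0]]    # target

# Step 1: extract the global features from the target.
mu_z, sigma_z = channel_stats(z)
# Step 2: apply them to the normalized source.
converted = adain(h, mu_z, sigma_z)
mu_c, sigma_c = channel_stats(converted)

# Self-reconstruction case: feeding h its own statistics gives h back.
reconstructed = adain(h, *channel_stats(h))
```

Here `converted` keeps the temporal pattern of `h` but takes on the per-channel mean and standard deviation of `z`, while `reconstructed` matches `h`, which mirrors the self-reconstruction setting of (Eq. 1).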

- Activation Guidance (AG)

Even when dimension reduction or vector quantization is applied, speaker information can still leak through the content embedding $\mathbf{C}$
- The paper therefore adds an activation function as the information bottleneck
- An extra activation function restricts the range of the content embedding
  1. An activation such as ReLU is harsh and can damage the information encoded in the embedding
  2. Sigmoid, in contrast, constrains the embedding adequately without degrading reconstruction performance, which makes it suitable as an information bottleneck
- As a result of this bottleneck, the model learns to draw more information from the speaker embedding formed by $\mu,\sigma$
  1. Since $\mu,\sigma$ are time-invariant, they cannot carry content information and pick up global (speaker) information more easily
  2. The content embedding is thus guided to carry content information only, leaving speaker information to $\mu,\sigma$, which steers the information flow in the intended direction
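
One way to picture why ReLU is called harsh while sigmoid is a gentler constraint is to compare what each does to a few embedding values. The sketch below is an illustration of that reading only; the invertibility argument and the toy values are assumptions of this example, not claims taken from the paper.

```python
import math


def relu(x):
    """Zero out negative values."""
    return max(0.0, x)


def sigmoid(x):
    """Squash any real value into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def logit(y):
    """Inverse of the sigmoid, defined for 0 < y < 1."""
    return math.log(y / (1.0 - y))


content = [-2.0, -0.5, 0.0, 0.5, 2.0]  # made-up content embedding values

relu_out = [relu(v) for v in content]     # all non-positive inputs collapse to 0.0
sig_out = [sigmoid(v) for v in content]   # bounded, but still distinct values
recovered = [logit(y) for y in sig_out]   # the original values can be recovered
```

ReLU maps three of the five inputs to the same value, so that part of the signal cannot be recovered, whereas sigmoid restricts the range to (0, 1) and remains invertible.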

3. Experiments

- Settings

Dataset : VCTK
Comparisons : AdaIN-VC

- Results

Effect of Activation Functions
- As the channel size decreases, the reconstruction error increases and the speaker classification accuracy decreases
- In particular, with sigmoid as AG, most points shift toward the lower right

Comparison of Reconstruction Error and Speaker Classification Accuracy

Different Activation Function
- The following sigmoid function variant is compared against ReLU, ELU, and Tanh:
  (Eq. 4) $\text{Sigmoid}(x)=\frac{1}{1+\exp (-\alpha x)}$
  - $\alpha$ : hyperparameter
- Overall, the sigmoid function with $\alpha=0.1$ achieves the best performance
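
The effect of $\alpha$ in (Eq. 4) can be seen with a small sketch: a smaller $\alpha$ flattens the curve, so the same input range lands in a narrower band around 0.5. The function name `scaled_sigmoid`, the helper `spread`, and the sample inputs are assumptions for illustration; the default of 0.1 simply mirrors the best-performing setting reported above.

```python
import math


def scaled_sigmoid(x, alpha=0.1):
    """Eq. 4: sigmoid with a slope hyperparameter alpha."""
    return 1.0 / (1.0 + math.exp(-alpha * x))


def spread(alpha, lo=-5.0, hi=5.0):
    """Width of the output band for inputs in [lo, hi]."""
    return scaled_sigmoid(hi, alpha) - scaled_sigmoid(lo, alpha)


xs = [-5.0, -1.0, 0.0, 1.0, 5.0]  # made-up pre-activation values
for alpha in (1.0, 0.1):
    ys = [round(scaled_sigmoid(x, alpha), 3) for x in xs]
    print(f"alpha={alpha}: {ys}, spread={spread(alpha):.3f}")
```

The mapping stays monotonic for any positive $\alpha$, so the ordering of the embedding values is preserved even when the output range is compressed.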

Single Encoder
- Compares a single encoder against two independent encoders
- The single encoder is 30% smaller than 2-Enc yet performs almost identically
- 1-Enc-sig in particular achieves the best performance

Subjective Evaluation
- AGAIN-VC also achieves the best performance in the MOS and similarity tests
