[Paper Review] AdaVocoder: Adaptive Vocoder for Custom Voice

feVeRin 2024. 4. 5. 09:45

AdaVocoder: Adaptive Vocoder for Custom Voice

Custom voice aims to build personalized speech synthesis from only a few target recordings
However, a multi-speaker dataset for vocoder training is hard to obtain, and the target speaker's distribution always mismatches that of the training dataset
AdaVocoder
- Introduces a cross-domain consistency loss to build an adaptive vocoder
- Resolves the overfitting of GAN-based vocoders in few-shot transfer learning and obtains high-quality custom voice
Paper (INTERSPEECH 2022) : Paper Link

1. Introduction

In a Text-to-Speech (TTS) pipeline, the acoustic model and the vocoder both strongly affect custom voice synthesis
- For the acoustic model, custom voice is difficult because:
  1. Custom recordings differ from the training data in acoustic conditions such as prosody and emotion, which makes adaptation difficult
  2. Fine-tuning can improve adaptation, but it introduces many adaptation parameters, so there is a trade-off between the number of parameters and synthesis quality
    - AdaSpeech introduces Conditional Layer Normalization to reduce the adaptive parameters while still achieving good synthesis quality
- For the vocoder, custom voice is difficult because:
  1. As with the acoustic model, the distribution mismatch between training data and custom data makes adaptation difficult
  2. In particular, because the vocoder is small and runs on the terminal side, its trade-off between parameters and quality is less significant than for the acoustic model
    - A universal vocoder is therefore commonly used for vocoder adaptation
- Building such a universal vocoder depends above all on speaker variety, but collecting corpora with diverse speakers is difficult in practice
  - As a result, the vocoder still yields low custom voice quality because of the gap between the source domain and the target domain
- Alternatively, the vocoder can be fine-tuned
  - BUT, fine-tuning a recent vocoder such as HiFi-GAN on a few-shot target domain easily leads to overfitting

-> AdaVocoder is therefore proposed to prevent vocoder overfitting in few-shot adaptation

AdaVocoder
- Uses a cross-domain consistency loss to preserve the relative similarities and differences between instances in the source domain
- This prevents the overfitting that occurs when a vocoder is fine-tuned on limited instances

< Overview of AdaVocoder >

An adaptive vocoder with a cross-domain consistency loss is proposed for the custom voice task
As a result, fine-tuning an existing GAN-based vocoder with 10 or fewer few-shot recordings achieves good adaptation performance

2. Method

Suppose a source generator $G_{s}$, which maps a mel-spectrogram $m\sim p_{m}(m) \subset M$ to a waveform $x$, is obtained by training on a large source dataset $\mathcal{D}_{s}$
- AdaVocoder initializes the weights from the source generator and fits them to a small target dataset $\mathcal{D}_{t}$ to obtain the adapted generator $G_{s\rightarrow t}$
- Conventional fine-tuning continues the GAN training steps starting from the trained generator and discriminator:
  (Eq. 1) $\mathcal{L}_{adv}(G,D)=D(G(m))-D(x), \,\,\, G^{*}_{s\rightarrow t}=\mathbb{E}_{m\sim p_{m}(m), x\sim \mathcal{D}_{t}}\arg \min_{G}\max_{D}\mathcal{L}_{adv}(G,D)$
- This transfer method works well when the target dataset contains more than 1000 samples; in few-shot transfer learning, however, the discriminator memorizes the samples, which leads to overfitting
  - AdaVocoder therefore introduces a cross-domain consistency loss to address this overfitting

- Cross-domain Distance Consistency

The cross-domain distance consistency loss was originally proposed to address the diversity problem in image generation
- GAN training for speech synthesis has a similar setup, so pair-wise relative distances can likewise be used to prevent overfitting
- First, a batch of $N+1$ mel-spectrograms $\{m_{n}\}^{N}_{0}$ is sampled, and pair-wise similarities in feature space are used to build an $N$-way probability distribution for each utterance
  1. The probability distributions of the source and adapted generators for the $i$-th mel-spectrogram are:
    (Eq. 2) $y_{i}^{s,l}=\mathrm{Softmax}\left( \left\{ \mathrm{sim}\left(G_{s}^{l}(m_{i}),G_{s}^{l}(m_{j}) \right)\right\}_{\forall j\neq i}\right)$
    $\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\, y_{i}^{s\rightarrow t,l}=\mathrm{Softmax}\left( \left\{ \mathrm{sim}\left(G_{s\rightarrow t}^{l}(m_{i}),G_{s\rightarrow t}^{l}(m_{j}) \right)\right\}_{\forall j\neq i}\right)$
    - $\mathrm{sim}$ : cosine similarity between generator activations at the $l$-th layer
  2. The resulting cross-domain consistency loss is:
    (Eq. 3) $\mathcal{L}_{dist}(G_{s\rightarrow t}, G_{s})=\mathbb{E}_{\{m_{i}\sim p_{m}(m)\}}\sum_{l,i}D_{KL}\left(y_{i}^{s\rightarrow t, l}|| y_{i}^{s,l}\right)$
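
As a rough illustration of Eq. 2 and Eq. 3, the sketch below evaluates the loss value in plain Python on toy activations. The function names, the toy feature vectors, and the choice to treat each layer activation as a flat vector are assumptions made for illustration and are not taken from the paper. An actual training setup would compute this on real generator activations inside an autodiff framework, presumably with $G_{s}$ kept frozen so that gradients only reach $G_{s\rightarrow t}$.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two flat activation vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-8)

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def similarity_distribution(features, i):
    """N-way distribution y_i over all j != i for one layer (Eq. 2)."""
    sims = [cosine_similarity(features[i], features[j])
            for j in range(len(features)) if j != i]
    return softmax(sims)

def kl_divergence(p, q):
    """D_KL(p || q) for two discrete distributions of equal length."""
    return sum(pk * math.log(pk / qk) for pk, qk in zip(p, q) if pk > 0)

def distance_consistency_loss(source_layers, adapted_layers):
    """Sum over layers l and items i of KL(y_i^{s->t,l} || y_i^{s,l}) (Eq. 3)."""
    loss = 0.0
    for src_feats, ada_feats in zip(source_layers, adapted_layers):
        for i in range(len(src_feats)):
            y_src = similarity_distribution(src_feats, i)
            y_ada = similarity_distribution(ada_feats, i)
            loss += kl_divergence(y_ada, y_src)
    return loss

# Toy data (hypothetical): one layer, N + 1 = 3 items, 2-dimensional activations.
source_layers = [[[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]]
# Adapted activations that have nearly collapsed onto a single direction.
collapsed_layers = [[[1.0, 0.0], [1.0, 0.01], [1.0, 0.02]]]

print(distance_consistency_loss(source_layers, source_layers))
print(distance_consistency_loss(source_layers, collapsed_layers))
```

When the adapted activations keep the same pair-wise structure as the source, the loss is zero. When they collapse toward one direction, the similarity distributions flatten and the loss becomes positive, which is the behavior the loss is meant to penalize during few-shot fine-tuning.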

- AdaMelGAN

The generator and discriminator of AdaMelGAN are identical to those of the original MelGAN, except that the cross-domain consistency loss above is added to the training process
1. Generator
  - The MelGAN generator is a fully convolutional feed-forward network that takes a mel-spectrogram $m$ as input and outputs a raw waveform $x$
  - Because the mel-spectrogram has a $256\times$ lower temporal resolution than the waveform, MelGAN stacks transposed convolutional layers to upsample the input sequence
2. Discriminator
  - The MelGAN discriminator aims to capture the characteristics of the waveform so that it provides a better guiding signal to the generator
  - To this end, it adopts a multi-scale architecture with three discriminators $D_{1}, D_{2}, D_{3}$ operating on different audio scales
    - $D_{1}$ operates on raw audio, $D_{2}$ on $2\times$ downsampled audio, and $D_{3}$ on $4\times$ downsampled audio
3. Final Objective
  - MelGAN is trained with the following hinge loss GAN objective:
    (Eq. 4) $\mathcal{L}_{adv}(D_{k},G)=\mathbb{E}_{x}[\min(0,1-D_{k}(x))]+\mathbb{E}_{m}[\min(0,1+D_{k}(G(m)))],\,\, \forall k=1,2,3$
    $\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\, \mathcal{L}_{adv}(G,D)=\mathbb{E}_{m}\left[\sum_{k=1,2,3}-D_{k}(G(m))\right]$
  - In addition, a feature matching loss $\mathcal{L}_{fm}(G,D)$ is applied, and the cross-domain consistency loss above is introduced for few-shot transfer learning:
    (Eq. 5) $G_{s\rightarrow t}^{*}=\arg\min_{G_{s\rightarrow t}}\max_{D}\mathcal{L}_{adv}(G_{s\rightarrow t},D)+\lambda_{cd}\mathcal{L}_{dist}(G_{s\rightarrow t},G_{s})+\lambda_{fm}\sum_{k=1}^{3}\mathcal{L}_{fm}(G_{s\rightarrow t},D_{k})$
    - $\lambda_{cd}$ : coefficient of the cross-domain consistency loss, $\lambda_{fm}$ : coefficient of the feature matching loss
    - The paper sets them to $10^{3}$ and 10, respectively
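
The generator and discriminator layout in items 1 and 2 above can be sketched with simple length arithmetic. The stride list `[8, 8, 2, 2]` is an assumption (any set of strides whose product is 256 gives the same total upsampling), and the non-overlapping average used for downsampling is a simplification chosen for illustration rather than the exact pooling used in MelGAN.

```python
from math import prod

# Assumed transposed-convolution strides; their product must equal the 256x factor.
UPSAMPLE_STRIDES = [8, 8, 2, 2]

def output_samples(num_mel_frames, strides=UPSAMPLE_STRIDES):
    """Waveform length produced from a mel-spectrogram with the given frame count."""
    return num_mel_frames * prod(strides)

def downsample(waveform, factor):
    """Average non-overlapping windows of `factor` samples (simplified pooling)."""
    usable = len(waveform) - len(waveform) % factor
    return [sum(waveform[i:i + factor]) / factor
            for i in range(0, usable, factor)]

def multi_scale_inputs(waveform):
    """Inputs for D_1 (raw audio), D_2 (2x downsampled), D_3 (4x downsampled)."""
    return [waveform, downsample(waveform, 2), downsample(waveform, 4)]

# Toy waveform (hypothetical values) to show the three discriminator scales.
toy_waveform = [0.0, 2.0, 4.0, 6.0, 8.0, 10.0, 12.0, 14.0]
scales = multi_scale_inputs(toy_waveform)

print(output_samples(10))            # samples generated from 10 mel frames
print([len(s) for s in scales])      # lengths seen by D_1, D_2, D_3
```

Each discriminator sees the same signal at a coarser time resolution, so the lower scales emphasize slower structure in the waveform while $D_{1}$ still sees the raw samples.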

- AdaHiFi-GAN

AdaHiFi-GAN is likewise identical to HiFi-GAN, except that the cross-domain consistency loss is added during training
- Generator
  1. As in MelGAN, the generator is built as a fully convolutional neural network
  2. HiFi-GAN additionally uses a multi-receptive field fusion module to process patterns of various lengths in parallel
- Discriminator
  1. HiFi-GAN uses two discriminators: the Multi-Period Discriminator (MPD) and the Multi-Scale Discriminator (MSD)
  2. In particular, the MPD captures the input audio differently for different periods, so it can reflect diverse implicit structures
- Final Objective
  1. The HiFi-GAN losses for the generator $G$ and discriminator $D$ are:
    (Eq. 6) $\mathcal{L}_{adv}(D,G)=\mathbb{E}_{(x,m)}[(D(x)-1)^{2}+(D(G(m)))^{2}]$
    $\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\, \mathcal{L}_{adv}(G,D)=\mathbb{E}_{m}[(D(G(m))-1)^{2}]$
    - In addition, a mel-spectrogram loss $\mathcal{L}_{mel}(G)$ and a feature matching loss $\mathcal{L}_{fm}(G,D)$ can be used to improve the training efficiency of the generator
  2. Introducing the cross-domain consistency loss for few-shot transfer learning gives:
    (Eq. 7) $\displaystyle G_{s\rightarrow t}^{*}=\arg\min_{G_{s\rightarrow t}}\max_{D}\mathcal{L}_{adv}(G_{s\rightarrow t},D)+\lambda_{cd}\mathcal{L}_{dist}(G_{s\rightarrow t},G_{s})+\lambda_{fm}\mathcal{L}_{fm}(G_{s\rightarrow t},D)+\lambda_{mel}\mathcal{L}_{mel}(G_{s\rightarrow t})$
    - $\lambda_{cd}, \lambda_{fm}, \lambda_{mel}$ : coefficients of the cross-domain consistency loss, feature matching loss, and mel-spectrogram loss, respectively
    - The paper sets them to $10^{3}$, 2, and 45, respectively
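
To make Eq. 6 and Eq. 7 concrete, the following sketch computes the least-squares adversarial terms and the weighted generator-side objective from scalar inputs. The function names and the toy scores are hypothetical, the expectations are approximated by batch means, and the individual loss terms ($\mathcal{L}_{dist}$, $\mathcal{L}_{fm}$, $\mathcal{L}_{mel}$) are assumed to be computed elsewhere and passed in as plain numbers.

```python
# Coefficients as reported above: lambda_cd, lambda_fm, lambda_mel.
LAMBDA_CD = 1e3
LAMBDA_FM = 2.0
LAMBDA_MEL = 45.0

def discriminator_lsgan_loss(real_scores, fake_scores):
    """First line of Eq. 6: batch mean of (D(x) - 1)^2 + D(G(m))^2."""
    terms = [(r - 1.0) ** 2 + f ** 2 for r, f in zip(real_scores, fake_scores)]
    return sum(terms) / len(terms)

def generator_lsgan_loss(fake_scores):
    """Second line of Eq. 6: batch mean of (D(G(m)) - 1)^2."""
    terms = [(f - 1.0) ** 2 for f in fake_scores]
    return sum(terms) / len(terms)

def adahifigan_generator_objective(adv, dist, fm, mel,
                                   lambda_cd=LAMBDA_CD,
                                   lambda_fm=LAMBDA_FM,
                                   lambda_mel=LAMBDA_MEL):
    """Generator-side weighted sum from Eq. 7."""
    return adv + lambda_cd * dist + lambda_fm * fm + lambda_mel * mel

# Toy discriminator outputs (hypothetical values).
real_scores = [1.0, 1.0]
fake_scores = [0.0, 0.0]

d_loss = discriminator_lsgan_loss(real_scores, fake_scores)
g_adv = generator_lsgan_loss(fake_scores)
g_total = adahifigan_generator_objective(g_adv, dist=0.0, fm=0.5, mel=0.0)
print(d_loss, g_adv, g_total)
```

Because $\lambda_{cd}$ is much larger than the other coefficients, even a small $\mathcal{L}_{dist}$ value contributes noticeably to the total, which is consistent with the loss acting as a strong regularizer during few-shot fine-tuning.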

3. Experiments

- Settings

Dataset : AISHELL3, CSMSC
Comparisons : MelGAN, HiFi-GAN

- Results

Comparing the transfer methods on the AISHELL3 dataset, the proposed method outperforms conventional GAN transfer learning

Likewise, the proposed transfer method also performs better on the CSMSC dataset

When the fine-tuned AdaHiFi-GAN is used as the vocoder on top of AdaSpeech, good end-to-end custom voice synthesis is achieved with only 10 recordings
