[Paper Review] VocGAN: A High-Fidelity Real-Time Vocoder with Hierarchically-nested Adversarial Network

feVeRin 2024. 5. 6. 10:27

VocGAN: A High-Fidelity Real-Time Vocoder with a Hierarchically-nested Adversarial Network

GAN-based vocoders can synthesize in real time, but they often generate waveforms that are inconsistent with the acoustic characteristics of the input mel-spectrogram
VocGAN
- Improves the quality and consistency of the output waveform while keeping the synthesis speed of GAN-based vocoders
- Learns acoustic properties at multiple levels using a multi-scale waveform generator and a hierarchically-nested discriminator
- Enables high-resolution synthesis through a joint conditional and unconditional objective
Paper (INTERSPEECH 2020)

1. Introduction

Neural vocoders have made high-fidelity end-to-end speech synthesis possible
- Autoregressive models such as WaveNet are very slow at inference, so non-autoregressive models are mainly used
  - BUT, flow-based methods such as WaveGlow give high-quality synthesis yet are still hard to use for real-time synthesis
- Generative adversarial network (GAN)-based vocoders are another option
  - Notably, Parallel WaveGAN uses a WaveNet-based generator, and MelGAN achieves real-time synthesis with a lightweight network
  - BUT, although GAN-based vocoders synthesize quickly, their quality still leaves room for improvement
- In particular, MelGAN degrades both the low-frequency component (fundamental frequency $F_{0}$) and the high-frequency component (noise)
  - In addition, MelGAN often generates waveforms that are inconsistent with the acoustic characteristics of the input mel-spectrogram
- That is, MelGAN's network architecture and objective are not well suited to learning the acoustic representation of audio signals

-> Hence the paper proposes VocGAN, a GAN-based vocoder that keeps MelGAN's synthesis speed while improving the consistency between the input and output mel-spectrograms

VocGAN
- Extends the MelGAN generator to output several waveforms at different scales, and trains the generator with adversarial losses computed by resolution-specific discriminators
  - This hierarchical structure lets the model learn acoustic properties at different levels in a balanced way
- Applies a Joint Conditional and Unconditional (JCU) loss to each resolution-specific discriminator to help high-resolution synthesis
- Additionally combines an STFT loss to improve output quality

< Overview of VocGAN >

Learns acoustic properties at multiple levels using a multi-scale waveform generator and a hierarchically-nested discriminator
Enables high-resolution synthesis through a joint conditional and unconditional objective
As a result, it achieves high-quality synthesis while keeping a fast synthesis speed

2. Method

- Baseline Model

VocGAN uses MelGAN as its baseline for real-time synthesis and modifies MelGAN's structure and objective to improve quality
- The MelGAN generator is a fully convolutional feed-forward network made of 4 upsampling blocks with upsampling rates of $8,8,2,2$
- Each upsampling block contains a transposed convolution and a residual stack made of 3 dilated convolutions with residual connections
- The generator is trained with a multi-scale discriminator, which computes window-based objectives on several waveforms downsampled from the output waveform to different scales
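
As a quick arithmetic check of the baseline, the product of the four upsampling rates gives the number of waveform samples the generator emits per mel-spectrogram frame. The sketch below only multiplies the rates listed above; the function names and the 100-frame input are illustrative assumptions.

```python
from math import prod

# Upsampling rates of the four MelGAN blocks, as listed above.
MELGAN_RATES = [8, 8, 2, 2]


def total_upsampling(rates):
    """Number of waveform samples generated per mel-spectrogram frame."""
    return prod(rates)


def output_length(num_frames, rates):
    """Waveform length for a mel-spectrogram with `num_frames` frames."""
    return num_frames * total_upsampling(rates)


print(total_upsampling(MELGAN_RATES))    # 256
print(output_length(100, MELGAN_RATES))  # 25600
```

Any modified generator has to keep this overall factor if it is to consume the same mel-spectrograms.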

- Multi-Scale Waveform Generator

Hierarchically-nested adversarial objectives are effective for high-resolution synthesis
- VocGAN regularizes the intermediate representations by encouraging the generator to synthesize waveforms at several resolutions
  - This lets the generator synthesize low-frequency components effectively, not only high-frequency ones
  - As a result, the synthesis quality of the raw waveform improves
- To use the hierarchically-nested objective, the MelGAN generator and discriminator have to be modified
  1. As in the figure below, the modified VocGAN generator consists of 6 upsampling blocks
    - The first 2 upsampling blocks have an upsampling rate of 4 and the remaining blocks have a rate of 2
  2. Besides the final full-resolution waveform, the generator also emits downsampled waveforms for several $k\,\, (1\leq k\leq K)$ as side outputs
    - The resolution of each of these waveforms is $\frac{1}{2^{k}}$ of the full resolution
    - $K$ : number of downsampled waveforms, set to $K=4$ in the paper
  3. The full-resolution waveform and the $K$ downsampled waveforms are then produced from the outputs of the top-five upsampling blocks through a convolution layer
  4. As a result, the waveforms produced by the generator are:
    (Eq. 1) $\hat{x}_{0},...,\hat{x}_{K}=G(s)$
    - $s$ : input mel-spectrogram, $\hat{x}_{0}$ : final full-resolution waveform, $\hat{x}_{1},...,\hat{x}_{K}$ : downsampled side waveforms
- In addition, to learn intermediate representations conditioned directly on the input mel-spectrogram, skip connections are added from the input mel-spectrogram to each $2\times$ upsampling block
  - This improves the consistency between the input mel-spectrogram and the acoustic characteristics of the output
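
The block layout above can be checked with a few lines of arithmetic. This is a minimal sketch, assuming that side output $k$ is tapped after block $6-k$ (inferred from the rates, not stated in the text); `side_output_scales` is a hypothetical helper name.

```python
from math import prod

VOCGAN_RATES = [4, 4, 2, 2, 2, 2]  # six upsampling blocks, as described above
K = 4                              # number of downsampled side outputs


def side_output_scales(rates, num_side_outputs):
    """Map k -> (tapped block, samples per mel frame) for k = 0..K.

    k = 0 is the full-resolution waveform from the last block. Assumption:
    each larger k taps one block earlier, which yields 1 / 2**k of the
    full resolution because the last blocks all upsample by 2.
    """
    full = prod(rates)
    scales = {}
    for k in range(num_side_outputs + 1):
        block = len(rates) - k                   # 1-based block index
        samples_per_frame = prod(rates[:block])
        assert samples_per_frame * 2 ** k == full
        scales[k] = (block, samples_per_frame)
    return scales


for k, (block, spf) in side_output_scales(VOCGAN_RATES, K).items():
    print(f"k={k}: block {block}, {spf} samples per frame")
```

Under this assumption the five waveforms come from blocks 6, 5, 4, 3 and 2 with 256, 128, 64, 32 and 16 samples per frame, so the overall upsampling factor matches the $8\cdot 8\cdot 2\cdot 2=256$ of the MelGAN baseline.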

- Hierarchically-nested JCU Discriminator

Hierarchically-nested Structure
- VocGAN's hierarchically-nested discriminator consists of 5 resolution-specific discriminators
  1. Each discriminator decides whether the output waveform at its resolution is real or fake
  2. Through the multi-scale waveform generator, the hierarchically-nested discriminator learns the spectrogram-to-waveform mapping at 5 different resolutions
    - This lets the generator learn the mapping for both the low- and high-frequency components of the acoustic features
  3. In addition, a JCU loss is applied to each resolution-specific discriminator, and a multi-scale JCU discriminator with a structure similar to the multi-scale discriminator is introduced
- The differences between the hierarchically-nested discriminator and the multi-scale discriminator are:
  1. The hierarchically-nested discriminator uses multiple reduced-resolution samples generated directly from the generator's intermediate representations
    - This lets several intermediate layers learn how to generate reduced-resolution waveforms
  2. In contrast, the multi-scale discriminator takes a single full-resolution waveform and then uses reduced-resolution versions obtained by downsampling that waveform at different sampling rates
  3. As a result, the hierarchically-nested discriminator can guide the generator to learn low- and high-frequency features in a more balanced way
- Each resolution-specific discriminator uses the following least-squares adversarial loss:
  (Eq. 2) $V_{k}(G,D_{k})=\frac{1}{2}\mathbb{E}_{s}\left[D_{k}(\hat{x}_{k})^{2}\right]+\frac{1}{2}\mathbb{E}_{x_{k}}\left[(D_{k}(x_{k})-1)^{2}\right]$
  - $x_{k}$ : $k$-downsampled ground-truth waveform, $V_{k}(G,D_{k})$ : objective function for the $k$-downsampled waveform
- Since the final discriminator is a multi-scale discriminator, its objective $V_{0}(G,D_{0})$ is defined as the sum over its sub-discriminators; the discriminator and generator losses are then:
  (Eq. 3) $L_{D}(G,D)=\sum_{k=0}^{K}V_{k}(G,D_{k})$
  (Eq. 4) $L_{G}(G,D)=\sum_{k=0}^{K}\frac{1}{2}\mathbb{E}_{s}\left[(D_{k}(\hat{x}_{k})-1)^{2}\right]$
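
Eq. 2 to Eq. 4 can be written out directly on lists of discriminator scores. This is a minimal sketch, not the authors' implementation: the expectations are replaced by batch means, and the function names and toy scores are assumptions for illustration.

```python
def mean(values):
    return sum(values) / len(values)


def lsgan_d_loss(fake_scores, real_scores):
    """Eq. 2 for one resolution k: push D_k(x_hat_k) to 0 and D_k(x_k) to 1."""
    fake_term = mean([d ** 2 for d in fake_scores])
    real_term = mean([(d - 1.0) ** 2 for d in real_scores])
    return 0.5 * fake_term + 0.5 * real_term


def lsgan_g_loss(fake_scores):
    """One summand of Eq. 4: push D_k(x_hat_k) towards 1."""
    return 0.5 * mean([(d - 1.0) ** 2 for d in fake_scores])


def total_d_loss(fake_by_k, real_by_k):
    """Eq. 3: sum of V_k over the resolutions k = 0..K."""
    return sum(lsgan_d_loss(f, r) for f, r in zip(fake_by_k, real_by_k))


def total_g_loss(fake_by_k):
    """Eq. 4: sum of the generator terms over k = 0..K."""
    return sum(lsgan_g_loss(f) for f in fake_by_k)


# Toy scores for K + 1 = 5 resolutions with a perfect discriminator.
fake = [[0.0, 0.0] for _ in range(5)]
real = [[1.0, 1.0] for _ in range(5)]
print(total_d_loss(fake, real))  # 0.0
print(total_g_loss(fake))        # 2.5
```

With a perfect discriminator the discriminator loss vanishes and every resolution contributes $\frac{1}{2}$ to the generator loss, so all five outputs receive a training signal.
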
Joint Conditional and Unconditional Loss
- To improve speech quality further, a JCU loss is combined with the hierarchically-nested adversarial objective
- The JCU loss combines the conditional and unconditional adversarial losses:
  (Eq. 5) $V_{k}^{JCU}(G,D_{k})=\frac{1}{2}\mathbb{E}_{s}\left[D_{k}(\hat{x}_{k})^{2}+D_{k}(\hat{x}_{k},s)^{2}\right]+\frac{1}{2}\mathbb{E}_{(s,x_{k})}\left[(D_{k}(x_{k})-1)^{2}+(D_{k}(x_{k},s)-1)^{2}\right]$
  (Eq. 6) $L_{D}^{JCU}(G,D)=\sum_{k=0}^{K}V_{k}^{JCU}(G,D_{k})$
  (Eq. 7) $L_{G}^{JCU}(G,D)=\sum_{k=0}^{K}\frac{1}{2}\mathbb{E}_{s}\left[(D_{k}(\hat{x}_{k})-1)^{2}+(D_{k}(\hat{x}_{k},s)-1)^{2}\right]$
- The conditional loss guides the generator to map the acoustic features of the input mel-spectrogram to the waveform more accurately
  - As a result, the discrepancy between the input mel-spectrogram and the output waveform is reduced
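
The effect of the conditional term is easiest to see on scalar scores. The sketch below evaluates Eq. 5 and one summand of Eq. 7 for a single resolution, assuming each discriminator returns an unconditional score $D_{k}(x)$ and a conditional score $D_{k}(x,s)$ per sample; the function names and the scores are made up for illustration.

```python
def mean(values):
    return sum(values) / len(values)


def jcu_d_loss(fake_uncond, fake_cond, real_uncond, real_cond):
    """Eq. 5 for one resolution k, with expectations as batch means."""
    fake_term = mean([u ** 2 + c ** 2
                      for u, c in zip(fake_uncond, fake_cond)])
    real_term = mean([(u - 1.0) ** 2 + (c - 1.0) ** 2
                      for u, c in zip(real_uncond, real_cond)])
    return 0.5 * fake_term + 0.5 * real_term


def jcu_g_loss(fake_uncond, fake_cond):
    """One summand of Eq. 7: both scores of the fake sample should reach 1."""
    return 0.5 * mean([(u - 1.0) ** 2 + (c - 1.0) ** 2
                       for u, c in zip(fake_uncond, fake_cond)])


# A waveform judged realistic on its own (unconditional score 1.0) but
# judged not to match its mel-spectrogram (conditional score 0.0).
print(jcu_g_loss([1.0], [0.0]))  # 0.5
# The same waveform when it also matches the conditioning.
print(jcu_g_loss([1.0], [1.0]))  # 0.0
```

An unconditional-only loss would already be zero in the first case, so only the conditional term penalizes the mismatch with the input mel-spectrogram.
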
Feature Matching Loss
- In addition, VocGAN applies a feature matching loss to every resolution-specific discriminator
- The feature matching loss is defined as the $L_{1}$ distance between the discriminator feature maps computed from the ground-truth waveform and from the synthesized waveform:
  (Eq. 8) $L_{FM}(G,D)=\mathbb{E}_{(s,x)}\left[\sum_{k=0}^{K}\sum_{t=1}^{T_{k}}\frac{1}{N_{t}}|| D_{k}^{(t)}(x_{k})-D_{k}^{(t)}(\hat{x}_{k})||_{1}\right]$
  - $T_{k}$ : total number of layers in the $k$-th resolution-specific discriminator
  - $N_{t}$ : number of elements in each layer
  - This stabilizes training
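
Eq. 8 for a single training pair reduces to a double loop over discriminators and layers. This is a minimal sketch, assuming each feature map has already been flattened to a list of floats; the nested-list layout and the toy values are illustrative assumptions.

```python
def l1_mean(real_feat, fake_feat):
    """(1 / N_t) * ||a - b||_1 for one flattened feature map."""
    return sum(abs(a - b) for a, b in zip(real_feat, fake_feat)) / len(real_feat)


def feature_matching_loss(real_feats, fake_feats):
    """Eq. 8 for a single (s, x) pair.

    real_feats[k][t] and fake_feats[k][t] hold the flattened feature map of
    layer t of discriminator D_k for the ground-truth and generated waveform.
    """
    total = 0.0
    for real_k, fake_k in zip(real_feats, fake_feats):
        for real_t, fake_t in zip(real_k, fake_k):
            total += l1_mean(real_t, fake_t)
    return total


# One discriminator with two layers of different sizes.
real = [[[1.0, 2.0], [0.0, 0.0, 0.0, 0.0]]]
fake = [[[1.0, 4.0], [1.0, 1.0, 1.0, 1.0]]]
print(feature_matching_loss(real, fake))  # 2.0
```

Dividing by $N_{t}$ keeps wide layers from dominating the sum: both layers above contribute 1.0 even though the second has twice as many elements.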

- Multi-Resolution STFT Loss

To improve the stability of adversarial training, the multi-resolution STFT loss from Parallel WaveGAN is used as an auxiliary loss
- This loss speeds up training convergence, and it is applied to the generator independently of the adversarial objective
- A single STFT loss measures the frame-level difference between the ground-truth and the synthesized full-resolution waveform
  - The multi-resolution STFT loss $L_{STFT}$ is then defined as the sum of several STFT losses that each use a different FFT size, window size, and frame shift
- Combining all of the losses above, the total objective of VocGAN is:
  (Eq. 9) $L_{G}^{total}(G,D) =L_{G}^{JCU}(G,D)+\alpha L_{FM}(G,D) +\beta L_{STFT}(G)$
  - $\alpha=10, \beta=1$
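
A rough sketch of the auxiliary loss and of Eq. 9 follows. The review does not spell out the single STFT loss, so the sketch assumes the Parallel WaveGAN form (spectral convergence plus a log-magnitude $L_{1}$ term). The naive DFT, the function names, and the toy resolutions are illustrative assumptions rather than the paper's settings.

```python
import cmath
import math


def stft_magnitude(signal, fft_size, win_size, hop):
    """Flat list of Hann-windowed STFT magnitudes via a naive DFT (slow)."""
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / win_size)
              for n in range(win_size)]
    magnitudes = []
    for start in range(0, len(signal) - win_size + 1, hop):
        frame = [signal[start + n] * window[n] for n in range(win_size)]
        for f in range(fft_size // 2 + 1):
            # Samples beyond win_size are zero padding and add nothing.
            acc = sum(frame[n] * cmath.exp(-2j * math.pi * f * n / fft_size)
                      for n in range(win_size))
            magnitudes.append(abs(acc))
    return magnitudes


def single_stft_loss(real, fake, fft_size, win_size, hop, eps=1e-7):
    """Assumed form: spectral convergence + mean log-magnitude L1 distance."""
    mag_r = stft_magnitude(real, fft_size, win_size, hop)
    mag_f = stft_magnitude(fake, fft_size, win_size, hop)
    diff_norm = math.sqrt(sum((r - f) ** 2 for r, f in zip(mag_r, mag_f)))
    real_norm = math.sqrt(sum(r ** 2 for r in mag_r))
    spectral_convergence = diff_norm / (real_norm + eps)
    log_magnitude = sum(abs(math.log(r + eps) - math.log(f + eps))
                        for r, f in zip(mag_r, mag_f)) / len(mag_r)
    return spectral_convergence + log_magnitude


def multi_resolution_stft_loss(real, fake, resolutions):
    """Sum of single STFT losses over (fft_size, win_size, hop) settings."""
    return sum(single_stft_loss(real, fake, n, w, h) for n, w, h in resolutions)


def total_generator_loss(l_jcu, l_fm, l_stft, alpha=10.0, beta=1.0):
    """Eq. 9 with the weights given above."""
    return l_jcu + alpha * l_fm + beta * l_stft


x = [math.sin(2 * math.pi * 5 * n / 64) for n in range(64)]
x_hat = [0.5 * v for v in x]               # toy "synthesized" waveform
resolutions = [(16, 12, 4), (32, 24, 8)]   # hypothetical toy settings
l_stft = multi_resolution_stft_loss(x, x_hat, resolutions)
print(total_generator_loss(l_jcu=1.0, l_fm=0.5, l_stft=l_stft))
```

In practice an FFT routine would replace the naive DFT. The loop is kept explicit here so that the per-frame comparison is visible.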

3. Experiments

- Settings

Dataset : KSS, LJSpeech
Comparisons : MelGAN, Parallel WaveGAN

- Results

Ablation Study
- Applying the hierarchically-nested objective and structure to the MelGAN baseline improves every performance metric
- Using the JCU loss also improves performance
- When combined with the hierarchically-nested loss, adding the STFT loss is useful for improving performance

Looking at the $F_{0}$ trajectories of the synthesized waveforms, VocGAN's result is closer to the ground truth

$F_{0}$ trajectory comparison: (left) MelGAN, (right) VocGAN

Comparison with Existing Models
- In overall synthesis quality, VocGAN outperforms the other models
- In inference speed, VocGAN is somewhat slower than MelGAN but much faster than Parallel WaveGAN
