[Paper Review] Avocodo: Generative Adversarial Network for Artifact-Free Vocoder

feVeRin 2024. 2. 16. 10:56

Avocodo: Generative Adversarial Network for Artifact-Free Vocoder

Generative Adversarial Network (GAN)-based vocoders can synthesize high-quality speech
- Because most speech components are concentrated in the low-frequency band, they rely on multi-scale analysis of downsampled waveforms
BUT, multi-scale analysis is prone to producing unintended artifacts
Avocodo
- A GAN-based vocoder that reduces artifacts to enable high-quality synthesis
- Uses a collaborative multi-band discriminator and a sub-band discriminator
- Uses a Pseudo Quadrature Mirror Filter-bank to avoid aliasing when obtaining downsampled multi-band speech waveforms
Paper (AAAI 2023) : Paper Link

1. Introduction

A vocoder converts acoustic features into a waveform
- GAN-based vocoders in particular offer high synthesis quality and fast synthesis speed
  - The generator converts input features such as random noise or a mel-spectrogram into a speech waveform
  - The discriminator evaluates the generated waveform
- Since the low-frequency band generally has a large impact on perceptual quality, GAN-based vocoders use multi-scale analysis, which evaluates downsampled speech waveforms
  - For example, MelGAN uses a Multi-Scale Discriminator (MSD)
  - HiFi-GAN introduces a Multi-Period Discriminator (MPD)
- BUT, GAN-based vocoders have two problems
  1. Artifacts caused by the upsampling layers
    - Artifacts in the high-frequency band produce noise and consequently degrade synthesis quality
  2. Reduced reproducibility of harmonic components
    - The fundamental frequency $F_{0}$ becomes inaccurate due to aliasing when simple downsampling such as average pooling or equally spaced sampling is applied
    - As a result, perceptual quality degrades when synthesizing speech with large pitch variation

-> The paper therefore proposes Avocodo, a GAN-based vocoder that minimizes artifacts and enables high-quality speech synthesis

Avocodo
- Uses two discriminators to suppress artifacts
- Collaborative Multi-Band Discriminator (CoMBD)
  - Supports multi-scale analysis and suppresses upsampling artifacts
  - Discriminates both the full-resolution waveform and the intermediate outputs
- Sub-Band Discriminator (SBD)
  - Discriminates frequency-wise decomposed waveforms
  - Uses a Pseudo Quadrature Mirror Filter-bank (PQMF) with high stopband attenuation instead of simple downsampling, which causes aliasing

< Overview of Avocodo >

A GAN-based vocoder that reduces artifacts to enable high-quality synthesis
Two discriminators, CoMBD and SBD, suppress artifacts and aliasing
In the paper's evaluations, Avocodo outperforms the compared baselines on both objective and subjective measures

2. Artifacts in GAN-Based Vocoders

- Upsampling Artifacts

GAN-based vocoders use upsampling layers to raise the rate of input features such as a mel-spectrogram up to the sampling rate of the waveform
- The commonly used transposed convolution upsampling layer produces several artifacts
  - Tonal artifacts appear as horizontal lines in the spectrogram
  - Imaging artifacts are mirrored low frequencies observed in the high-frequency band
- For example,
  1. In digital signal processing,
    - Upsampling is performed by inserting zeros between neighboring samples and then applying a low-pass filter
    - Without the filter, the spectrum repeats at multiples of the original sampling rate, so low-frequency components show up in the high-frequency band
  2. HiFi-GAN also uses transposed convolution upsampling, so
    - Unintended frequency components such as imaging artifacts appear, distorting the high-frequency band and degrading synthesis quality
- Fixing this would require modifying the upsampling layer structure, which has the drawback of increasing model complexity
  -> Avocodo therefore designs new discriminators and loss functions that suppress artifacts without modifying the upsampling layers
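
The imaging effect of zero insertion can be checked numerically. The following is a minimal sketch in plain Python, not code from the paper: the helper names `dft_magnitudes` and `zero_insert` and the toy signal (a cosine with 4 cycles over 64 samples) are assumptions chosen for illustration.

```python
import cmath
import math


def dft_magnitudes(signal):
    """Return |X[m]| for m = 0 .. N/2 using a direct (slow) DFT."""
    n = len(signal)
    mags = []
    for m in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * m * i / n) for i, x in enumerate(signal))
        mags.append(abs(acc))
    return mags


def zero_insert(signal, factor):
    """Upsample by inserting (factor - 1) zeros after every sample, with no filtering."""
    out = []
    for x in signal:
        out.append(x)
        out.extend([0.0] * (factor - 1))
    return out


# 64 samples of a cosine that completes 4 cycles: its energy sits in DFT bin 4.
low_rate = [math.cos(2 * math.pi * 4 * i / 64) for i in range(64)]
upsampled = zero_insert(low_rate, 2)   # 128 samples at twice the rate
mags = dft_magnitudes(upsampled)       # bins 0..64 of the 128-point DFT

# Bins that carry significant energy after unfiltered upsampling.
peaks = [m for m, v in enumerate(mags) if v > 1.0]
print(peaks)
```

Bin 4 is the original tone. The second peak is its mirror image around the old Nyquist frequency (bin 32), which is exactly the component a low-pass filter after zero insertion is supposed to remove.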

- Aliasing in Downsampling

GAN-based vocoders use discriminators to evaluate downsampled waveforms and learn the spectral information of the low-frequency band
- Average pooling and equally spaced sampling are commonly used as efficient ways to obtain a band-limited waveform
- BUT, aliasing still appears in the waveforms downsampled with these methods
  1. With equally spaced sampling,
    - High-frequency components that should have been removed fold back and distort the harmonic frequency components in the low-frequency band
  2. With average pooling,
    - Harmonic components above 800Hz are observed to be distorted
- Consequently, artifacts grow as the downsampling factor increases, making it hard for the model to generate accurate waveforms
- Solving this requires downsampling with a band-pass filter that has high stopband attenuation
  -> Avocodo therefore adopts PQMF, a digital filter, to preserve harmonics during downsampling
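
Both failure modes can be reproduced with a few lines of arithmetic. The sketch below is illustrative only: the 16 kHz rate, the factor of 4, the test frequencies, and the helper names `folded_frequency` and `moving_average_gain` are assumptions, not settings from the paper.

```python
import math


def folded_frequency(freq_hz, new_rate_hz):
    """Frequency at which a tone appears after sampling at new_rate_hz without filtering."""
    f = freq_hz % new_rate_hz
    return new_rate_hz - f if f > new_rate_hz / 2 else f


def moving_average_gain(freq_hz, rate_hz, kernel):
    """Magnitude response of a length-`kernel` average (the filter inside average pooling)."""
    half = math.pi * freq_hz / rate_hz
    if math.sin(half) == 0:
        return 1.0
    return abs(math.sin(kernel * half) / (kernel * math.sin(half)))


RATE = 16000
FACTOR = 4
NEW_RATE = RATE // FACTOR  # 4000 Hz, so the new Nyquist frequency is 2000 Hz

# Equally spaced sampling: keep every 4th sample of a 5000 Hz tone.
tone = [math.cos(2 * math.pi * 5000 * n / RATE) for n in range(64)]
decimated = tone[::FACTOR]
alias = [math.cos(2 * math.pi * 1000 * n / NEW_RATE) for n in range(len(decimated))]
max_diff = max(abs(a - b) for a, b in zip(decimated, alias))  # numerically zero

# Average pooling with kernel 4: in-band and out-of-band gains.
gain_in_band = moving_average_gain(1000, RATE, FACTOR)      # roughly 0.91
gain_out_of_band = moving_average_gain(3000, RATE, FACTOR)  # roughly 0.32
```

The decimated 5000 Hz tone is sample-for-sample identical to a 1000 Hz tone at the new rate, which is the fold-back described above. The moving average attenuates a 1000 Hz component that should be kept, and it still passes about a third of a 3000 Hz component, which then folds onto 1000 Hz after striding.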

3. Method

Avocodo consists of a single generator and two discriminators
- The generator takes a mel-spectrogram as input and generates the full-resolution waveform along with intermediate outputs
- CoMBD discriminates the full-resolution waveform and its downsampled versions together with the intermediate outputs
  - Here PQMF serves as the low-pass filter that downsamples the full-resolution waveform
- SBD discriminates the sub-band signals obtained by PQMF analysis

- Generator

The generator follows the HiFi-GAN generator architecture and produces a multi-scale output consisting of the full-resolution waveform and intermediate waveforms
- It consists of 4 sub-blocks, 3 of which, $G_{k} (1 \leq k \leq 3)$, each generate a waveform $\hat{x}_{k}$ whose resolution is $\frac{1}{2^{3-k}}$ of the full resolution
  - $\hat{x}_{3}$ : full-resolution waveform, $\hat{x}_{2}, \hat{x}_{1}$ : intermediate outputs
- Each sub-block consists of Multi-Receptive field Fusion (MRF) blocks and a transposed convolution layer
  - The MRF block captures spatial features of its input
  - Unlike HiFi-GAN, a projection layer is added after each sub-block to obtain the intermediate outputs
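
As a quick worked example of the $\frac{1}{2^{3-k}}$ rule, the sketch below lists the length and the highest representable frequency of each output. The waveform length of 8192 samples, the 16 kHz sampling rate, and the function names are hypothetical values picked for illustration.

```python
def multi_scale_lengths(full_length, num_outputs=3):
    """Length of x_hat_k for k = 1..num_outputs, at 1 / 2**(num_outputs - k) of full resolution."""
    return {k: full_length // 2 ** (num_outputs - k) for k in range(1, num_outputs + 1)}


def nyquist_per_output(sample_rate_hz, num_outputs=3):
    """Highest frequency (Hz) each output can represent at its reduced sampling rate."""
    return {k: sample_rate_hz / 2 ** (num_outputs - k) / 2 for k in range(1, num_outputs + 1)}


lengths = multi_scale_lengths(8192)   # {1: 2048, 2: 4096, 3: 8192}
limits = nyquist_per_output(16000)    # {1: 2000.0, 2: 4000.0, 3: 8000.0}
```

Under these assumptions, $\hat{x}_{1}$ can only carry content below 2 kHz and $\hat{x}_{2}$ below 4 kHz, which is why the intermediate outputs are naturally band-limited.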

- Collaborative Multi-Band Discriminator (CoMBD)

CoMBD discriminates the multi-scale outputs of the generator
- To do so, it consists of identical sub-models that evaluate the waveform at different resolutions
  - Each sub-model uses the discriminator module of MSD
  - This module consists of fully-convolutional layers with leaky ReLU activations
- For CoMBD, Avocodo combines a multi-scale structure with a hierarchical structure
  - This collaborative structure helps the generator reduce artifacts and synthesize high-quality waveforms
  1. The multi-scale structure discriminates not only the full-resolution waveform but also the downsampled waveforms
    - This lets the generator focus on the spectral features of the low-frequency band
  2. The hierarchical structure uses the intermediate waveforms so that the generator learns acoustic properties at various levels
    - Each generator sub-block is trained to produce a band-limited waveform, so the sub-blocks learn to expand and filter in a balanced way
    - As a result, the hierarchical structure suppresses upsampling artifacts
- For the collaborative structure, the low-resolution sub-models such as CoMBD$_{1}$ and CoMBD$_{2}$ take both the intermediate output $\hat{x}$ and the downsampled waveform $\hat{x}'$ as input
  - At each resolution, the two inputs share the same sub-module
  - For example, the intermediate output $\hat{x}_{2}$ and the downsampled waveform $\hat{x}'_{2}$ both pass through CoMBD$_{2}$ with shared weights, yielding the outputs $p_{2}$ and $p'_{2}$ respectively
  - Thanks to this weight sharing, combining the two structures requires no additional parameters
- To further reduce artifacts and improve speech quality, a differentiable PQMF is introduced to obtain downsampled waveforms with restricted aliasing
  1. PQMF analysis decomposes the full-resolution speech waveform into $K$ sub-band signals $B_{K}$
    - $B_{K}$ consists of single-band signals $b_{1},... ,b_{K}$, each of length $\frac{T}{K}$
    - $T$ : full-resolution waveform length
  2. The first sub-band signal $b_{1}$, corresponding to the lowest frequency band, is then selected as the downsampled waveform
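
The weight sharing can be sketched with a toy model. This is an illustration under stated assumptions, not the paper's architecture: `ToySubModel` is a hypothetical stand-in (a single 3-tap convolution) for the MSD-style sub-model, and the input waveforms are placeholder constants where Avocodo would use the generator outputs and the PQMF-downsampled waveforms.

```python
class ToySubModel:
    """Hypothetical stand-in for one CoMBD sub-model: a single 1-D convolution kernel."""

    def __init__(self, kernel):
        self.kernel = list(kernel)

    def num_parameters(self):
        return len(self.kernel)

    def __call__(self, waveform):
        k = len(self.kernel)
        return [sum(w * waveform[i + j] for j, w in enumerate(self.kernel))
                for i in range(len(waveform) - k + 1)]


def combd_forward(sub_models, intermediates, downsampled):
    """Apply sub-model k to x_hat_k and, where available, to x_hat'_k with the same weights."""
    outputs = {}
    for k in sorted(sub_models):
        model = sub_models[k]
        outputs[f"p{k}"] = model(intermediates[k])
        if k in downsampled:
            outputs[f"p'{k}"] = model(downsampled[k])  # same object, so no extra parameters
    return outputs


T = 32  # placeholder full-resolution length
sub_models = {k: ToySubModel([0.5, 0.0, -0.5]) for k in (1, 2, 3)}
intermediates = {1: [0.0] * (T // 4), 2: [0.0] * (T // 2), 3: [0.0] * T}  # generator outputs
downsampled = {1: [1.0] * (T // 4), 2: [1.0] * (T // 2)}                  # lowest PQMF band
outputs = combd_forward(sub_models, intermediates, downsampled)
shared_parameters = sum(m.num_parameters() for m in sub_models.values())
```

Three sub-models produce five outputs ($p_{1}, p'_{1}, p_{2}, p'_{2}, p_{3}$), while the parameter count stays that of three sub-models. Matching the lengths also implies $K = 2^{3-k}$ bands for level $k$, since $b_{1}$ has length $\frac{T}{K}$; that pairing is an inference from the definitions above.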

- Sub-Band Discriminator (SBD)

SBD is introduced to discriminate the multiple sub-band signals obtained by PQMF analysis
- PQMF makes the $n$-th sub-band signal $b_{n}$ contain the frequency information in the range from $(n-1)f_{s}/2N$ to $nf_{s}/2N$
  - $f_{s}$ : sampling frequency, $N$ : number of sub-bands
  - Based on this, the SBD sub-modules discriminate the features of sub-band signals over various ranges
- SBD consists of two kinds of sub-modules
  1. tSBD captures spectral features along the time axis
    - It takes $B_{N}$ as input and performs time-domain convolution
    - By diversifying the sub-band ranges, it can learn the characteristics of specific frequency ranges
    - In other words, tSBD$_{k}$ takes a specific range of sub-band signals $b_{i_{k}:j_{k}}$ as input
  2. fSBD captures the relationships between the sub-band signals
    - It takes $B_{M}^{T}$, the transpose of the $M$-channel sub-band signal $B_{M}$, as input
- Each SBD sub-module consists of stacked multi-scale dilated convolution banks to evaluate the sub-band signals
  - A dilated convolution bank consists of convolution layers with different dilation rates
  - The SBD architecture follows the inductive bias that accurate analysis of a waveform requires a variety of receptive fields
  - Consequently, each sub-module needs different dilation factors
- Structurally, SBD is similar to the Filter-Bank Random Window Discriminator (FB-RWD) of StyleMelGAN
  - Both SBD and FB-RWD make use of PQMF
  - However, each SBD sub-module evaluates a different range of sub-band signals, and SBD uses more kinds of blocks, e.g. for the lower frequency bands and for the whole frequency range
  - As a result, the paper argues that SBD evaluates the signal more effectively than FB-RWD
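
The band layout of a PQMF analysis can be sketched with a cosine-modulated filter bank. This is a minimal plain-Python illustration under several assumptions: the prototype is a 63-tap Hamming-windowed sinc used as a simple stand-in for the paper's PQMF design, and the 16 kHz rate, 4 bands, test tones, and function names are all hypothetical.

```python
import math


def prototype_lowpass(num_taps, cutoff):
    """Hamming-windowed sinc low-pass filter; cutoff is in radians per sample."""
    mid = (num_taps - 1) / 2
    taps = []
    for n in range(num_taps):
        m = n - mid
        ideal = cutoff / math.pi if m == 0 else math.sin(cutoff * m) / (math.pi * m)
        window = 0.54 - 0.46 * math.cos(2 * math.pi * n / (num_taps - 1))
        taps.append(ideal * window)
    return taps


def analysis_filters(num_bands, num_taps=63):
    """Cosine-modulate the prototype so that filter k passes [k, k + 1] * pi / num_bands."""
    proto = prototype_lowpass(num_taps, math.pi / (2 * num_bands))
    mid = (num_taps - 1) / 2
    bank = []
    for k in range(num_bands):
        center = (2 * k + 1) * math.pi / (2 * num_bands)
        phase = (-1) ** k * math.pi / 4
        bank.append([2 * p * math.cos(center * (n - mid) + phase) for n, p in enumerate(proto)])
    return bank


def pqmf_analysis(signal, num_bands, num_taps=63):
    """Filter with each band filter (fully overlapping part only), then keep every num_bands-th sample."""
    bands = []
    for h in analysis_filters(num_bands, num_taps):
        filtered = [sum(h[j] * signal[i + num_taps - 1 - j] for j in range(num_taps))
                    for i in range(len(signal) - num_taps + 1)]
        bands.append(filtered[::num_bands])
    return bands


def band_energies(freq_hz, rate_hz=16000, num_bands=4, length=512):
    """Energy of each sub-band signal for a pure tone at freq_hz."""
    tone = [math.cos(2 * math.pi * freq_hz * n / rate_hz) for n in range(length)]
    return [sum(v * v for v in band) for band in pqmf_analysis(tone, num_bands)]


low = band_energies(500)     # falls in b_1 (0 to 2000 Hz)
high = band_energies(7000)   # falls in b_4 (6000 to 8000 Hz)

# fSBD-style view: transpose (bands, time) into (time, bands).
bands = pqmf_analysis([0.0] * 512, 4)
transposed = list(zip(*bands))
```

With $f_{s}$ = 16 kHz and $N$ = 4, each band spans 2 kHz, so the 500 Hz tone lands almost entirely in $b_{1}$ and the 7 kHz tone in $b_{4}$. Because only the fully overlapping part of the convolution is kept, each sub-band here is slightly shorter than $\frac{T}{K}$.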

- Training Objectives

GAN Loss
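- The least-squares adversarial objective is used to train the GAN
  - It replaces the sigmoid cross-entropy term of the original GAN objective with a least-squares term for stable training
- Writing $V$ for the objective on the multi-scale outputs and $W$ for the objective on the downsampled waveforms, the GAN loss is: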
  (Eq. 1) $V(D_{k};G) = \mathbb{E}_{(x_{k},s)}\left[ (D_{k}(x_{k})-1)^{2}+(D_{k}(\hat{x}_{k}))^{2}\right]$
  (Eq. 2) $V(G;D_{k}) = \mathbb{E}_{s}\left[ (D_{k}(\hat{x}_{k})-1)^{2}\right]$
  (Eq. 3) $W(D_{k};G) = \mathbb{E}_{(x_{k},s)}\left[ (D_{k}(x_{k})-1)^{2}+(D_{k}(\hat{x}'_{k}))^{2}\right]$
  (Eq. 4) $W(G;D_{k}) = \mathbb{E}_{s}\left[ (D_{k}(\hat{x}'_{k})-1)^{2}\right]$
  - $x_{k}$ : $k$-th downsampled ground-truth waveform
  - $s$ : speech representation (mel-spectrogram)
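
Eq. 1 to Eq. 4 share one form, so a single pair of functions covers both $V$ and $W$. The sketch below works on plain lists of discriminator scores; the function names and the example scores are assumptions for illustration, and the expectation is approximated by a batch mean.

```python
def mean(values):
    values = list(values)
    return sum(values) / len(values)


def discriminator_loss(real_scores, fake_scores):
    """Eq. 1 / Eq. 3 for one sub-model: push D(x) toward 1 and D(x_hat) toward 0."""
    return mean((r - 1.0) ** 2 for r in real_scores) + mean(f ** 2 for f in fake_scores)


def generator_loss(fake_scores):
    """Eq. 2 / Eq. 4 for one sub-model: push D(x_hat) toward 1."""
    return mean((f - 1.0) ** 2 for f in fake_scores)


# A discriminator that separates real from fake perfectly.
perfect_d = discriminator_loss([1.0, 1.0], [0.0, 0.0])   # 0.0
fooled_g = generator_loss([0.0, 0.0])                    # 1.0

# A discriminator that cannot tell them apart and outputs 0.5 everywhere.
confused_d = discriminator_loss([0.5, 0.5], [0.5, 0.5])  # 0.5
confused_g = generator_loss([0.5, 0.5])                  # 0.25
```

For $W$, the same functions are called with the scores of the downsampled waveform $\hat{x}'_{k}$ in place of $\hat{x}_{k}$.
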
Feature Matching Loss
- The feature matching loss is a perceptual loss used in GAN training
- For each discriminator sub-module, it is computed as the $L1$ distance between the feature maps of the ground-truth and the predicted waveform
- The feature matching loss is therefore:
  (Eq. 5) $L_{fm}(G;D_{t}) = \mathbb{E}_{(x,s)}\left[ \sum_{t=1}^{T}\frac{1}{N_{t}}||D_{t}(x)-D_{t}(\hat{x})||_{1}\right]$
  - $T$ : number of layers in the sub-module
  - $D_{t}$ : $t$-th feature map, $N_{t}$ : number of elements in the $t$-th feature map
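
Eq. 5 for a single sample reduces to a per-layer mean absolute difference summed over layers. The sketch below assumes each feature map is already flattened into a list; the function name and the toy feature values are hypothetical.

```python
def feature_matching_loss(real_features, fake_features):
    """Sum over layers t of (1 / N_t) * L1 distance between the t-th feature maps."""
    total = 0.0
    for real_map, fake_map in zip(real_features, fake_features):
        total += sum(abs(r - f) for r, f in zip(real_map, fake_map)) / len(real_map)
    return total


# Two layers with N_1 = 2 and N_2 = 3 elements.
real = [[1.0, 2.0], [0.0, 0.0, 3.0]]
fake = [[1.0, 0.0], [0.0, 3.0, 3.0]]
loss = feature_matching_loss(real, fake)  # (0 + 2) / 2 + (0 + 3 + 0) / 3 = 2.0
```

Dividing by $N_{t}$ keeps layers with large feature maps from dominating the sum.
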
Reconstruction Loss
- A mel-spectrogram-based reconstruction loss improves the stability of waveform generation
- It is computed as the $L1$ distance between the mel-spectrograms of the ground-truth $x$ and the predicted $\hat{x}$
- The reconstruction loss is therefore:
  (Eq. 6) $L_{spec}(G)=\mathbb{E}_{(x,s)}\left[ ||\phi(x)-\phi(\hat{x})||_{1}\right]$
  - $\phi(\cdot)$ : mel-spectrogram transform function
Final Loss
- Combining the losses above, the final losses are:
  (Eq. 7) $L_{D}^{total} =\sum_{p=1}^{P}V(D_{p}^{C};G)+\sum_{p=1}^{P-1}W(D_{p}^{C};G)+\sum_{q=1}^{Q}V(D_{q}^{S};G)$
  (Eq. 8) $L_{G}^{total}=\sum_{p=1}^{P}\left[ V(G;D_{p}^{C})+\lambda_{fm}L_{fm}(G;D_{p}^{C})\right]+\sum_{p=1}^{P-1}\left[ W(G;D_{p}^{C})+\lambda_{fm}L_{fm}(G;D_{p}^{C})\right]$
  $\qquad\qquad\qquad +\sum_{q=1}^{Q}\left[V(G;D_{q}^{S})+\lambda_{fm}L_{fm}(G;D_{q}^{S})\right]+\lambda_{spec}L_{spec}(G)$
  - $D^{C}_{p}$ : $p$-th sub-module of CoMBD, $D^{S}_{q}$ : $q$-th sub-module of SBD
  - $\lambda_{fm}, \lambda_{spec}$ : scales of the feature matching loss and the reconstruction loss, respectively
  - The paper sets $\lambda_{fm} = 2, \lambda_{spec} = 45$
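
How the terms are combined can be sketched as a weighted sum over per-sub-module values. This sketch assumes, following Eq. 7, that the $W$ terms exist only for the $P-1$ low-resolution CoMBD sub-modules; the function names, $Q = 3$, and all numeric loss values are hypothetical placeholders.

```python
LAMBDA_FM = 2.0     # weight of the feature matching loss, as set in the paper
LAMBDA_SPEC = 45.0  # weight of the mel-spectrogram reconstruction loss, as set in the paper


def total_discriminator_loss(combd_v, combd_w, sbd_v):
    """Eq. 7: sum of adversarial terms over CoMBD (V and W) and SBD sub-modules."""
    return sum(combd_v) + sum(combd_w) + sum(sbd_v)


def total_generator_loss(combd_v, combd_w, sbd_v, spec_loss):
    """Eq. 8: each list holds (adversarial, feature_matching) pairs, one per sub-module."""
    total = LAMBDA_SPEC * spec_loss
    for adv, fm in combd_v + combd_w + sbd_v:
        total += adv + LAMBDA_FM * fm
    return total


P, Q = 3, 3  # Q = 3 is a placeholder
d_total = total_discriminator_loss([0.1] * P, [0.1] * (P - 1), [0.1] * Q)
g_total = total_generator_loss(
    combd_v=[(0.5, 0.1)] * P,        # intermediate and full-resolution outputs
    combd_w=[(0.5, 0.1)] * (P - 1),  # PQMF-downsampled waveforms
    sbd_v=[(0.5, 0.1)] * Q,
    spec_loss=0.2,
)
```

With these placeholder values, the eight adversarial and feature matching pairs contribute 0.7 each and the reconstruction term contributes 9.0, which shows how strongly $\lambda_{spec} = 45$ weights the mel-spectrogram loss.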

4. Experiments

- Settings

Dataset : LJSpeech, VCTK
Comparisons : HiFi-GAN, VocGAN, StyleMelGAN

- Results

In terms of MOS, Avocodo shows strong synthesis quality in both the single-speaker and unseen-speaker settings
- For unseen speakers in particular, suppressing artifacts makes it easier for Avocodo to learn generalized characteristics of high-quality waveforms

On the quantitative metrics, Avocodo likewise shows the best performance
- In particular, the LSD-HF results show that Avocodo greatly improves reproducibility in the high-frequency band because it prevents upsampling artifacts

Comparing discriminators, the Avocodo discriminators likewise lead to fewer artifacts and consequently to better synthesis quality

- Analysis on Artifact

Upsampling Artifact
- In HiFi-GAN, tonal and imaging artifacts appear because of the transposed convolutions
- In contrast, Avocodo shows none of these artifacts because, through CoMBD, the intermediate upsampling layers learn to remove them

Aliasing
- To examine the $F_{0}$ distortion caused by aliasing, GAN vocoders are trained on a dataset with a wide $F_{0}$ range
  - Here, the large-scale downsampling used to model low frequencies leads to incomplete $F_{0}$ reconstruction
- As the corresponding figure in the paper shows,
  - Harmonic components of a conventionally downsampled waveform are distorted by aliasing
  - A waveform downsampled with the anti-aliasing PQMF preserves $F_{0}$
- Consequently, HiFi-GAN, which learns from the distorted downsampled waveforms, fails to reconstruct $F_{0}$ above 750Hz
  - In contrast, Avocodo maintains the $F_{0}$ contour correctly
