[Paper Review] FA-GAN: Artifacts-Free and Phase-Aware High-Fidelity GAN-based Vocoder

feVeRin 2025. 1. 5. 10:31

FA-GAN: Artifacts-Free and Phase-Aware High-Fidelity GAN-based Vocoder

Generative Adversarial Network-based vocoders still suffer from noticeable spectral artifacts
FA-GAN
- Introduces an anti-aliased twin deconvolution module into the generator to suppress the aliasing artifacts caused by non-ideal upsampling layers
- Applies a fine-grained multi-resolution real/imaginary loss, which supports phase information modeling, to mitigate blurring artifacts and enrich spectral detail reconstruction
Paper (INTERSPEECH 2024) : Paper Link

1. Introduction

Generative Adversarial Network (GAN)-based vocoders can synthesize almost realistic audio
- BUT, a gap between ground-truth and generated samples still remains in the frequency domain:
  1. Aliasing artifacts that arise during upsampling
  2. Blurring artifacts caused by insufficient phase information and spectral detail in the frequency domain
- Aliasing artifacts in particular are common in GAN-based vocoders
  1. High-frequency modeling has to rely on upsampling layers to expand low-dimensional features into a high-dimensional waveform
  2. Typically, transposed convolution layers are used to obtain high-frequency components from a low-frequency spectrogram, but they often produce aliasing artifacts in the high-frequency domain
    - To address this artifact, Avocodo introduces a Collaborative Multi-band Discriminator (CoMBD) and a Sub-band Discriminator (SBD)
    - Alternatively, a low-pass filter can eliminate unwanted high-frequency components, as in BigVGAN
  3. BUT, these approaches are still limited in effectiveness with respect to discriminative capability
- Beyond GAN-specific aliasing artifacts, blurring artifacts or a lack of explicit harmonic detail can also occur
  1. A discriminator or auxiliary loss that improves the generator's modeling capability and sharpens the spectrogram can address this
  2. Typically, MelGAN uses a Multi-Scale Discriminator (MSD) and HiFi-GAN uses a Multi-Period Discriminator (MPD)
    - Multi-resolution spectral supervision is another option: the multi-resolution STFT loss of Parallel WaveGAN or the Multi-Resolution Discriminator (MRD) of UnivNet
  3. BUT, these approaches fully leverage magnitude information while neglecting phase information, so blurry artifacts often appear
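
The upsampling-induced aliasing described above can be reproduced with a toy signal. The sketch below is not from the paper: it upsamples a single sinusoid by zero-insertion (the operation implicit in a strided transposed convolution before its kernel is applied) and uses a naive DFT to check that a mirrored copy of the tone appears in the newly created high band. The signal length `N`, the tone bin `K` and the upsampling factor are arbitrary choices made for illustration.

```python
import cmath
import math


def dft_magnitude(signal):
    """Naive O(N^2) DFT magnitude, sufficient for a toy signal."""
    n = len(signal)
    return [
        abs(sum(v * cmath.exp(-2j * math.pi * k * t / n) for t, v in enumerate(signal)))
        for k in range(n)
    ]


def zero_stuff(signal, factor):
    """Upsample by inserting (factor - 1) zeros after every sample."""
    out = [0.0] * (len(signal) * factor)
    out[::factor] = signal
    return out


N = 32  # low-rate signal length (illustrative)
K = 3   # tone sits exactly on DFT bin K of the low-rate signal
low_rate = [math.cos(2 * math.pi * K * t / N) for t in range(N)]

upsampled = zero_stuff(low_rate, 2)          # length 2N, new Nyquist at bin N
magnitude = dft_magnitude(upsampled)
one_sided = magnitude[: len(upsampled) // 2 + 1]

# Bins that carry energy: the original tone plus its mirror image around the old Nyquist.
peaks = [k for k, m in enumerate(one_sided) if m > 1.0]
print(peaks)  # the tone at bin K and its image at bin N - K
```

The image at bin `N - K` lies above the old Nyquist bin (`N // 2`), which is the kind of unwanted high-frequency content that an ideal low-pass interpolation filter would remove and that a learned, non-ideal transposed convolution kernel may leave behind.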

-> Hence the paper proposes FA-GAN, a GAN-based vocoder for artifact-free, phase-aware synthesis

FA-GAN
- Introduces an anti-aliased twin deconvolution module into the generator to address aliasing artifacts
  - Improves the transposed convolution structure by computing the unwanted overlap at each position, which suppresses aliasing artifacts especially in the high-frequency area
- Additionally adopts a multi-resolution RI loss to address the phase mismatch problem, mitigating blurring artifacts and enriching spectral detail
  - The loss uses the real/imaginary parts, which can enhance both phase information and spectral modeling

< Overview of FA-GAN >

A GAN-based vocoder that is artifact-free and phase-aware, addressing both upsampling and blurring artifacts
As a result, it achieves better synthesis quality than existing vocoders while avoiding these artifacts

2. Method

FA-GAN consists of an anti-aliased generator and discriminators
- The discriminators include a multi-resolution global-level discriminator and multi-band local-level discriminators

- Anti-Aliased Generator

The generator backbone is based on HiFi-GAN, which uses transposed convolutions for resolution upsampling
- BUT, this structure can produce checkerboard artifacts because of its overlapping outputs
- The paper therefore introduces an upsampling module built from twin deconvolution branches to suppress the artifacts caused by non-ideal transposed convolution
  1. First, every upsampling layer of the generator is designed as a twin transposed convolution structure with two branches $\text{TDConv}_{1}, \text{TDConv}_{2}$
  2. The twin branch $\text{TDConv}_{2}$ runs in parallel with the original transposed convolution branch $\text{TDConv}_{1}$ and computes the degree of overlap at each position
    - Element-by-element division between the twin branches mitigates the unexpected artifacts of the upsampling layer
  3. In addition, an Anti-Aliased Multi-Periodicity (AMP) block with the snake activation function, defined as $f(x)=x+\sin^{2}(x)$, provides a periodic inductive bias for reconstruction
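
The overlap-normalization idea behind the twin branches can be sketched in plain Python. This is a minimal 1D illustration under assumptions, not the paper's module: the function names are hypothetical, and the overlap branch is assumed to be a transposed convolution of an all-ones input with an all-ones kernel, so that it simply counts how many kernel copies land on each output position.

```python
def transposed_conv1d(x, kernel, stride):
    """1D transposed convolution without padding: each input sample
    scatters a scaled copy of the kernel into the output."""
    out = [0.0] * ((len(x) - 1) * stride + len(kernel))
    for i, value in enumerate(x):
        for j, weight in enumerate(kernel):
            out[i * stride + j] += value * weight
    return out


def twin_deconv1d(x, kernel, stride, eps=1e-8):
    """Hypothetical twin structure: main branch divided element-wise
    by an overlap-counting branch (assumed all-ones input and kernel)."""
    main = transposed_conv1d(x, kernel, stride)
    overlap = transposed_conv1d([1.0] * len(x), [1.0] * len(kernel), stride)
    return [m / (o + eps) for m, o in zip(main, overlap)]


# Kernel size 3 is not divisible by stride 2, so outputs overlap unevenly.
x = [1.0] * 6
kernel = [1.0, 1.0, 1.0]

plain = transposed_conv1d(x, kernel, stride=2)
twin = twin_deconv1d(x, kernel, stride=2)

print(plain)  # alternating 1s and 2s: the checkerboard pattern
print(twin)   # approximately constant after dividing by the overlap
```

With a constant input, the plain branch alternates between two output levels purely because of the uneven overlap, while the normalized output is flat. In a trained network the main kernel is learned, so the division removes only the overlap-induced gain pattern rather than every artifact.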

- Global and Local Discriminators

FA-GAN introduces a multi-resolution global-level discriminator and multi-band local-level discriminators to mitigate blurring artifacts and enrich spectral detail
Multi-Resolution Global-level Discriminator
- A multi-resolution complex-spectrogram discriminator is adopted as the global-level discriminator to sharpen the full-band spectrogram structure
  - This is because the real/imaginary components are important for improving speech quality
- To fully leverage the speech information, a stack of real/imaginary spectrograms is used as the input feature of the global-level discriminator
  1. In addition, a multi-resolution RI loss function is designed to enforce fine-grained supervision in the frequency domain
  2. The discriminator consists of sub-discriminators that operate on the real/imaginary components extracted from 2D linear spectrograms at different STFT resolutions
    - STFT window lengths of $[2048,1024,512]$ and hop lengths of $[240, 120, 50]$ are used
Multi-band Local-level Discriminator
- A differentiable pseudo quadrature mirror filter (PQMF) bank splits the full-band waveform into sub-band signals with suppressed aliasing ($\text{Subband \#L}, \text{Subband \#M}, \text{Subband \#H}$)
- FA-GAN designs 3 local-level discriminators that learn discriminative features of the different sub-band signals
  1. Each discriminator is a stack of dilated convolutions with different dilation rates, so that diverse receptive fields are covered
  2. Through these local discriminators, FA-GAN can exploit discriminative features of different frequency ranges to enrich spectral detail
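
How the dilation rates control the receptive field of such a stack can be checked with a little arithmetic. The sketch below assumes stride-1 1D convolutions with kernel size 3; the dilation rates per sub-band are hypothetical placeholders, since the notes above do not list the actual values.

```python
def receptive_field(kernel_size, dilations):
    """Receptive field (in samples) of stacked stride-1 dilated 1D convolutions.
    Each layer widens the field by (kernel_size - 1) * dilation."""
    field = 1
    for dilation in dilations:
        field += (kernel_size - 1) * dilation
    return field


# Hypothetical dilation rates, chosen only to illustrate the effect.
hypothetical_stacks = {
    "Subband #L": [1, 2, 4],
    "Subband #M": [1, 3, 9],
    "Subband #H": [1, 4, 16],
}

fields = {name: receptive_field(3, rates) for name, rates in hypothetical_stacks.items()}
for name, field in fields.items():
    print(name, field)
```

Three layers with the same kernel size cover very different time spans depending on the dilation rates, which is why varying them across discriminators yields diverse receptive fields at the same depth and parameter count.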

- Training Objectives

The training loss consists of the multi-resolution RI loss, adversarial loss, Mel loss, and feature matching loss
Multi-Resolution RI Loss
- Existing GAN-based vocoders such as UnivNet and Avocodo use magnitude information to improve synthesis quality but overlook phase information
  - BUT, in that case phase information is only implicitly reconstructed during vocoder training, so phase mismatch can occur
- Meanwhile, a speech signal can be decomposed into real/imaginary parts through the STFT
  1. In particular, the real/imaginary components can be used when phase feature modeling is difficult because of the phase wrapping issue
  2. FA-GAN therefore introduces a loss function that enforces alignment between the real/imaginary parts of the real audio $x$ and those of the reconstruction $\hat{x}$
    - This fully utilizes the decomposition ability of the STFT to provide richer frequency information
- Extracting the real/imaginary parts of $x,\hat{x}$ through the STFT and applying spectral regularization with the $L_{1}$ norm gives the loss:
  (Eq. 1) $\{R_{x},I_{x}\}\leftarrow \text{STFT}(x),\,\,\, \{\hat{R}_{x},\hat{I}_{x}\}\leftarrow \text{STFT}(\hat{x})$
  (Eq. 2) $\mathcal{L}_{RI}(x,\hat{x})=\|\hat{R}_{x}-R_{x}\|_{1}+\|\hat{I}_{x}-I_{x}\|_{1} +\left\| \sqrt{\hat{R}^{2}_{x}+\hat{I}_{x}^{2}}-|\text{STFT}(x)|\right\|_{1}+ \frac{\|\text{STFT}(x)-\text{STFT}(\hat{x})\|_{F}}{\|\text{STFT}(x)\|_{F}}$
  - $\text{STFT}(\cdot)$ : the Short-Time Fourier Transform, which extracts the complex spectrogram
  - $R, I$ : the real/imaginary parts of an audio sample, respectively
- To model different scales of frequency information, this RI loss is extended to multiple resolutions with different analysis parameters (FFT size, frame shift, window size), which gives the multi-resolution RI loss:
  (Eq. 3) $\mathcal{L}_{MR\text{-}RI}(x,\hat{x})=\frac{1}{M}\sum_{m=1}^{M}\mathcal{L}_{RI}^{(m)}(x,\hat{x})$
  - $M$ : number of STFT resolutions, $\mathcal{L}_{RI}^{(m)}$ : RI loss at the $m$-th resolution
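
Eq. 2 and Eq. 3 can be sketched in dependency-free Python to make the terms concrete. This is an illustration under assumptions rather than the paper's implementation: the STFT uses a periodic Hann window with no padding, the $L_{1}$ terms are averaged over time-frequency bins instead of summed, and the toy resolutions are scaled far below real analysis sizes so that a naive DFT stays fast. All function names are hypothetical.

```python
import cmath
import math


def stft(signal, n_fft, hop):
    """Hann-windowed STFT without padding; returns frames of one-sided complex bins."""
    window = [0.5 - 0.5 * math.cos(2 * math.pi * t / n_fft) for t in range(n_fft)]
    frames = []
    for start in range(0, len(signal) - n_fft + 1, hop):
        segment = [signal[start + t] * window[t] for t in range(n_fft)]
        frames.append([
            sum(v * cmath.exp(-2j * math.pi * k * t / n_fft) for t, v in enumerate(segment))
            for k in range(n_fft // 2 + 1)
        ])
    return frames


def ri_loss(x, x_hat, n_fft, hop):
    """Eq. 2 at one resolution (L1 terms averaged over bins, an assumption)."""
    s = [c for frame in stft(x, n_fft, hop) for c in frame]
    s_hat = [c for frame in stft(x_hat, n_fft, hop) for c in frame]
    n = len(s)
    real_term = sum(abs(b.real - a.real) for a, b in zip(s, s_hat)) / n
    imag_term = sum(abs(b.imag - a.imag) for a, b in zip(s, s_hat)) / n
    mag_term = sum(abs(abs(b) - abs(a)) for a, b in zip(s, s_hat)) / n
    frob_num = math.sqrt(sum(abs(a - b) ** 2 for a, b in zip(s, s_hat)))
    frob_den = math.sqrt(sum(abs(a) ** 2 for a in s))
    return real_term + imag_term + mag_term + frob_num / frob_den


def mr_ri_loss(x, x_hat, resolutions):
    """Eq. 3: average of the single-resolution RI losses."""
    return sum(ri_loss(x, x_hat, n_fft, hop) for n_fft, hop in resolutions) / len(resolutions)


# Toy (n_fft, hop) pairs, scaled down for the naive DFT.
resolutions = [(64, 16), (32, 8), (16, 4)]
x = [math.sin(2 * math.pi * t / 16) for t in range(256)]
x_shifted = [math.cos(2 * math.pi * t / 16) for t in range(256)]  # same tone, 90 degree phase shift

loss_same = mr_ri_loss(x, x, resolutions)
loss_shifted = mr_ri_loss(x, x_shifted, resolutions)
print(loss_same, loss_shifted)
```

The two signals differ only by a phase shift, so their magnitude spectra are nearly the same, yet the real, imaginary and complex-difference terms are clearly non-zero. This is the phase sensitivity that a magnitude-only spectral loss lacks.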
Adversarial Loss
- The GAN losses for the generator $G$ and the discriminators $D_{n}$ are:
  (Eq. 4) $\mathcal{L}_{adv}(D_{n};G)=\mathbb{E}_{(x,s)}\left[ (1-D_{n}(x_{n}))^{2}+(D_{n}(y_{n}))^{2}\right],\,\,\, \mathcal{L}_{adv}(G;D_{n})=\mathbb{E}_{s}\left[(1-D_{n}(y_{n}))^{2}\right]$
  - $x, y$ : the full-band ground-truth/generated audio samples, respectively
  - $x_{n}, y_{n}$ : the $n$-th sub-band ground-truth audio signal/generated sample, respectively
  - $s$ : ground-truth mel-spectrogram, $D_{n}$ : $n$-th discriminator, $N$ : number of discriminators
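
The two least-squares objectives in Eq. 4 can be written as small functions to make their roles explicit. This is a generic LSGAN-style sketch, not the paper's code: the expression that contains both real and generated scores trains a discriminator, and the expression that contains only generated scores trains the generator. Scores are plain Python lists here, and averaging over the list stands in for the expectation.

```python
def discriminator_loss(real_scores, fake_scores):
    """Least-squares discriminator objective: push D_n(x_n) toward 1 and D_n(y_n) toward 0."""
    real_term = sum((1.0 - r) ** 2 for r in real_scores) / len(real_scores)
    fake_term = sum(f ** 2 for f in fake_scores) / len(fake_scores)
    return real_term + fake_term


def generator_adv_loss(fake_scores):
    """Least-squares generator objective: push D_n(y_n) toward 1."""
    return sum((1.0 - f) ** 2 for f in fake_scores) / len(fake_scores)


# A perfect discriminator has zero loss, and the generator is then maximally penalized.
print(discriminator_loss([1.0, 1.0], [0.0, 0.0]))
print(generator_adv_loss([0.0, 0.0]))
```

In training, each of the $N$ discriminators would produce its own scores, and the per-discriminator values would be summed before being combined with the other loss terms.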
Final Loss
- The final loss of FA-GAN is:
  (Eq. 5) $\mathcal{L}_{G}=\lambda_{g}\sum_{n=1}^{N}\mathcal{L}_{adv}(G;D_{n})+ \lambda_{RI}\mathcal{L}_{MR\text{-}RI}+\lambda_{mel}\mathcal{L}_{mel}+\lambda_{fm}\mathcal{L}_{fm}$
  (Eq. 6) $\mathcal{L}_{D}=\sum_{n=1}^{N}\mathcal{L}_{adv}(D_{n};G)$
  - $\mathcal{L}_{mel}, \mathcal{L}_{fm}$ : the Mel loss and feature matching loss of HiFi-GAN
  - $\lambda_{g},\lambda_{RI},\lambda_{mel},\lambda_{fm}$ : scalar coefficients

3. Experiments

- Settings

Dataset : LJSpeech, VCTK
Comparisons : HiFi-GAN, UnivNet, Avocodo, BigVGAN

- Results

Overall, FA-GAN shows the best synthesis quality among the compared models

It also achieves the best MOS

Artifacts Visualization
- Comparing the generated spectrograms, FA-GAN shows explicit high-frequency harmonic detail thanks to the fine-grained supervision in the frequency domain

Ablation Study
- In the ablation study, removing any single component degrades performance
