[Paper Review] RefineGAN: Universally Generating Waveform Better than Ground Truth with Highly Accurate Pitch and Intensity Responses


feVeRin 2024. 7. 23. 09:34

RefineGAN: Universally Generating Waveform Better than Ground Truth with Highly Accurate Pitch and Intensity Responses

Generative Adversarial Network-based waveform generation relies heavily on the discriminator
- As a result, the generation process carries uncertainty, and pitch/intensity mismatches occur
RefineGAN
- Builds a pitch-guided refine architecture to maintain robustness and pitch/intensity accuracy
- Additionally adopts a multi-scale spectrogram-based loss function to stabilize training
Paper (INTERSPEECH 2022) : Paper Link

1. Introduction

Generative Adversarial Network (GAN)-based vocoders are widely used to synthesize realistic waveforms
- In particular, HiFi-GAN uses a multi-period discriminator and UnivNet introduces a discriminator based on the linear spectrogram, both improving speech quality
- BUT, most GAN-based vocoders focus on generating band-limited waveforms
  - Moreover, the generation process is unstable and suffers from pitch/intensity mismatch, so degraded synthesis quality remains an open problem

-> Hence the paper proposes RefineGAN, which supports speech synthesis at higher sample rates

RefineGAN
- Introduces a pitch-guided method that takes the full-band mel-spectrogram and the target pitch as network inputs, supporting stable, high-resolution synthesis
- Additionally adopts a multi-scale spectrogram-based loss function to stabilize training

< Overview of RefineGAN >

A GAN-based neural vocoder built on a pitch-guided architecture
As a result, it achieves better synthesis quality than existing vocoders on both seen and unseen datasets

2. Method

- Generator

RefineGAN uses an encoder-decoder-like architecture to refine a signal composed of pitch information (the speech template) into a waveform
- So that the refinement can use the mel-spectrogram, the model contains an encoder module that encodes the speech template into an intermediate representation and a decoder module that generates the waveform from that hidden form
  1. The encoding/decoding process downsamples the waveform to the same shape as the mel-spectrogram and then upsamples it into the final waveform
  2. Downsampling is performed by convolution layers, and upsampling uses transposed convolution layers
- Additionally, to exploit the pitch information, a UNet-like strategy is adopted: a cross-connection mechanism is applied between the corresponding blocks of the encoder and decoder
  - For this, parallel ResBlocks with different kernel sizes and dilation sizes are used
  - Each ResBlock consists of 3 sub-blocks, and each sub-block has 2 Leaky ReLUs and weight-normalized convolution layers
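
The length bookkeeping behind this downsample-then-upsample design can be checked with a few lines of arithmetic. The sketch below is only an illustration: the stride list `[8, 8, 4]`, the hop length of 256, and the `kernel = 2 * stride`, `padding = stride // 2` convention are assumptions chosen for the example, not values taken from the paper.

```python
def conv_out_len(length, stride):
    """Output length of a strided 1-D convolution (assumed kernel=2*stride, padding=stride//2)."""
    kernel, pad = 2 * stride, stride // 2
    return (length + 2 * pad - kernel) // stride + 1


def tconv_out_len(length, stride):
    """Output length of the mirrored transposed convolution with the same kernel and padding."""
    kernel, pad = 2 * stride, stride // 2
    return (length - 1) * stride - 2 * pad + kernel


def unet_lengths(num_samples, strides):
    """Return the time-axis length after every encoder block and every decoder block."""
    enc = [num_samples]
    for s in strides:  # encoder: speech template -> mel frame rate
        enc.append(conv_out_len(enc[-1], s))
    dec = [enc[-1]]
    for s in reversed(strides):  # decoder: mel frame rate -> waveform
        dec.append(tconv_out_len(dec[-1], s))
    return enc, dec


strides = [8, 8, 4]  # hypothetical strides; their product (256) is the assumed hop length
enc, dec = unet_lengths(256 * 40, strides)  # a template covering 40 mel frames
print(enc)  # lengths seen by the encoder
print(dec)  # lengths seen by the decoder
```

Because each decoder length equals the encoder length at the mirrored depth, the cross-connections can combine the two feature maps without any cropping.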

- Speech Template

The paper generates the speech template from the target pitch information as follows
- First, the speech template signal has the same length and sample rate as the target waveform, and the pitch information is obtained with the Harvest algorithm
- Next, for the unvoiced parts of the target signal, uniform noise is generated, which serves as the noise source of the GAN network
- For the voiced parts, one-sample-long pulses are generated, and the time between consecutive pulses is computed from the target pitch as the reciprocal of the frequency at that time (see the corresponding figure in the paper)
- The pulse value is then defined from an intensity-like value computed from the mel-spectrogram

This speech template passes the exact position of each pulse and the precise length of the target signal to the model, which helps lower the training difficulty
- As a result, instead of generating every feature from scratch, RefineGAN learns to appropriately fill in each blank between the pulses (see the corresponding figure in the paper)
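
The steps above can be sketched as a small function. This is a minimal illustration under stated assumptions: the per-sample `f0` and `amp` sequences are taken as given (in the paper they come from Harvest and from the mel-spectrogram, respectively), `f0 <= 0` is used here as the unvoiced marker, and scaling the noise by `amp` is a choice made for the example rather than something the source specifies.

```python
import random


def speech_template(f0, amp, sample_rate, seed=0):
    """Build a pulse/noise template with the same length as the per-sample f0 track."""
    rng = random.Random(seed)
    template = [0.0] * len(f0)
    next_pulse = 0.0  # (fractional) sample index at which the next pulse is due
    for t, (freq, a) in enumerate(zip(f0, amp)):
        if freq <= 0.0:
            # Unvoiced: uniform noise (amplitude scaling is an assumption of this sketch)
            template[t] = a * rng.uniform(-1.0, 1.0)
            next_pulse = t + 1.0  # restart pulse timing when voicing resumes
        elif t >= next_pulse:
            # Voiced: a one-sample-long pulse whose value follows the intensity-like value
            template[t] = a
            # Time to the next pulse is the reciprocal of the current frequency
            next_pulse += sample_rate / freq
    return template


# 50 ms of a 200 Hz voiced segment followed by a short unvoiced segment, at 16 kHz
f0 = [200.0] * 800 + [0.0] * 100
amp = [0.5] * 900
tpl = speech_template(f0, amp, sample_rate=16000)
print([i for i in range(800) if tpl[i] != 0.0])  # pulse positions in the voiced part
```

Keeping `next_pulse` as a float lets the pulse spacing track non-integer periods instead of rounding every period to a whole number of samples.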

- Pitch-based Data Augmentation

To obtain broader coverage from a small dataset, the paper randomly generates additional training data
- For this, a random pitch shift is applied to utterances randomly sliced from the source recordings
  - The pitch shift $\zeta$, expressed in semitones, is selected uniformly between a lower limit $\zeta_{\min}$ and an upper limit $\zeta_{\max}$
- To prevent artifacts during the pitch shift, the audio signal length is modified simultaneously and a resampling algorithm is applied
  - As a result, extremely high/low pitches are obtained, so the model can also be trained on rare circumstances (see the corresponding figure in the paper)
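
A rough sketch of this augmentation is given below. A shift of $\zeta$ semitones corresponds to a frequency ratio of $2^{\zeta/12}$, and resampling by that ratio changes pitch and length together. The linear interpolation used here is only a stand-in, since the source does not name the resampling algorithm, and the function names are hypothetical.

```python
import random


def resample_linear(signal, ratio):
    """Read `signal` at `ratio` times the original speed using linear interpolation."""
    out_len = max(1, round(len(signal) / ratio))
    last = len(signal) - 1
    out = []
    for i in range(out_len):
        pos = i * ratio
        lo = min(int(pos), last)
        hi = min(lo + 1, last)
        frac = pos - lo
        out.append((1.0 - frac) * signal[lo] + frac * signal[hi])
    return out


def random_pitch_shift(signal, zeta_min, zeta_max, rng):
    """Pick a semitone shift uniformly in [zeta_min, zeta_max] and resample accordingly."""
    zeta = rng.uniform(zeta_min, zeta_max)
    ratio = 2.0 ** (zeta / 12.0)  # +12 semitones doubles the pitch and halves the length
    return resample_linear(signal, ratio), zeta


rng = random.Random(0)
utterance = [float(n % 50) for n in range(1200)]  # toy periodic signal
shifted, zeta = random_pitch_shift(utterance, -12.0, 12.0, rng)
print(zeta, len(utterance), len(shifted))
```

Played back at the original sample rate, the shorter (or longer) output sounds higher (or lower) in pitch, which is why the length changes together with the pitch.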

- Multi-param Mel-spectrogram Loss Function

When mel-spectrograms are extracted from the predicted signal and the ground-truth audio, a single FFT size cannot give features that are maximally accurate in both the time domain and the frequency domain
- Therefore, to reflect the difference between the two signals comprehensively, the paper computes mel-spectrograms with a distinct parameter set at each level
- The average of the MSE losses between the predicted signal and the ground truth is then used as the mel-spectrogram loss:
  (Eq. 1) $\mathcal{L}_{mel}(\mathbf{y},\hat{\mathbf{y}})=\frac{1}{n}\sum_{i=1}^{n} \lVert \log M_{i}(\mathbf{y})-\log M_{i}(\hat{\mathbf{y}}) \rVert$
  - $n$ : number of parameter sets, $M_{i}$ : mel-spectrogram obtained with parameter set $i$
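
The structure of Eq. 1 can be illustrated as follows. To stay dependency-free, this sketch replaces the mel-spectrogram $M_i$ with a plain log-magnitude spectrogram from a naive DFT (no window, no mel filterbank), so it shows only the averaging over parameter sets, not the paper's actual feature extraction. The `(n_fft, hop)` pairs and the squared-error distance (following the "MSE" wording above) are assumptions.

```python
import cmath
import math


def log_spectrogram(signal, n_fft, hop, eps=1e-5):
    """Log-magnitude spectrogram via a naive DFT; a simple stand-in for log M_i(.)."""
    frames = []
    for start in range(0, len(signal) - n_fft + 1, hop):
        frame = signal[start:start + n_fft]
        bins = []
        for k in range(n_fft // 2 + 1):
            acc = sum(x * cmath.exp(-2j * math.pi * k * n / n_fft)
                      for n, x in enumerate(frame))
            bins.append(math.log(abs(acc) + eps))
        frames.append(bins)
    return frames


def multi_param_loss(y, y_hat, param_sets):
    """Average, over all parameter sets, of the mean squared log-spectrogram difference."""
    total = 0.0
    for n_fft, hop in param_sets:
        a = log_spectrogram(y, n_fft, hop)
        b = log_spectrogram(y_hat, n_fft, hop)
        sq = [(p - q) ** 2 for fa, fb in zip(a, b) for p, q in zip(fa, fb)]
        total += sum(sq) / len(sq)
    return total / len(param_sets)


param_sets = [(8, 4), (16, 8), (32, 16)]  # hypothetical (n_fft, hop) pairs
y = [math.sin(2 * math.pi * 5 * n / 64) for n in range(64)]
y_hat = [0.5 * v for v in y]
print(multi_param_loss(y, y_hat, param_sets))
```

Small FFT sizes localize events in time while large ones resolve frequency more finely, so averaging over several settings penalizes errors that any single resolution would miss.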

- Loudness Focused Enhancements

Envelope Loss Function
- To improve the intensity control of RefineGAN, envelope features are extracted with a 1D max-pooling layer and an envelope-based loss function is added
  - The envelope curve is obtained by applying the max-pooling layer to both the original audio and the reversed-polarity audio
- The MAE loss of the envelope is then:
  (Eq. 2) $\mathcal{L}_{envelope}(\mathbf{y},\hat{\mathbf{y}})=| \text{pool}(\mathbf{y})-\text{pool}(\hat{\mathbf{y}})| +| \text{pool} (-\mathbf{y})-\text{pool}(-\hat{\mathbf{y}})|$
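
A minimal sketch of Eq. 2 in plain Python is shown below. The pooling `kernel` and `stride` values are placeholders chosen for the example, not the paper's settings.

```python
def max_pool_1d(signal, kernel, stride):
    """1-D max pooling without padding."""
    return [max(signal[i:i + kernel])
            for i in range(0, len(signal) - kernel + 1, stride)]


def mean_abs_diff(a, b):
    """Mean absolute error between two equally long sequences."""
    return sum(abs(p - q) for p, q in zip(a, b)) / len(a)


def envelope_loss(y, y_hat, kernel=4, stride=4):
    """MAE between upper envelopes plus MAE between lower envelopes (Eq. 2)."""
    neg_y = [-v for v in y]
    neg_y_hat = [-v for v in y_hat]
    upper = mean_abs_diff(max_pool_1d(y, kernel, stride),
                          max_pool_1d(y_hat, kernel, stride))
    # Pooling the reversed-polarity signal tracks the negative peaks
    lower = mean_abs_diff(max_pool_1d(neg_y, kernel, stride),
                          max_pool_1d(neg_y_hat, kernel, stride))
    return upper + lower


y = [0.0, 1.0, 0.0, -1.0] * 4
y_hat = [0.5 * v for v in y]  # same shape, half the intensity
print(envelope_loss(y, y_hat))
```

Pooling `-y` as well as `y` matters for asymmetric waveforms, whose positive and negative peaks can follow different envelopes.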
Loudness-based Data Augmentation
- Loudness-based data augmentation is introduced so that RefineGAN can synthesize high-quality waveforms regardless of the loudness level of the utterance
  - With this scheme, the loudness level of each utterance can be randomly adjusted within an acceptable range
- Let $p_{\min}, p_{\max}$ denote the minimum/maximum peak values of the audio signal and $r_{\min}, r_{\max}$ the minimum/maximum rates of loudness adjustment
  1. The target peak value can then be randomly sampled as:
    (Eq. 3) $p'\sim \mathbf{U}[\max (p_{\min},r_{\min}p),\min(p_{\max},r_{\max}p)]$
    - $p$ : original peak value $\max(|y|)$
  2. The gain-adjusted audio signal is then obtained by computing $y'=\frac{y p'}{p}$
    - $y,y'$ : audio signal before/after augmentation, respectively
- The paper uses uniform random sampling on a linear scale and trains with greater loudness diversity so that the model is not affected by the loudness level
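
Eq. 3 and the gain step translate almost directly into code. The sketch below follows the formulas above. The numeric limits in the usage example are made up for illustration, and the handling of an all-zero (silent) signal is an added guard that the source does not discuss.

```python
import random


def loudness_augment(y, p_min, p_max, r_min, r_max, rng):
    """Rescale `y` so that its peak equals a target drawn uniformly as in Eq. 3."""
    p = max(abs(v) for v in y)  # original peak value
    if p == 0.0:
        return list(y), 0.0  # silent input: nothing to scale (guard added in this sketch)
    low = max(p_min, r_min * p)
    high = min(p_max, r_max * p)
    p_new = rng.uniform(low, high)  # linear-scale uniform sampling
    gain = p_new / p
    return [v * gain for v in y], p_new


rng = random.Random(0)
y = [0.1, -0.4, 0.2]
# Hypothetical limits: peaks kept within [0.1, 0.9], gain kept within [0.5x, 2x]
y_aug, p_new = loudness_augment(y, p_min=0.1, p_max=0.9, r_min=0.5, r_max=2.0, rng=rng)
print(p_new, y_aug)
```

In this example the sampling interval is $[\max(0.1, 0.2), \min(0.9, 0.8)] = [0.2, 0.8]$, so the absolute limits and the relative limits each constrain one side.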

- Discriminator

RefineGAN uses the following two discriminators to cover a diverse range of features
- First, the Multi-Period Discriminator of HiFi-GAN uses period parameters $[2,3,5,7,11]$ for 44100Hz full-band generation
- The second discriminator is the Multi-Resolution Discriminator of UnivNet
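
The role of the period parameter can be illustrated with the reshaping step that a multi-period discriminator applies before its 2D convolutions. This sketch shows only that folding step, using zero padding for simplicity (HiFi-GAN's reference implementation pads by reflection), and the helper name is hypothetical.

```python
def fold_by_period(signal, period):
    """Reshape a 1-D signal into rows of length `period`, zero-padding the tail."""
    pad = (-len(signal)) % period
    padded = list(signal) + [0.0] * pad
    return [padded[i:i + period] for i in range(0, len(padded), period)]


signal = [float(n) for n in range(10)]
for period in [2, 3, 5, 7, 11]:  # period parameters listed above
    folded = fold_by_period(signal, period)
    # Each column holds samples that are exactly `period` steps apart
    first_column = [row[0] for row in folded]
    print(period, len(folded), first_column)
```

Since the periods are prime, the sub-discriminators look at sample groupings that overlap as little as possible.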

- Training Losses

Generator Loss
- For the generator, the final generator loss is composed of the three losses above
- To stabilize the training process, a weight $\lambda$ is applied to $\mathcal{L}_{mel}$:
  (Eq. 4) $\begin{aligned}\mathcal{L}_{G}=\;&\lambda\mathcal{L}_{mel}(\mathbf{y},G(\mathbf{z},\mathbf{c}))+ \mathcal{L}_{envelope}(\mathbf{y},G(\mathbf{z},\mathbf{c}))+\frac{1}{n_{MPD}} \sum_{i=1}^{n_{MPD}}\mathbb{E}_{\mathbf{z},\mathbf{c}} \left[\log (1+\exp(-D_{MPD,i}(G(\mathbf{z},\mathbf{c}))))\right]\\ &+ \frac{1}{n_{MRD}}\sum_{i=1}^{n_{MRD}}\mathbb{E}_{\mathbf{z},\mathbf{c}}\left[ \log(1+\exp(-D_{MRD,i}(M_{i}(G(\mathbf{z},\mathbf{c}))) ))\right]\end{aligned}$
Discriminator Loss
- For both the Multi-Period Discriminator (MPD) and the Multi-Resolution Discriminator (MRD), the loss is averaged over the sub-modules, which have different parameters and are trained with a single Adam optimizer
- The sum of the following MPD and MRD terms is then used as the loss for backpropagation:
  (Eq. 5) $\mathcal{L}_{MPD}=\frac{1}{n_{MPD}}\sum_{i=1}^{n_{MPD}}\left( \mathbb{E}_{\mathbf{y}}[\log(1+\exp (-D_{MPD,i}(\mathbf{y})))]+\mathbb{E}_{\mathbf{z},\mathbf{c}}[\log(1+\exp( D_{MPD,i}(G(\mathbf{z},\mathbf{c}))))]\right)$
  (Eq. 6) $\mathcal{L}_{MRD}=\frac{1}{n_{MRD}}\sum_{i=1}^{n_{MRD}}\left(\mathbb{E}_{\mathbf{y}}[ \log (1+\exp(-D_{MRD,i}(M_{i}(\mathbf{y}))))]+\mathbb{E}_{\mathbf{z},\mathbf{c}}[ \log (1+\exp(D_{MRD,i}(M_{i}(G(\mathbf{z},\mathbf{c})))))]\right)$
  (Eq. 7) $\mathcal{L}_{D}=\mathcal{L}_{MPD}+\mathcal{L}_{MRD}$
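
Every adversarial term above has the form $\log(1+\exp(\cdot))$, i.e. a softplus of the (signed) discriminator output. The sketch below evaluates these terms on lists of raw discriminator scores, replacing the expectations with sample means. The score values are toy numbers and the function names are hypothetical.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def mean(values):
    values = list(values)
    return sum(values) / len(values)


def discriminator_term(real_scores, fake_scores):
    """One summand of Eq. 5/6: push real scores up and fake scores down."""
    return mean(softplus(-d) for d in real_scores) + mean(softplus(d) for d in fake_scores)


def generator_adv_term(fake_scores):
    """One adversarial summand of Eq. 4: push fake scores up."""
    return mean(softplus(-d) for d in fake_scores)


def discriminator_loss(per_module_scores):
    """Average the per-module terms, as in the 1/n factor of Eq. 5 and Eq. 6."""
    return mean(discriminator_term(r, f) for r, f in per_module_scores)


# Toy scores for two sub-discriminators: (scores on real audio, scores on generated audio)
scores = [([2.0, 1.5], [-1.0, -2.0]), ([0.5, 1.0], [-0.5, 0.0])]
print(discriminator_loss(scores))
print(generator_adv_term([-1.0, -2.0]))
```

Writing softplus as `max(x, 0) + log1p(exp(-|x|))` avoids overflow for large positive scores, which a direct `log(1 + exp(x))` would hit.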

3. Experiments

- Settings

Dataset : AISHELL-3, HiFi-TTS, Att-HACK, JSUT
Comparisons : Griffin-Lim, HiFi-GAN, UnivNet

- Results

In terms of overall performance, RefineGAN achieves the best quality

RefineGAN also performs well on unseen data

In terms of mel-spectrograms, RefineGAN synthesizes outputs similar to the ground truth
