[Paper Review] Source-Filter HiFi-GAN: Fast and Pitch Controllable High-Fidelity Neural Vocoder

feVeRin 2023. 11. 13. 11:18

Source-Filter HiFi-GAN: Fast and Pitch Controllable High-Fidelity Neural Vocoder

Unified Source-Filter GAN (USFGAN) introduces source-filter theory to achieve both high speech quality and pitch control
However, USFGAN is computationally expensive because it operates at a high temporal resolution
Source-Filter HiFi-GAN
- A fast, pitch-controllable neural vocoder that introduces source-filter theory into HiFi-GAN
- Hierarchically conditions the resonance filter on source excitation information
Paper (ICASSP 2023) : Paper Link

1. Introduction

A neural vocoder is a deep neural network that generates a raw waveform from input acoustic features
- Controllability of the fundamental frequency (F0) is important for flexibly generating the desired intonation and pitch patterns
- HiFi-GAN is one of the most widely used neural vocoder architectures
  - It gradually upsamples low temporal resolution features until they match the high temporal resolution raw waveform
  - HiFi-GAN delivers good speech quality, but its F0 controllability is poor
Unified Source-Filter GAN (USFGAN) is one option for improving on the F0 controllability of HiFi-GAN
- It decomposes the generator of Quasi-Periodic Parallel WaveGAN into a source excitation network and a resonance filtering network
- Its synthesis is slow because of the high temporal resolution inputs

-> Source-Filter HiFi-GAN is therefore proposed to enable both fast speech synthesis and F0 control

Source-Filter HiFi-GAN
- An architecture that introduces source-filter modeling into HiFi-GAN
- Two upsampling networks simulate the pseudo cascade mechanism of source-filter theory
- HiFi-GAN's parameters are pruned to offset the additional computational cost of source-filter modeling

< Overview of Source-Filter HiFi-GAN >

A HiFi-GAN architecture that achieves fast high-fidelity synthesis together with F0 controllability
Upsampling process through the source network and the filter network

2. Baseline HiFi-GAN and USFGAN

- HiFi-GAN

A neural vocoder trained with multi-period and multi-scale discriminators
The generator takes a mel-spectrogram as input and upsamples it
- Transposed convolutions, each followed by a multi-receptive field fusion (MRF) module, upsample the input until its temporal resolution matches that of the target raw waveform
- Each MRF module consists of multiple residual blocks
Generator training objective $L_{G}$
: $L_{G}=L_{g,adv}+\lambda_{fm}L_{fm}+\lambda_{mel}L_{mel}$
- adversarial loss $L_{g, adv}$, feature matching loss $L_{fm}$, mel-spectral L1 loss $L_{mel}$
- $\lambda_{fm}$, $\lambda_{mel}$ : balancing hyperparameters
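
As a rough illustration of how the three terms combine, the sketch below evaluates the objective with plain Python lists standing in for discriminator scores, discriminator feature maps, and mel-spectrogram values. The least-squares adversarial term and the layer-wise mean L1 feature distance follow the usual HiFi-GAN formulation; the function names and the toy inputs are placeholders made up for this example.

```python
def adversarial_loss(fake_scores):
    """Least-squares generator loss: mean of (D(G(c)) - 1)^2."""
    return sum((s - 1.0) ** 2 for s in fake_scores) / len(fake_scores)


def l1_mean(a, b):
    """Mean absolute difference between two equally sized sequences."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def feature_matching_loss(real_feats, fake_feats):
    """Sum over discriminator layers of the mean L1 feature distance."""
    return sum(l1_mean(r, f) for r, f in zip(real_feats, fake_feats))


def hifigan_generator_loss(fake_scores, real_feats, fake_feats,
                           real_mel, fake_mel, lambda_fm, lambda_mel):
    """L_G = L_adv + lambda_fm * L_fm + lambda_mel * L_mel."""
    l_adv = adversarial_loss(fake_scores)
    l_fm = feature_matching_loss(real_feats, fake_feats)
    l_mel = l1_mean(real_mel, fake_mel)
    return l_adv + lambda_fm * l_fm + lambda_mel * l_mel


# Toy usage: the weights here are illustrative values only.
loss = hifigan_generator_loss(
    fake_scores=[1.0, 0.0],
    real_feats=[[1.0, 2.0], [0.0]],
    fake_feats=[[1.0, 1.0], [0.5]],
    real_mel=[0.0, 0.0],
    fake_mel=[0.1, 0.3],
    lambda_fm=2.0,
    lambda_mel=45.0,
)
```

In a real training loop these quantities would be tensors produced by the discriminators, and the averages would run over batch and time.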

- Unified Source-Filter GAN (USFGAN)

Source excitation regularization loss
- USFGAN applies a regularization loss to the output of the source network so that a single DNN is decomposed into a source network and a filter network
- The output source excitation signal is made to approximate the residual signal of linear predictive coding
  : $L_{reg} = E_{x,c} [\frac{1}{N} || \log \psi(S) - \log \psi(\hat{S}) ||_{1}]$
  - $x, c$ : ground truth speech and input features, respectively
  - $\hat{S}, S$ : spectral magnitudes of the output source excitation signal and of the residual spectrogram, respectively
  - $\psi, N$ : the function that converts a spectral magnitude into the corresponding mel-spectrogram, and the number of mel-spectrogram dimensions, respectively
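
The regularization term can be sketched for a single frame as follows. This is a minimal sketch under stated assumptions: `toy_filterbank` is a hypothetical stand-in for $\psi$ that simply averages adjacent bins (a real system would apply a mel filterbank), and `eps` is added only to keep the logarithm finite. The expectation over $x, c$ would correspond to averaging this value over frames and batch items.

```python
import math


def toy_filterbank(magnitude, n_bands):
    """Stand-in for psi: average adjacent bins into n_bands bands.

    A real system would apply a mel filterbank here; equal-width
    averaging is only a placeholder to keep the sketch dependency-free.
    """
    width = len(magnitude) // n_bands
    return [sum(magnitude[i * width:(i + 1) * width]) / width
            for i in range(n_bands)]


def regularization_loss(residual_mag, source_mag, psi, eps=1e-10):
    """(1/N) * || log psi(S) - log psi(S_hat) ||_1 for one frame."""
    s = psi(residual_mag)      # psi(S): residual spectrogram frame
    s_hat = psi(source_mag)    # psi(S_hat): source excitation frame
    n = len(s)                 # N: number of (mel) bands
    return sum(abs(math.log(a + eps) - math.log(b + eps))
               for a, b in zip(s, s_hat)) / n


# Toy usage with two bands.
psi = lambda m: toy_filterbank(m, 2)
l_reg = regularization_loss([1.0, 1.0, 4.0, 4.0], [2.0, 2.0, 2.0, 2.0], psi)
```

The loss is zero when the two band-wise envelopes agree, and it grows with their log-magnitude distance.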
Pitch-dependent excitation generation
- A Pitch-dependent Dilated Convolution Neural Network (PDCNN) is used to improve F0 extrapolation
  - $F_{s}, f_{t}, d$ : sampling frequency, F0 value at time step $t$, and the constant dilation factor of the CNN, respectively
  - In PDCNN, the dilation size of the CNN changes dynamically at each $t$ according to $f_{t}$
- Time-variant dilation size $d_{t}$
  : $d_{t} = \left\{\begin{matrix} |E_{t}| \times d \quad if \, E_{t} > 1 \\ 1 \times d \quad else \end{matrix}\right.$
  - $E_{t} = F_{s}/(f_{t}\times a)$ : a value proportional to the period length, modulated by the dense factor $a$
  - The dense factor controls the maximum frequency that can be represented at the given sampling frequency $F_{s}$
- A sine wave is used as input for better F0 controllability
  - The sine wave supplies periodic information, which stabilizes training and improves F0 controllability
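
The two ideas above, a dilation that follows the pitch period and a sine-wave input derived from F0, can be sketched in a few lines. This is an illustration under assumptions rather than the paper's implementation: $E_{t}$ is rounded down so the dilation is an integer, unvoiced steps (`f0 <= 0`) fall back to the base dilation $d$, and unvoiced samples of the sine input are set to zero. The function names are made up for this example.

```python
import math


def pitch_dependent_dilation(f0, sampling_rate, dense_factor, base_dilation):
    """Return the time-variant dilation d_t for one time step.

    E_t = F_s / (f_t * a); d_t = E_t * d when E_t > 1, otherwise d.
    Assumptions of this sketch: E_t is rounded down so the dilation is
    an integer, and unvoiced steps (f0 <= 0) fall back to d.
    """
    if f0 <= 0:
        return base_dilation
    e_t = sampling_rate / (f0 * dense_factor)
    if e_t > 1:
        return math.floor(e_t) * base_dilation
    return base_dilation


def sine_from_f0(f0_per_sample, sampling_rate):
    """Generate a sine input by accumulating the phase 2*pi*f_t/F_s.

    Assumption: unvoiced samples (f0 <= 0) are emitted as 0.0.
    """
    phase = 0.0
    wave = []
    for f0 in f0_per_sample:
        if f0 <= 0:
            wave.append(0.0)
            continue
        phase += 2.0 * math.pi * f0 / sampling_rate
        wave.append(math.sin(phase))
    return wave


# Toy usage: a 200 Hz frame at 24 kHz with dense factor 4 and d = 1.
d_t = pitch_dependent_dilation(200.0, 24000, 4, 1)
sine = sine_from_f0([6000.0] * 4, 24000)
```

A lower F0 means a longer period and therefore a larger dilation, so the receptive field keeps covering a comparable number of pitch cycles as F0 changes.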

3. Source-Filter HiFi-GAN

- Generator Architecture

Source excitation generation network
- The source network is built from downsampling 1D CNNs, transposed 1D CNNs, and quasi-periodic residual blocks (QP-ResBlocks)
  - The input sine wave passes through strided CNNs to provide a resolution-matched periodic representation
- A QP-ResBlock consists of Leaky ReLU, PDCNN, and 1D CNN layers with a residual connection
- The final QP-ResBlock output is regularized and used as the source excitation signal
Resonance filtering network
- The filter network is identical to the HiFi-GAN generator, except that feature maps from the source network are added to the outputs of the transposed CNNs
  - The feature maps are processed by additional downsampling CNNs composed of sine embedding CNNs
  - The final speech is obtained as the output of the filter network
- The hyperparameters of the MRF modules are pruned to offset the additional computational cost of the source network
  - Kernel sizes are reduced to $\{3,5,7\}$ to speed up synthesis
- Feeding the QP-ResBlock outputs to the filter network through the downsampling CNNs makes it possible to generate high-frequency speech components
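
The resolution matching described above comes down to simple length bookkeeping, sketched here. The upsampling scales `(5, 4, 3, 2)` and the helper names are assumptions chosen for illustration and are not taken from the text; the point is only that a sample-rate signal downsampled with the right stride has the same length as the frame-rate features after each upsampling stage.

```python
from math import prod


def stage_lengths(n_frames, upsample_scales):
    """Length of the feature sequence after each transposed CNN."""
    lengths = []
    length = n_frames
    for scale in upsample_scales:
        length *= scale
        lengths.append(length)
    return lengths


def sine_strides(upsample_scales):
    """Stride that brings a sample-rate signal down to each stage.

    After stage i the sequence is still prod(scales[i+1:]) times shorter
    than the waveform, so the signal is downsampled by that factor.
    """
    return [prod(upsample_scales[i + 1:])
            for i in range(len(upsample_scales))]


def matched_sine_lengths(n_frames, upsample_scales):
    """Sine-wave length seen at each stage after strided downsampling."""
    n_samples = n_frames * prod(upsample_scales)
    return [n_samples // s for s in sine_strides(upsample_scales)]


# Toy usage: 10 frames with hypothetical scales (hop size 120).
scales = (5, 4, 3, 2)
up = stage_lengths(10, scales)
down = matched_sine_lengths(10, scales)
```

Because both lists agree stage by stage, the periodic representation can be added element-wise to the upsampled features at every resolution.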

- Training Criteria

The feature matching loss of HiFi-GAN is replaced with the regularization loss
: $L_{G} = L_{g,adv}+\lambda_{mel}L_{mel} + \lambda_{reg}L_{reg}$

4. Experiments

- Settings

Dataset : Namine Ritsu's Database
Comparisons : WORLD, hn-uSFGAN, HiFi-GAN

- Objective Evaluation

Real-time factor (RTF) and the number of parameters are measured to assess synthesis efficiency
- RMSE of log F0 and the voiced/unvoiced decision error rate (V/UV) are examined to assess F0 controllability
The proposed Source-Filter HiFi-GAN (SiFi-GAN) achieves F0 controllability comparable to WORLD and hn-uSFGAN
In terms of synthesis speed, it uses fewer parameters and achieves a better RTF than hn-uSFGAN
- Its synthesis is also faster than that of HiFi-GAN
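
For reference, the objective metrics named above can be computed roughly as follows. This is a sketch under common conventions, which are assumptions here rather than details from the paper: `f0 == 0` marks an unvoiced frame, the log-F0 RMSE is taken only over frames voiced in both contours, and the V/UV error rate is the fraction of frames whose voicing decisions disagree.

```python
import math


def real_time_factor(synthesis_seconds, audio_seconds):
    """RTF = time spent synthesizing / duration of the generated audio."""
    return synthesis_seconds / audio_seconds


def f0_metrics(ref_f0, gen_f0):
    """Return (log-F0 RMSE, V/UV error rate) over aligned frames.

    Assumptions: f0 == 0 marks an unvoiced frame, and the RMSE is
    taken only over frames voiced in both contours.
    """
    sq_errors = []
    vuv_errors = 0
    for ref, gen in zip(ref_f0, gen_f0):
        ref_voiced, gen_voiced = ref > 0, gen > 0
        if ref_voiced != gen_voiced:
            vuv_errors += 1
        elif ref_voiced:
            sq_errors.append((math.log(ref) - math.log(gen)) ** 2)
    rmse = math.sqrt(sum(sq_errors) / len(sq_errors)) if sq_errors else 0.0
    return rmse, vuv_errors / len(ref_f0)


# Toy usage with four frames.
rmse, vuv = f0_metrics([100.0, 200.0, 0.0, 0.0], [100.0, 400.0, 0.0, 150.0])
rtf = real_time_factor(0.5, 2.0)
```

An RTF below 1 means that synthesis runs faster than real time, and lower values of all three metrics are better.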

- Subjective Evaluation

Mean Opinion Score (MOS) evaluation of the synthesized speech
- SiFi-GAN also performs well in terms of speech quality
Ablation Study
- SiFi-GAN is compared with SiFi-GAN Direct, which passes the source excitation representation of the QP-ResBlocks directly to the filter network without downsampling
- SiFi-GAN Direct fails to synthesize high-frequency components properly
  - This suggests that a source network with learnable downsampling CNNs provides hierarchical harmonic information that is tractable at each temporal resolution

Spectrogram comparison of synthesized speech: SiFi-GAN (left), SiFi-GAN Direct (right)
