[Paper Review] Efficient WaveGlow: An Improved WaveGlow Vocoder with Enhanced Speed

feVeRin 2025. 1. 1. 10:27

Efficient WaveGlow: An Improved WaveGlow Vocoder with Enhanced Speed

WaveGlow can be improved to obtain a more efficient neural vocoder
Efficient WaveGlow
- Uses a normalizing flow backbone composed of affine coupling layers and invertible $1\times 1$ convolutions
- Replaces the original WaveNet-style network with an FFTNet-style dilated convolution network
- Applies group convolution to both the audio and local condition features
- Shares the local condition across the transform network layers of each coupling layer
Paper (INTERSPEECH 2020) : Paper Link

1. Introduction

Text-to-speech (TTS) models such as FastSpeech need a neural vocoder to synthesize speech
- Neural vocoders fall broadly into autoregressive and non-autoregressive vocoders
  1. Autoregressive vocoders are built from dilated convolutions, as in WaveNet, or from recurrent networks, as in LPCNet
  2. Non-autoregressive vocoders, such as MelGAN, mostly rely on Generative Adversarial Networks (GANs)
    - Unlike autoregressive models, they support parallel synthesis
- A non-autoregressive vocoder can also be built from a flow-based model
  1. WaveGlow is a representative example: it synthesizes speech with a Glow-style normalizing flow and simplifies training by using a single model and a single loss function
  2. BUT, WaveGlow consists of 12 coupling blocks and 12 invertible $1\times 1$ convolution layers, and each coupling block has 8 dilated convolution layers, so it is hard to use in memory-constrained environments
    - It is especially computationally expensive for CPU-based inference

-> The paper therefore proposes Efficient WaveGlow, which reduces the computational cost of WaveGlow

Efficient WaveGlow
- Introduces an FFTNet-style network in place of the WaveNet-style transform network
- Applies group convolution to reduce the parameters of the transform network
- Shares the local condition across all transform network layers of each coupling layer

< Overview of Efficient WaveGlow >

A neural vocoder that improves the computational efficiency of WaveGlow
As a result, it keeps synthesis quality on par with WaveGlow while reducing the parameter count and speeding up inference

2. Preliminaries

- Normalizing Flow

A normalizing flow converts a probability density into a target probability density through a sequence of invertible mappings
- Suppose an invertible mapping $f$ transforms a random variable $z$ with distribution $p(z)$
- Then, by the change-of-variables rule, the resulting random variable $z'=f(z)$ has the following log-probability:
  (Eq. 1) $\log p(z')=\log p(z)+\log |\det(\partial z/\partial z')|$
  - $\det(\cdot)$ : determinant of the Jacobian matrix
- The random variable $x$ is obtained by successively transforming a random variable $z_{0}$ through a chain of invertible mappings:
  (Eq. 2) $z_{0}\sim\mathcal{N}(z_{0};0,I)$
  (Eq. 3) $x=f_{K}\circ\cdots\circ f_{2}\circ f_{1}(z_{0})$
  - $\mathcal{N}(z_{0};0,I)$ : multivariate Gaussian distribution with zero mean and unit variance
  - $K$ : number of invertible mappings
  - (Eq. 3) is shorthand for the nested sequence $f_{K}(\cdots f_{2}(f_{1}(z_{0})))$, which can be written step by step as $z_{k}=f_{k}(z_{k-1})$ with $x=z_{K}$
- As a result, the log-likelihood of $x$ can be computed directly with the change-of-variables rule:
  (Eq. 4) $\log p(x)=\log p(z_{0})+\sum_{k=1}^{K}\log |\det(\partial z_{k-1}/\partial z_{k})|$
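As a rough illustration of (Eq. 3) and (Eq. 4), the sketch below chains three 1-D affine mappings and evaluates the log-likelihood by inverting the chain. The flow parameters in `FLOWS` and the function names are hypothetical, chosen only so the result can be checked by hand; a real vocoder uses learned, high-dimensional mappings.

```python
import math

def std_normal_logpdf(z):
    """Log-density of a 1-D standard normal, i.e. log p(z_0) in (Eq. 4)."""
    return -0.5 * (z * z + math.log(2.0 * math.pi))

# Hypothetical flow parameters: each step is f_k(z) = a_k * z + b_k with a_k != 0.
FLOWS = [(2.0, 1.0), (0.5, -3.0), (4.0, 0.25)]

def sample_forward(z0, flows=FLOWS):
    """Apply f_1, then f_2, ..., then f_K to z_0, as in (Eq. 3)."""
    z = z0
    for a, b in flows:
        z = a * z + b
    return z

def log_likelihood(x, flows=FLOWS):
    """Evaluate (Eq. 4) by inverting the chain from x = z_K back to z_0."""
    z = x
    log_det_sum = 0.0
    for a, b in reversed(flows):
        z = (z - b) / a                    # z_{k-1} = f_k^{-1}(z_k)
        log_det_sum += -math.log(abs(a))   # log |d z_{k-1} / d z_k| = -log |a_k|
    return std_normal_logpdf(z) + log_det_sum

x = sample_forward(0.3)
print(x, log_likelihood(x))
```

With these parameters the chain composes to $x=4z_{0}-9.75$, so the result should agree with the density of a Gaussian with mean $-9.75$ and standard deviation $4$.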

- WaveGlow

WaveGlow is a flow-based generative model that converts white noise into speech
- Each flow step consists of an affine coupling layer and an invertible $1 \times 1$ convolution
- In the affine coupling layer, the input feature $x$ is split along the channel dimension into two halves $x_{a}, x_{b}$
  1. $x_{a}$ is left unchanged, and $x_{b}$ is updated by an affine transform whose parameters are computed from $x_{a}$:
    (Eq. 5) $x_{a},x_{b}=\text{split}(x)$
    (Eq. 6) $s,t=\text{transform_network}(x_{a},\text{local_condition})$
    (Eq. 7) $y=\text{concat}(x_{a},x_{b}\odot s+t)$
    - $y$ : layer output
  2. The Jacobian matrix of the affine coupling layer is lower triangular, so its log-determinant equals the sum of the logs of its diagonal elements
    - Because the coupling layer is invertible by construction, the transform network itself does not need to be invertible
- An invertible $1\times 1$ convolution is added after each affine coupling layer
  - It mixes information across the two halves, which addresses the problem of channels that would otherwise never be updated
- WaveGlow uses a WaveNet-style transform network made of 8 dilated 1D convolution layers
  1. Each layer uses a convolution kernel of width 3, so each sample has a receptive field extending to both its left and right
  2. The log-determinant of the Jacobian of the affine coupling layer $f_{coupling}^{-1}(x)$ is then:
    (Eq. 8) $\log |\det(J(f^{-1}_{coupling}(x)))|=\log |s|=\sum_{c=1}^{C}\log s_{c}$
    - $C$ : number of channels
  3. The log-determinant of the Jacobian of the invertible $1\times 1$ convolution layer $f^{-1}_{1\times 1 conv}(x)=Wx$ is:
    (Eq. 9) $\log |\det(J(f_{1\times 1conv}^{-1}(x)))|=\log |\det(W)|$
- As a result, WaveGlow is trained by maximizing the following log-likelihood (the training loss is its negative):
  (Eq. 10) $\log p(x)=\log p(z)+\sum_{k=1}^{K}\log |\det(W_{k})|+\sum_{k=1}^{K}\log |s_{k}|$
  - $K$ : total number of flow steps
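The affine coupling step in (Eq. 5)-(Eq. 8) can be sketched as follows for a single time step with a handful of channels. `toy_transform_network` is a hypothetical stand-in, not WaveGlow's dilated convolution stack; it only needs to return a positive scale $s$ and a shift $t$ computed from $x_{a}$ and the local condition.

```python
import math

def toy_transform_network(x_a, local_condition):
    """Hypothetical transform network: an arbitrary, non-invertible function
    of (x_a, local_condition) that returns a positive scale s and a shift t."""
    h = sum(x_a) + local_condition
    s = [math.exp(math.tanh(h + c)) for c in range(len(x_a))]  # s > 0
    t = [0.5 * h - c for c in range(len(x_a))]
    return s, t

def coupling_forward(x, local_condition):
    """(Eq. 5)-(Eq. 7): keep x_a, update x_b, and return the log-determinant."""
    half = len(x) // 2            # assumes an even number of channels
    x_a, x_b = x[:half], x[half:]
    s, t = toy_transform_network(x_a, local_condition)
    y_b = [xb * sc + tc for xb, sc, tc in zip(x_b, s, t)]
    log_det = sum(math.log(sc) for sc in s)   # (Eq. 8): sum of log s_c
    return x_a + y_b, log_det

def coupling_inverse(y, local_condition):
    """Undo the coupling: x_a passes through unchanged, so s and t can be
    recomputed exactly and the affine update inverted."""
    half = len(y) // 2
    y_a, y_b = y[:half], y[half:]
    s, t = toy_transform_network(y_a, local_condition)
    x_b = [(yb - tc) / sc for yb, sc, tc in zip(y_b, s, t)]
    return y_a + x_b

x = [0.2, -1.0, 0.7, 1.5]
y, log_det = coupling_forward(x, local_condition=0.1)
print(y, log_det, coupling_inverse(y, local_condition=0.1))
```

The round trip recovers `x` even though `toy_transform_network` cannot be inverted, which is the property that lets the transform network be an arbitrary neural network.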

3. Efficient WaveGlow

Efficient WaveGlow (EWG) builds on a Glow-style normalizing flow with an improved transform network and applies the following 3 modifications:
1. Replace the WaveNet-style transform network with an FFTNet-style network
2. Introduce group convolution to reduce model parameters
3. Share the local condition across transform network layers within each flow step

- FFTNet-Style Affine Transform Network

FFTNet uses a network that follows the structure of the Fast Fourier Transform (FFT)
- Given an input audio sequence $x_{0},x_{1},...,x_{N-1}$, each FFTNet layer first splits the input sequence into two halves $x_{L},x_{R}$,
- applies a separate $1\times 1$ convolution kernel to each half, and then sums the results:
  (Eq. 11) $z=W_{L}*x_{L}+W_{R}*x_{R}$
  - $*$ : convolution operator
  - $W_{L},W_{R}$ : $1\times 1$ convolution kernels for $x_{L}, x_{R}$
- FFTNet stacks 11 layers to enlarge the receptive field and uses the final layer output to predict sample $x_{N}$
  1. In particular, it replaces WaveNet's gated activation with ReLU activation and removes the skip output of each layer
  2. As a result, FFTNet can be viewed as a dilated convolution with kernel width 2 whose dilations are in reversed order
In the paper, the FFTNet-style network is used to reduce the computational complexity of the transform network
- Specifically, FFTNet's causal convolution is replaced with a symmetric convolution with a kernel of width 3, which doubles the receptive field:
  (Eq. 12) $z_{i}=W_{L}*x_{i-d}+W_{M}*x_{i}+W_{R}*x_{i+d}$
  - $W_{L},W_{M}, W_{R}$ : $1\times 1$ convolution kernels for $x_{i-d},x_{i},x_{i+d}$
  - $d$ : dilation of the current layer, $i$ : sample (time) index
- The original FFTNet uses a separate local condition convolution kernel for each half, whereas the paper uses a single $1\times 1$ convolution kernel for the local condition in each layer:
  (Eq. 13) $z_{i}=(W_{L}*x_{i-d}+W_{M}*x_{i}+W_{R}*x_{i+d})+V_{i}*h_{i}$
  - $V_{i}$ : $1\times 1$ convolution kernel, $h_{i}$ : local condition of the $i$-th audio sample
- As a result, each output sample has a receptive field of 9 samples; a ReLU is applied to the dilated convolution output before the $1\times 1$ convolution and again after it, as $x=\text{ReLU}(1\times 1 \text{conv}(\text{ReLU}(z)))$
  - Additionally, a residual connection is applied to each dilated convolution layer of the transform network to prevent the vanishing gradient problem
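A minimal single-channel sketch of one such layer, following (Eq. 13), the $\text{ReLU}(1\times 1 \text{conv}(\text{ReLU}(z)))$ pattern, and the residual connection. With one channel every $1\times 1$ kernel collapses to a scalar weight; the function name, the scalar weights, and the zero padding at the sequence edges are assumptions made for illustration only.

```python
def relu(v):
    return max(0.0, v)

def ewg_style_layer(x, h, d, w_l, w_m, w_r, v, w_out):
    """One symmetric dilated layer on a single-channel sequence.

    x: audio features, h: local condition (same length as x), d: dilation.
    w_l, w_m, w_r, v, w_out are scalar stand-ins for 1x1 convolution kernels.
    """
    n = len(x)

    def at(i):
        # Zero padding outside the sequence (an assumption of this sketch).
        return x[i] if 0 <= i < n else 0.0

    out = []
    for i in range(n):
        # (Eq. 13): left, middle, and right taps plus the local condition term.
        z = w_l * at(i - d) + w_m * x[i] + w_r * at(i + d) + v * h[i]
        y = relu(w_out * relu(z))   # ReLU(1x1 conv(ReLU(z)))
        out.append(x[i] + y)        # residual connection
    return out

impulse = [0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0]
no_condition = [0.0] * len(impulse)
print(ewg_style_layer(impulse, no_condition, d=2,
                      w_l=1.0, w_m=1.0, w_r=1.0, v=1.0, w_out=1.0))
```

Feeding an impulse shows the symmetric taps directly: the response appears at offsets $-d$, $0$, and $+d$ around the impulse, whereas a causal width-2 kernel would only reach one side.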

- Group Convolution

Group convolution can reduce the model parameters and FLOPs
- Group convolution is therefore applied to both the audio feature convolution and the local condition convolution
- As a result, a group convolution with $n$ groups reduces the FLOPs and parameter count of the convolution layer by a factor of $n$
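The factor-of-$n$ saving follows from each group connecting only $C_{in}/n$ input channels to $C_{out}/n$ output channels. The helper below counts weights for a 1D convolution under that rule; the channel sizes in the usage lines are hypothetical, biases are ignored, and the FLOPs estimate assumes one multiply-add (2 FLOPs) per weight per output sample.

```python
def conv1d_params(c_in, c_out, kernel_width, groups=1):
    """Number of weights in a 1D convolution (bias ignored)."""
    if c_in % groups or c_out % groups:
        raise ValueError("channel counts must be divisible by groups")
    # Each output channel only sees c_in / groups input channels.
    return (c_in // groups) * c_out * kernel_width

def conv1d_flops_per_sample(c_in, c_out, kernel_width, groups=1):
    """Rough FLOPs per output time step: one multiply-add per weight."""
    return 2 * conv1d_params(c_in, c_out, kernel_width, groups)

dense = conv1d_params(64, 64, 3)              # hypothetical layer size
grouped = conv1d_params(64, 64, 3, groups=4)
print(dense, grouped, dense // grouped)
```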

- Local Condition

The mel-spectrogram is used as the local condition feature and is encoded by a local condition encoder to obtain contextualized features
- Two mel-spectrogram encoders are considered:
  1. BLSTM encoder : produces a local condition with global, bi-directional contextual information
  2. Conv1d encoder : extracts local context information with 1D convolution layers
- Because the mel-spectrogram encoder is a frame-rate network and the normalizing flow is a sample-rate network, the encoder output is upsampled to the sample rate
- Additionally, the $1\times 1$ convolution kernels for the local condition account for a large share of the parameters, so the local condition can be shared across the transform network layers within each flow step
  1. The paper therefore transforms the upsampled local condition with a $1\times 1$ convolution, as in (Eq. 13), and then shares the result with all layers of the transform network
  2. This Shared Local Condition (SLC) substantially reduces model parameters and complexity
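To get a feel for why sharing helps, the sketch below counts the $1\times 1$ local condition weights in one flow step with and without sharing. The channel sizes and layer count are hypothetical placeholders, not the paper's configuration, and biases are ignored.

```python
def local_condition_params(cond_channels, hidden_channels, num_layers, shared):
    """Weights spent on 1x1 local condition projections in one flow step."""
    per_projection = cond_channels * hidden_channels  # one 1x1 conv, no bias
    # Shared: project once and reuse the result in every layer.
    # Unshared: every transform network layer owns its own projection.
    return per_projection if shared else per_projection * num_layers

# Hypothetical sizes, for illustration only.
unshared = local_condition_params(128, 64, num_layers=8, shared=False)
shared = local_condition_params(128, 64, num_layers=8, shared=True)
print(unshared, shared, unshared // shared)
```

Under these assumptions the saving equals the number of layers in the transform network, and the projection is also computed once per flow step instead of once per layer.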

4. Experiments

- Settings

Dataset : LJSpeech
Comparisons : WaveGlow

- Results

In terms of computational cost (FLOPs and parameters), Efficient WaveGlow achieves a reduction of up to $\times 15$

In terms of inference speed, it is $\times 5.3$ faster than the original WaveGlow

In terms of MOS, it maintains synthesis quality on par with the original WaveGlow
