[Paper Review] Bunched LPCNet: Vocoder for Low-cost Neural Text-to-Speech Systems

feVeRin 2024. 7. 14. 10:29

Bunched LPCNet: Vocoder for Low-cost Neural Text-to-Speech Systems

LPCNet combines linear prediction with a neural network, which greatly reduces computational complexity
Bunched LPCNet
- Sample bunching lets LPCNet generate two or more audio samples per inference
- Bit bunching reduces the computation in the final layer of LPCNet
Paper (INTERSPEECH 2020) : Paper Link

1. Introduction

LPCNet achieves strong results in both inference speed and synthesis quality
- In particular, it builds on the source-filter model and introduces a low-cost linear prediction filter, which relieves the network of predicting the vocal tract response
  - A smaller WaveRNN-style neural network is then used to reconstruct the speech
- Because LPCNet is small and low-complexity, it can be used as an on-device vocoder
  - BUT, due to the autoregressive nature of WaveRNN, speech samples are inferred one at a time, each conditioned on the previous samples, so a computational bottleneck remains

-> Bunched LPCNet is proposed to further reduce the computational complexity of the original LPCNet

Bunched LPCNet
- Introduces Sample Bunching so that the LPCNet architecture generates two or more samples per inference
- Adds Bit Bunching, which segregates the final softmax layer into 2 bunches to reduce the layer size and computation

< Overview of Bunched LPCNet >

A lightweight neural vocoder that improves the original LPCNet by adopting Sample Bunching and Bit Bunching
As a result, inference speed improves greatly while the synthesis quality of the original LPCNet is maintained

2. LPCNet Overview

LPCNet reduces computational cost by using an all-pole LPC filter ($M$ coefficients) to model the vocal tract response and a small neural network to predict the excitation signal
- Structurally, it uses a Frame Rate Network (FRN), which runs once per input frame, and a Sample Rate Network (SRN), which generates one sample per inference and therefore runs $N$ times per frame, where $N$ is the frame size
  - Most of the computational burden is concentrated in the SRN
- The SRN consists of 2 GRUs and a dual fully-connected (dual FC) layer combined with a softmax layer that models the probability distribution of the excitation
  1. The excitation signal $e_{t}$ is sampled from this distribution and then combined with the LPC filter prediction $p_{t}$ to produce the audio sample $s_{t}$
  2. The prediction for the current time step, together with the excitation and speech sample of the previous time step, is fed to the SRN as embedded representations
- LPCNet also applies weight sparsification to the recurrent weight matrix of $GRU_{A}$ to reduce complexity without degrading quality
  1. After sparsification, the input weight matrix $U$ of $GRU_{A}$ is more expensive than the recurrent weight matrix $W$
    - Therefore the matrix-vector multiplication between $U$ and the embedding vectors of $p,s,e$ is converted into addition operations through a pre-computed lookup table of the corresponding weights
  2. As a result, LPCNet achieves lower complexity than other autoregressive (AR) vocoders
- BUT, within the SRN the two GRU units and the dual FC layer account for 85% and 15% of the computation respectively, so the computation cost of these blocks still needs to be reduced
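
The source-filter split described above can be sketched in a few lines. This is a minimal illustration only: the function names, the zero initial history, and the toy coefficient are assumptions, and the excitation is passed in directly instead of being sampled from the SRN's softmax output.

```python
def lpc_predict(history, coeffs):
    """Linear prediction p_t = sum_k a_k * s_{t-k} over the last M samples."""
    # history[-1] is s_{t-1}, history[-2] is s_{t-2}, and so on.
    return sum(a * s for a, s in zip(coeffs, reversed(history)))


def synthesize(excitations, coeffs):
    """Rebuild samples s_t = p_t + e_t from a given excitation sequence."""
    m = len(coeffs)
    samples = [0.0] * m  # zero initial history (assumption of this sketch)
    for e_t in excitations:
        p_t = lpc_predict(samples[-m:], coeffs)
        samples.append(p_t + e_t)
    return samples[m:]


# Toy first-order predictor (M = 1, a_1 = 0.5) driven by a unit impulse.
print(synthesize([1.0, 0.0, 0.0, 0.0], [0.5]))  # [1.0, 0.5, 0.25, 0.125]
```

The linear part costs only $M$ multiply-adds per sample, which is why the neural network only has to model the excitation.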

3. Method

- Sample Bunching

Sample bunching aims to reduce computational cost by letting the SRN generate two or more samples (a bunch) per inference
- Such multiple sample generation is common in non-autoregressive vocoders that exploit parallel inference on GPUs, but it is hard to apply to AR vocoders such as LPCNet because of the dependency on previous outputs
- To generate bunches of size 2 or more while keeping the autoregressive nature of LPCNet, the paper exploits the fact that the GRUs in the SRN have enough capacity to produce a bunch of samples
  1. First, the SRN of Bunched LPCNet shares the GRU layers across all samples in a bunch and has an individual dual FC layer for each excitation prediction in the bunch
  2. The dual FC layer for the first excitation is conditioned only on the $GRU_{B}$ output $\mathbf{c}$, i.e. $\hat{e}_{t}\sim p(e_{t}|\mathbf{c})$
  3. The remaining excitations are additionally conditioned on the previous excitations within the bunch through an embedding feed, i.e. $\hat{e}_{t+k}\sim p(e_{t+k}|\mathbf{c}, \hat{e}_{t},...,\hat{e}_{t+k-1})$
- The number of SRN iterations drops by a factor of the bunch size $\mathbf{S}$, but the input to $GRU_{A}$ grows linearly with $\mathbf{S}$, so the input matrix $U$ of $GRU_{A}$ becomes larger
  - BUT, since LPCNet replaces the multiplication with $U$ by pre-computed lookup-table additions, this increase in complexity is negligible
  - As a result, the computation of the shared GRUs scales in proportion to $1/\mathbf{S}$
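
The control flow of sample bunching can be sketched as below. This is a structural illustration only: `shared_gru_step` and `sample_excitation` are hypothetical stand-ins with made-up arithmetic, not the trained $GRU_{A}$/$GRU_{B}$ or dual FC layers, and the single LPC coefficient is a toy value. The point is that the shared GRU step runs once per bunch while each sample in the bunch still gets its own excitation draw and LPC prediction.

```python
import random


def shared_gru_step(state, inputs):
    """Stand-in for GRU_A + GRU_B, called once per bunch (hypothetical stub)."""
    new_state = 0.9 * state + 0.1 * sum(inputs)
    return new_state, new_state  # (recurrent state, conditioning vector c)


def sample_excitation(c, prev_in_bunch, rng):
    """Stand-in for one dual FC head: e ~ p(e | c, earlier e's in the bunch)."""
    return 0.1 * c + 0.05 * sum(prev_in_bunch) + rng.gauss(0.0, 0.01)


def generate(num_samples, bunch_size, lpc_coeffs, seed=0):
    rng = random.Random(seed)
    m = len(lpc_coeffs)
    samples = [0.0] * m
    last_bunch = [0.0] * bunch_size  # excitations fed back to the GRU
    state, gru_calls = 0.0, 0
    for _ in range(0, num_samples, bunch_size):
        # The GRU input grows with S: previous samples and excitations.
        state, c = shared_gru_step(state, samples[-bunch_size:] + last_bunch)
        gru_calls += 1
        bunch = []
        for _k in range(bunch_size):
            e = sample_excitation(c, bunch, rng)
            p = sum(a * s for a, s in zip(lpc_coeffs, reversed(samples[-m:])))
            samples.append(p + e)  # s = p + e, still one sample at a time
            bunch.append(e)
        last_bunch = bunch
    return samples[m:m + num_samples], gru_calls


for s in (1, 2, 4):
    _, calls = generate(160, s, [0.5])
    print(s, calls)  # 160 samples need 160, 80 and 40 GRU steps
```

Only the cheap parts (the per-sample heads and the LPC sum) remain inside the inner loop, which is where the $1/\mathbf{S}$ scaling of the GRU cost comes from.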

- Bit Bunching

LPCNet computes the probability $p(e_{t})$ with a dual FC layer followed by a softmax activation of size 256
- Each softmax output node corresponds to one quantized level of the 8-bit $\mu$-law representation
- Bunched LPCNet introduces bit bunching for the dual FC layer to accelerate inference further, on top of sample bunching
  - Splitting the 8 bits into 2 groups (higher bit/lower bit bunch) yields 2 smaller output layers, which reduces the computational complexity
- Bit bunching does not change the input information of the dual FC layer, so the information needed to predict the probability $p(e_{t})$ is retained
  1. Instead, the higher bit bunch and the lower bit bunch are mapped to a coarse prediction and a fine correction of the excitation respectively
  2. To make the lower bit prediction more effective, the higher bit prediction is passed to it as a conditioning input through an embedding layer
- As a result, this additional conditioning improves the cross-entropy loss of the lower bit bunch and helps in choosing the number of bits allocated to the higher bit bunch
  - The best cross-entropy loss is obtained with the split $\mathbf{B}=(B_{h},B_{l})=(7,4)$
  - $B_{h}, B_{l}$ : number of bits in the higher bit and lower bit bunch respectively

The excitation signal is then generated from the higher and lower bit bunch predictions $(\hat{e}^{h}_{t}, \hat{e}^{l}_{t})$
- That is, $\hat{e}_{t}=\xi\left( \hat{e}_{t}^{h}, \hat{e}_{t}^{l} \right)=2^{B_{l}}\hat{e}_{t}^{h}+\hat{e}_{t}^{l}$
- The audio sample is then computed from the generated excitation using sample/bit bunching, following [Algorithm 1] in the paper
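
The combination function $\xi$ and its inverse amount to simple bit operations. The sketch below is an illustration with hypothetical function names; it also counts output nodes under the assumption that each bunch uses one softmax node per level.

```python
def split_excitation(e, b_low):
    """Split an excitation index into (higher-bit, lower-bit) bunches."""
    return e >> b_low, e & ((1 << b_low) - 1)


def combine_excitation(e_high, e_low, b_low):
    """xi(e_h, e_l) = 2**B_l * e_h + e_l."""
    return (e_high << b_low) + e_low


B_HIGH, B_LOW = 7, 4
print(split_excitation(1032, B_LOW))     # (64, 8)
print(combine_excitation(64, 8, B_LOW))  # 1032
# Output nodes: two bunches versus one softmax over all 11 bits.
print(2 ** B_HIGH + 2 ** B_LOW, 2 ** (B_HIGH + B_LOW))  # 144 2048
```

With $\mathbf{B}=(7,4)$ the two bunches together need 144 output nodes, while a single softmax over the same 11 bits would need 2048.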

Following the original LPCNet, the paper adopts the $\mu$-law quantization algorithm to represent a 16-bit PCM value $x$ ($-32768\leq x\leq 32767$)
- It represents the waveform with $B$ bits as follows:
  (Eq. 1) $y=Q_{B}(x)=V_{m2}+\text{sign}(x)\cdot V_{m2}\cdot \frac{\ln (1+s_{1}|x|)}{\ln (V_{m})}, \,\, x=Q_{B}^{-1}(y)=\text{sign}(u)\cdot s_{2}\cdot\left(\exp \frac{\ln (V_{m})|u|}{V_{m2}}-1\right)$
  (Eq. 2) $\text{where}\,\,\, V_{m}=w_{s}2^{B},\,V_{m2}=2^{B-1},\,u=y-V_{m2},\,s_{1}=\frac{V_{m}-1}{2^{15}},\,s_{2}=\frac{2^{15}}{V_{m} -1}$
  - $w_{s}=1$ gives the standard $\mu$-law with $V_{m}=2^{B}$
- $B=8$ is the usual choice, but bit bunching also allows a larger $B$ to be considered
  1. BUT, when $B>9$, the quantization step size falls below 1 (in PCM units) for $x$ close to 0
    - e.g.) for $B=11$, $Q_{11}(0)=1024$ and $Q_{11}(1)=1032$, so 8 quantization levels cover a single PCM step
  2. As a result, the quantization levels are under-utilized, and the many-to-one mapping from sampled quantization values to PCM values causes a discrepancy
- To solve this problem, the paper adds a factor $w_{s}$ that controls the slope of the mapping function, as in (Eq. 2), so that the quantization step always stays larger than 1
  - For $B=11$, setting $w_{s}$ to 0.08 gives the mapping function shown in the figure below

Mapping function of $\mu$-law quantization
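
The quantizer of (Eq. 1) and (Eq. 2) can be sketched as follows. The function names are hypothetical, and rounding to the nearest level and clipping to $[0, 2^{B}-1]$ are assumptions of this sketch. The encoder includes the $V_{m2}$ offset implied by $u=y-V_{m2}$, which is what makes $Q_{11}(0)=1024$.

```python
import math


def mu_law_params(bits, w_s=1.0):
    v_m = w_s * 2 ** bits   # V_m; w_s = 1 gives the standard mu-law
    v_m2 = 2 ** (bits - 1)  # V_m2, the level that represents x = 0
    return v_m, v_m2, (v_m - 1) / 2 ** 15, 2 ** 15 / (v_m - 1)


def mu_law_encode(x, bits=8, w_s=1.0):
    """Q_B(x): 16-bit PCM value -> quantization level in [0, 2**bits - 1]."""
    v_m, v_m2, s1, _ = mu_law_params(bits, w_s)
    sign = (x > 0) - (x < 0)
    y = v_m2 + sign * v_m2 * math.log(1 + s1 * abs(x)) / math.log(v_m)
    # Rounding and clipping are assumptions of this sketch.
    return min(max(round(y), 0), 2 ** bits - 1)


def mu_law_decode(y, bits=8, w_s=1.0):
    """Inverse mapping: quantization level -> (real-valued) PCM value."""
    v_m, v_m2, _, s2 = mu_law_params(bits, w_s)
    u = y - v_m2
    sign = (u > 0) - (u < 0)
    return sign * s2 * (math.exp(math.log(v_m) * abs(u) / v_m2) - 1)


print(mu_law_encode(0, 11), mu_law_encode(1, 11))  # 1024 1032
print(mu_law_encode(1, 11, w_s=0.08))              # 1025
# Levels 1024..1032 collapse onto very few integer PCM values when w_s = 1.
print(sorted({round(mu_law_decode(y, 11)) for y in range(1024, 1033)}))
```

With $w_{s}=1$ and $B=11$, a one-step change in PCM jumps 8 levels, so most levels near zero are never produced by the encoder. With $w_{s}=0.08$ the same step moves by roughly one level.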

4. Experiments

- Settings

Dataset : English speech dataset
Comparisons : LPCNet

- Results

Complexity
- Compared with the original LPCNet ($\mathbf{S}=1,\mathbf{B}=(8,0)$), Bunched LPCNet with $\mathbf{S}=4, \mathbf{B}=(7,4)$ obtains an overall saving of 54.2%
- That is, Bunched LPCNet runs about 2.19 times faster than the original LPCNet

In addition, sample bunching reaches the same validation loss at lower complexity

Quality
- In terms of DMOS and MOS, Bunched LPCNet shows no large difference from the original LPCNet
