[Paper Review] nVOC-22: A Low Cost Mel Spectrogram Vocoder for Mobile Devices

feVeRin 2024. 5. 29. 09:46

nVOC-22: A Low Cost Mel Spectrogram Vocoder for Mobile Devices

A fully convolutional, non-autoregressive neural vocoder that can run on mobile CPUs/GPUs is needed
nVOC-22
- Combines nearest neighbor resize with separable convolution in the upsampling block to minimize checkerboarding artifacts and support fast upsampling
- Additionally trained within a Generative Adversarial Network framework to achieve stable performance
Paper (ICASSP 2023) : Paper Link

1. Introduction

Speech synthesis is used in various mobile applications such as navigation and translation
- Delivering a more effective service therefore requires reducing speech synthesis latency
  - Mobile inference frameworks such as TensorFlow Lite make hardware-accelerated backends such as GPUs available
- BUT, mel-spectrogram inversion, the vocoder stage of speech synthesis, is too slow to run on mobile devices
- Existing neural vocoders such as WaveGlow and MelGAN have successfully improved inference speed, but they generally run only on desktop-class CPUs/GPUs

-> Hence nVOC-22, a neural vocoder usable on mobile devices, is proposed

nVOC-22
- Designs a vocoder architecture optimized for mobile inference using a parallelizable structure and mobile-friendly operators
- Combines nearest neighbor resize with separable convolution in the upsampling block to minimize checkerboarding artifacts and support fast upsampling
- Additionally trained within a Generative Adversarial Network (GAN) framework to achieve stable performance

< Overview of nVOC-22 >

A neural vocoder for mobile devices with improved upsampling-block inference speed
As a result, it runs faster than real time on mobile devices while achieving synthesis quality on par with existing neural vocoders

2. Method

nVOC-22 generates a 24kHz output waveform from an input mel-spectrogram with 128 bins and a 12.5ms frame shift
- The generator must be designed efficiently enough to run on mobile devices
- It should therefore be parallelizable and use operators that can be accelerated on mobile GPUs through frameworks such as TensorFlow Lite
- Additionally, the parameter count should be reduced for a small model size, so that the model can stay in the L2 cache of a mobile CPU
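The input/output rates above fix how many waveform samples the generator has to emit per mel frame. A minimal arithmetic sketch follows; the function names are hypothetical and only the three constants come from the post.

```python
SAMPLE_RATE_HZ = 24_000   # output waveform rate stated in the post
FRAME_SHIFT_MS = 12.5     # mel frame shift stated in the post
N_MEL_BINS = 128          # mel bins per frame stated in the post


def samples_per_frame(sample_rate_hz: int, frame_shift_ms: float) -> int:
    """Number of waveform samples covered by one mel frame."""
    return round(sample_rate_hz * frame_shift_ms / 1000)


def output_length(n_frames: int) -> int:
    """Waveform length produced from n_frames mel frames."""
    return n_frames * samples_per_frame(SAMPLE_RATE_HZ, FRAME_SHIFT_MS)


print(samples_per_frame(SAMPLE_RATE_HZ, FRAME_SHIFT_MS))  # 300
print(output_length(80))  # 24000, i.e. 80 frames cover one second of audio
```

In other words, a mel-spectrogram of shape (T, 128) has to be expanded by a total factor of 300 along the time axis.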

- Generator

The nVOC-22 generator uses upsampling and residual stacks similar to MelGAN to convert the input mel-spectrogram into an audio waveform without recurrence or autoregression
- The network progressively upsamples the number of timesteps while reducing the number of channels; the last layer generates 2 samples at once, which are then flattened into the 24kHz output audio signal
- BUT, the upsampling step can produce checkerboarding artifacts in the linear spectrogram, so the following variations are considered to address the problem
  1. Transposed convolution
  2. A convolution that increases the number of channels, followed by a reshape to desired timesteps $\times$ channels
  3. A bilinear image resize operation followed by a convolution to the desired number of channels
  4. A nearest neighbor image resize operator followed by a depthwise separable convolution and a Leaky ReLU with $\alpha=0.2$
- Among these options, the paper adopts nearest neighbor resize with depthwise separable convolution
  - It gives the best trade-off between inference speed and synthesis quality
- Each upsampling block is followed by a stack of WaveNet-style residual blocks, which supports a higher degree of inter-op parallelism
  - Empirically, a stack of 3 residual blocks is used per upsampling block
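To make the chosen upsampling variant concrete, here is a minimal pure-Python sketch of a nearest neighbor resize followed by a depthwise separable convolution (depthwise, then pointwise) and a Leaky ReLU. It is an illustration under assumptions, not the paper's implementation: signals are plain lists indexed as `x[channel][time]`, kernels have odd length with zero "same" padding, and all function names, kernel sizes, and weights are hypothetical.

```python
def nearest_resize(x, factor):
    """Nearest neighbor resize along time: repeat each timestep `factor` times."""
    return [[v for v in ch for _ in range(factor)] for ch in x]


def depthwise_conv(x, kernels):
    """Depthwise 1-D convolution: one (odd-length) kernel per channel, zero padded."""
    out = []
    for ch, k in zip(x, kernels):
        pad = len(k) // 2
        padded = [0.0] * pad + ch + [0.0] * pad
        out.append([sum(k[j] * padded[t + j] for j in range(len(k)))
                    for t in range(len(ch))])
    return out


def pointwise_conv(x, weights):
    """Pointwise (kernel size 1) convolution: weights[o][i] mixes input channel i into output o."""
    n_t = len(x[0])
    return [[sum(w_o[i] * x[i][t] for i in range(len(x))) for t in range(n_t)]
            for w_o in weights]


def leaky_relu(x, alpha=0.2):
    """Leaky ReLU with the slope alpha=0.2 mentioned in the post."""
    return [[v if v >= 0 else alpha * v for v in ch] for ch in x]


def upsample_block(x, factor, dw_kernels, pw_weights):
    """Resize -> depthwise conv -> pointwise conv -> Leaky ReLU."""
    y = nearest_resize(x, factor)
    y = depthwise_conv(y, dw_kernels)
    y = pointwise_conv(y, pw_weights)
    return leaky_relu(y, alpha=0.2)


# Toy input: 2 channels x 3 timesteps, upsampled 2x and reduced to 1 channel.
x = [[1.0, 2.0, 3.0], [0.0, 4.0, 1.0]]
identity = [[0.0, 1.0, 0.0], [0.0, 1.0, 0.0]]   # depthwise kernels (hypothetical)
mix = [[1.0, -1.0]]                              # pointwise weights (hypothetical)
y = upsample_block(x, 2, identity, mix)
print(len(y), len(y[0]))  # 1 6
```

Because the resize only repeats samples, every output timestep passes through the same convolution weights, which is the usual argument for why resize-then-convolve avoids the uneven kernel overlap behind checkerboard patterns in transposed convolutions.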

Generator Architecture

- Discriminator

The discriminator is an ensemble of conditioned random window discriminator blocks
- As shown in the figure below, the discriminator runs at 2 scales: the full 24kHz signal and a 12kHz signal downsampled via convolution
  1. 6 different random windows are selected from the audio signal and aligned to the start of a mel frame
  2. Each window is then passed to a random window discriminator together with the corresponding ground-truth mel-spectrogram conditioning
    - The number of random windows is chosen empirically based on the number of input mel frames
- Structurally, the Random Window Discriminator consists of DBlocks and Conditional DBlocks, with spectral normalization applied to every convolution layer
- The ensemble output is the sum of the individual Random Window Discriminator outputs
  - Unconditional Random Window Discriminators and auxiliary losses are not used
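The window selection and the summed ensemble output can be sketched as follows. This is a rough illustration under assumptions: the window length (`window_frames=8`), the one-window-per-discriminator pairing, and the function names are hypothetical, and the discriminators are stand-in callables rather than real networks. Only the 6 windows, the frame alignment, and the summation come from the post.

```python
import random

HOP = 300  # samples per mel frame at 24 kHz with a 12.5 ms frame shift


def pick_aligned_windows(audio, mel, n_windows=6, window_frames=8, rng=None):
    """Pick random audio windows that start on mel-frame boundaries.

    Returns (audio_window, mel_window) pairs so each window carries its
    matching ground-truth mel conditioning. `window_frames` is hypothetical.
    """
    rng = rng or random.Random()
    n_frames = len(mel)
    if n_frames < window_frames or len(audio) < n_frames * HOP:
        raise ValueError("input too short for the requested window size")
    pairs = []
    for _ in range(n_windows):
        start_frame = rng.randint(0, n_frames - window_frames)  # inclusive bounds
        start_sample = start_frame * HOP                        # frame-aligned start
        pairs.append((audio[start_sample:start_sample + window_frames * HOP],
                      mel[start_frame:start_frame + window_frames]))
    return pairs


def ensemble_score(discriminators, pairs):
    """Sum the outputs of the individual window discriminators."""
    return sum(d(a, m) for d, (a, m) in zip(discriminators, pairs))


# Toy usage: 40 mel frames of 128 bins and the matching 12000 audio samples.
audio = [0.0] * (40 * HOP)
mel = [[0.0] * 128 for _ in range(40)]
pairs = pick_aligned_windows(audio, mel, rng=random.Random(0))
toy_discriminators = [lambda a, m: 1.0] * 6   # stand-ins for real networks
print(ensemble_score(toy_discriminators, pairs))  # 6.0
```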

Discriminator Architecture

- Training Loss and Metrics

nVOC-22 is trained as a GAN and adopts the Hinge objective, which supports faster training
- The discriminator is optimized by minimizing the following Hinge loss:
  (Eq. 1) $\mathcal{L}_{D}=\frac{1}{n}\sum_{i=1}^{n}\left[\max(0,D(x_{g})+1)\right]+\frac{1}{n}\sum_{i=1}^{n}\left[\max(0,1-D(x_{r}))\right]$
  - $D(x_{g})$ : discriminator network output for the generated signal, $D(x_{r})$ : discriminator output for the real signal
  - $n$ : batch size
- Similarly, the generator is trained by minimizing the following Hinge GAN objective:
  (Eq. 2) $\mathcal{L}_{G}=-\frac{1}{n}\sum_{i=1}^{n}D(x_{g})$
- In addition, spectral convergence and log STFT magnitude distance are used to make training easier to monitor
  1. Spectral convergence:
    (Eq. 3) $\mathcal{SC}=\frac{|| \,|\text{STFT}(x_{r})|-|\text{STFT}(x_{g})| \,||_{F}}{||\, |\text{STFT}(x_{r})| \, ||_{F}}$
    - $|| \cdot ||_{F}$ : Frobenius norm, $x_{r}, x_{g}$ : real/generated signal, respectively
  2. Log STFT magnitude distance:
    (Eq. 4) $\mathcal{D}=\frac{|| \, \log|\text{STFT}(x_{r})|-\log |\text{STFT}(x_{g})|\, ||_{1}}{N}$
    - $||\cdot||_{1}$ : $L_{1}$ norm, $N$ : number of STFT elements
  3. Empirically, over 1.5M training steps, spectral convergence decreases from 1.02 to 0.36 and log STFT magnitude distance decreases from 1.83 to 0.83
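The objectives in Eq. 1 and Eq. 2 and the monitoring metrics in Eq. 3 and Eq. 4 can be sketched in plain Python. This is a toy illustration, not the training code: discriminator outputs are given as lists of scalars, the STFT uses a naive DFT with a rectangular window and tiny `n_fft`/`hop` values, and the `eps` added inside the logarithm is an assumption to avoid `log(0)`. None of these settings are from the paper.

```python
import cmath
import math


def hinge_d_loss(d_real, d_fake):
    """Eq. 1: hinge loss for the discriminator over a batch of scalar outputs."""
    fake_term = sum(max(0.0, d + 1.0) for d in d_fake) / len(d_fake)
    real_term = sum(max(0.0, 1.0 - d) for d in d_real) / len(d_real)
    return fake_term + real_term


def hinge_g_loss(d_fake):
    """Eq. 2: hinge GAN objective for the generator."""
    return -sum(d_fake) / len(d_fake)


def stft_mag(x, n_fft=8, hop=4):
    """Magnitude STFT via a naive DFT (rectangular window, toy sizes)."""
    frames = [x[s:s + n_fft] for s in range(0, len(x) - n_fft + 1, hop)]
    return [[abs(sum(v * cmath.exp(-2j * math.pi * k * t / n_fft)
                     for t, v in enumerate(frame)))
             for k in range(n_fft // 2 + 1)]
            for frame in frames]


def spectral_convergence(x_r, x_g):
    """Eq. 3: Frobenius norm of the magnitude difference over that of the real signal."""
    s_r, s_g = stft_mag(x_r), stft_mag(x_g)
    num = math.sqrt(sum((a - b) ** 2
                        for fr, fg in zip(s_r, s_g) for a, b in zip(fr, fg)))
    den = math.sqrt(sum(a ** 2 for fr in s_r for a in fr))
    return num / den


def log_stft_distance(x_r, x_g, eps=1e-7):
    """Eq. 4: mean absolute difference of log magnitudes (eps is an assumption)."""
    s_r, s_g = stft_mag(x_r), stft_mag(x_g)
    diffs = [abs(math.log(a + eps) - math.log(b + eps))
             for fr, fg in zip(s_r, s_g) for a, b in zip(fr, fg)]
    return sum(diffs) / len(diffs)


print(hinge_d_loss([2.0, 0.5], [-2.0, 0.0]))  # 0.75
print(hinge_g_loss([-2.0, 0.0]))              # 1.0

real = [math.sin(0.3 * t) for t in range(32)]
print(spectral_convergence(real, real))       # 0.0
print(log_stft_distance(real, real))          # 0.0
```

Both metrics are zero for identical signals and grow as the generated spectrum drifts from the real one, which is why falling values are a convenient training-progress signal.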

3. Experiments

- Settings

Dataset : Internal dataset
Comparisons : WaveRNN

- Results

At a 24kHz sample rate, nVOC-22 generates audio 20 times faster than real time on CPU and 65 times faster on GPU
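As a quick arithmetic check of what these speedups mean in practice, the sketch below converts a real-time factor into synthesis time and sample throughput. The helper names are hypothetical; only the 24kHz rate and the 20x/65x factors come from the post.

```python
SAMPLE_RATE_HZ = 24_000


def synthesis_time_ms(audio_seconds: float, speedup: float) -> float:
    """Time needed to synthesize `audio_seconds` of audio at `speedup` x real time."""
    return audio_seconds * 1000.0 / speedup


def samples_per_second(speedup: float) -> float:
    """Waveform samples generated per wall-clock second."""
    return SAMPLE_RATE_HZ * speedup


print(synthesis_time_ms(1.0, 20))            # 50.0 (CPU)
print(round(synthesis_time_ms(1.0, 65), 1))  # 15.4 (GPU)
print(samples_per_second(20))                # 480000
```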

Comparison of sample synthesis speed

In terms of synthesis quality, nVOC-22 also outperforms WaveRNN

Comparison of synthesis quality
