[Paper Review] AutoVocoder: Fast Waveform Generation from a Learned Speech Representation Using Differentiable Digital Signal Processing

feVeRin 2024. 3. 27. 09:51

AutoVocoder: Fast Waveform Generation from a Learned Speech Representation Using Differentiable Digital Signal Processing

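A mel-spectrogram can be extracted easily from a waveform, but a vocoder that generates a waveform from a mel-spectrogram requires substantial computation
AutoVocoder
- Moves away from the conventional mel-spectrogram and generates the waveform with a differentiable implementation of the inverse STFT
- As a result, it runs more than 14 times faster than an existing neural vocoder (HiFi-GAN in the experiments)
Paper (ICASSP 2023) : Paper Link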

1. Introduction

Text-to-Speech (TTS) generally works by mapping the input text to an intermediate representation and then converting that intermediate representation into a waveform
- The mel-spectrogram is the most common choice of intermediate representation
  - BUT, the temporal sampling rate (frame rate) of a mel-spectrogram is much lower than that of the waveform, which makes waveform generation from it difficult
- Neural vocoders were introduced to address this difficulty
  - Autoregressive models achieve natural-sounding quality but suffer from high computational cost
  - The cost can be reduced by incorporating signal processing, as in LPCNet, or by using a non-autoregressive design, as in HiFi-GAN
- Differentiable Digital Signal Processing (DDSP) can make vocoding more efficient
  - So far, signal processing has mainly been applied to the one-off task of encoding the mel-spectrogram
  - BUT, applying signal processing to waveform synthesis instead can speed up generation further

-> AutoVocoder is therefore proposed to combine the strengths of DDSP and neural vocoders for efficient synthesis

AutoVocoder
- A mel-spectrogram represents only the spectral magnitude and discards the phase
- AutoVocoder instead uses a learned representation from which the waveform can be synthesized efficiently with the inverse STFT (iSTFT) and overlap-add

< Overview of AutoVocoder >

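Combines fast, efficient DDSP with a neural network that is well suited to producing an informative representation
Generates the waveform with a differentiable implementation of the iSTFT from a frame-based representation that replaces the conventional mel-spectrogram
As a result, it is substantially faster than existing methods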

2. Method

AutoVocoder is an autoencoder trained on speech waveforms
1. It learns a representation that replaces the conventional signal-processing-based mel-spectrogram, and
2. It aims to decode the learned representation into a waveform at low computational cost

- Encoder and Decoder

The AutoVocoder encoder works as follows:
1. A differentiable implementation of the STFT converts the time-domain signal to the frequency domain
2. Four spectral components are derived from the resulting complex spectrum:
  - Magnitude, Phase, Real, Imaginary
3. The four spectral components are stacked, and each one is treated as a channel
4. The stack is then passed to a purely convolutional residual network built from basic blocks
  - Each basic block consists of two 2D convolution layers of width 3, 2D Batch Norm, and ReLU, and
  - A residual connection sums the block's input and output
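
Steps 2 and 3 above can be illustrated for a single frame. The following is a minimal sketch, not the paper's implementation: it uses a naive DFT from the standard library on a toy 16-sample frame, omits the analysis window and the framing over time, and the names `dft_frame` and `spectral_channels` are made up for this example.

```python
import cmath
import math

def dft_frame(frame):
    """Naive one-sided DFT of a real frame: len(frame) // 2 + 1 complex bins."""
    n = len(frame)
    return [
        sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(frame))
        for k in range(n // 2 + 1)
    ]

def spectral_channels(spectrum):
    """Stack magnitude, phase, real and imaginary parts as four channels."""
    return [
        [abs(c) for c in spectrum],          # magnitude
        [cmath.phase(c) for c in spectrum],  # phase in radians
        [c.real for c in spectrum],          # real part
        [c.imag for c in spectrum],          # imaginary part
    ]

# Toy frame (assumption): a sinusoid with exactly 2 cycles in 16 samples.
frame = [math.sin(2 * math.pi * 2 * t / 16) for t in range(16)]
channels = spectral_channels(dft_frame(frame))
print(len(channels), len(channels[0]))  # channels x frequency bins
```

The four channels are redundant on purpose: magnitude and phase describe the same complex number as the real and imaginary parts, just in polar rather than Cartesian form.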
The Residual Net used in the AutoVocoder architecture consists of 11 basic blocks
- The first 5 blocks have 4 input/output channels, the single middle block has 4 input channels and 1 output channel, and the last 5 blocks have 1 input/output channel
- After the Residual Net, the single-channel output is fed to a single linear layer
  - This layer reduces the per-timestep dimensionality from $(\textrm{window size})/2+1$ to the representation size
- The reduced representation size is the dimensionality of a single frame of the learned representation, comparable to the frequency dimension of a typical mel-spectrogram
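
The channel layout and the linear layer's input size can be written down as plain arithmetic. This is only a bookkeeping sketch: `WINDOW_SIZE` and `REPRESENTATION_SIZE` are assumed values chosen for illustration (this review does not state them), and `channel_plan` and `frequency_bins` are hypothetical helper names.

```python
def channel_plan(num_blocks=11, wide=4, narrow=1):
    """(in_channels, out_channels) for each basic block of the encoder."""
    half = num_blocks // 2
    # 5 blocks at 4 -> 4, one block at 4 -> 1, 5 blocks at 1 -> 1
    return [(wide, wide)] * half + [(wide, narrow)] + [(narrow, narrow)] * half

def frequency_bins(window_size):
    """Per-timestep input size of the linear layer: window_size / 2 + 1."""
    return window_size // 2 + 1

WINDOW_SIZE = 1024         # assumption, not taken from the review
REPRESENTATION_SIZE = 128  # assumption, not taken from the review

plan = channel_plan()
for index, (c_in, c_out) in enumerate(plan, start=1):
    print(f"block {index:2d}: {c_in} -> {c_out}")
print(frequency_bins(WINDOW_SIZE), "->", REPRESENTATION_SIZE)
```

Under these assumed sizes the linear layer would map each timestep from 513 values to 128, which is in the same range as the number of bands in a typical mel-spectrogram.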
The decoder architecture mirrors the encoder in reverse order
- Consequently, AutoVocoder uses no autoregressive component at all
- All frames are processed by the network at once, and the subsequent overlap-add that produces the waveform is carried out by the differentiable iSTFT
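
To make the final iSTFT and overlap-add step concrete, here is a small round trip written with the standard library only. It is a sketch under stated assumptions: a periodic Hann analysis window with 50% overlap, no synthesis window, a naive O(n^2) DFT, and toy sizes (`N_FFT = 16`, `HOP = 8`). In AutoVocoder the complex frames would come from the decoder network rather than from an STFT of the same signal, and a real implementation would rely on a framework's differentiable STFT/iSTFT operations instead of these hand-written loops.

```python
import cmath
import math

def hann(n):
    """Periodic Hann window; copies shifted by n / 2 sum to exactly 1."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / n) for i in range(n)]

def stft(signal, n_fft, hop):
    """Window each frame and take its full complex DFT."""
    window = hann(n_fft)
    frames = []
    for start in range(0, len(signal) - n_fft + 1, hop):
        seg = [signal[start + i] * window[i] for i in range(n_fft)]
        frames.append([
            sum(seg[t] * cmath.exp(-2j * math.pi * k * t / n_fft) for t in range(n_fft))
            for k in range(n_fft)
        ])
    return frames

def istft(frames, n_fft, hop):
    """Inverse DFT of every frame, then overlap-add into one waveform."""
    out = [0.0] * (hop * (len(frames) - 1) + n_fft)
    for f, spec in enumerate(frames):
        for t in range(n_fft):
            sample = sum(spec[k] * cmath.exp(2j * math.pi * k * t / n_fft)
                         for k in range(n_fft)) / n_fft
            out[f * hop + t] += sample.real  # overlap-add
    return out

N_FFT, HOP = 16, 8
signal = [math.sin(0.3 * t) + 0.5 * math.cos(0.11 * t) for t in range(64)]
recon = istft(stft(signal, N_FFT, HOP), N_FFT, HOP)
# Samples covered by two windows are recovered; the first and last HOP samples are not.
max_err = max(abs(a - b) for a, b in zip(signal[HOP:-HOP], recon[HOP:-HOP]))
print(max_err < 1e-9)
```

Every operation here is a sum or a product, which is why the same pipeline can sit at the end of a network and pass gradients back to it, and why all frames can be turned into audio in one pass.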

AutoVocoder Architecture (the dashed box is the Decoder)

- Training Regime

The model is trained as a denoising autoencoder
- Dropout is applied to the embedding during training to improve the robustness of the decoder
- AutoVocoder is trained with the HiFi-GAN losses:
  - A mel-spectrogram loss and two adversarial losses (multi-scale / multi-period discriminators)
- A time-domain loss is added for phase reconstruction
  - The time-domain loss is computed as the per-sample squared error; if this term is weighted too low, highly audible phase artifacts appear
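
A short sketch of why a time-domain term helps with phase. Averaging the per-sample squared errors (rather than summing them) and the `td_weight` parameter are assumptions of this example; the review gives neither the reduction nor the actual weight.

```python
def time_domain_loss(target, predicted):
    """Mean of per-sample squared errors between two equal-length waveforms."""
    if len(target) != len(predicted):
        raise ValueError("waveforms must have the same length")
    return sum((t - p) ** 2 for t, p in zip(target, predicted)) / len(target)

def total_loss(hifigan_loss, td_loss, td_weight):
    """Hypothetical combination: HiFi-GAN losses plus a weighted time-domain term."""
    return hifigan_loss + td_weight * td_loss

target = [0.0, 0.5, 1.0, 0.5]
inverted = [-x for x in target]  # same magnitude spectrum, opposite phase

print(time_domain_loss(target, target))
print(time_domain_loss(target, inverted))
```

The inverted waveform has exactly the same magnitude spectrum as the target, so a magnitude-only loss cannot tell them apart, while the time-domain term penalizes it heavily.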

- Redundant Representations of Complex Numbers

Providing the encoder with a redundant representation of the complex spectrogram lets the model learn how to represent magnitude and phase well
- In AutoVocoder, when the magnitude spectrogram is provided, the model represents phase efficiently
- For the decoder, three representations of the complex spectrogram can be considered
  1. With Cartesian or Polar output, the network produces two output channels: real/imaginary or magnitude/phase
  2. Alternatively, the network can produce all four output channels, after which the Cartesian mean of the two complex forms is taken
- Examining the ratio between the magnitudes of the Cartesian and Polar outputs during autoencoding showed that phase is modeled more easily in Cartesian form
  - In contrast, using all four output channels or only the polar output degraded quality
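
The four-channel option can be sketched for a single frequency bin. This example assumes that "Cartesian mean" means converting the polar prediction to a complex number and averaging it with the Cartesian prediction; the function name `cartesian_mean` is made up for illustration.

```python
import cmath
import math

def cartesian_mean(real, imag, magnitude, phase):
    """Average the Cartesian and polar predictions of one bin as complex numbers."""
    from_cartesian = complex(real, imag)
    from_polar = cmath.rect(magnitude, phase)  # magnitude * e^(j * phase)
    return (from_cartesian + from_polar) / 2

# Both heads agree on 3 + 4j: the mean is that same value.
agree = cartesian_mean(3.0, 4.0, 5.0, cmath.phase(complex(3.0, 4.0)))

# The heads disagree (1 + 0j versus 0 + 1j): the mean lies halfway between them.
disagree = cartesian_mean(1.0, 0.0, 1.0, math.pi / 2)
print(agree, disagree)
```

When the two heads disagree on phase, the averaged bin has a smaller magnitude than either prediction, which is one plausible way for the four-channel output to lose quality.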

3. Experiments

- Settings

Dataset : LJSpeech
Comparisons : Griffin-Lim, HiFi-GAN

- Results

Listening Test
- In a MUSHRA evaluation of the synthesized speech, Griffin-Lim scores lowest in quality
- AutoVocoder generates speech of quality comparable to HiFi-GAN

Computational Cost
- In synthesis speed, AutoVocoder achieves the fastest real-time factor among the compared models
  - Synthesis is 5 times faster than Griffin-Lim and 14 times faster than HiFi-GAN
- In particular, AutoVocoder is substantially faster than autoregressive models such as WaveRNN
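
For reference, this is how a speedup like the ones above follows from real-time factors. The sketch assumes the common definition where the real-time factor is synthesis time divided by audio duration (lower is faster); the timings are invented for illustration and are not measurements from the paper.

```python
def real_time_factor(synthesis_seconds, audio_seconds):
    """Seconds of compute per second of audio (assumed definition; lower is faster)."""
    return synthesis_seconds / audio_seconds

def speedup(baseline_rtf, model_rtf):
    """How many times faster the model is than the baseline."""
    return baseline_rtf / model_rtf

audio_seconds = 10.0                                  # hypothetical clip length
baseline_rtf = real_time_factor(1.4, audio_seconds)   # hypothetical baseline timing
model_rtf = real_time_factor(0.1, audio_seconds)      # hypothetical faster model
print(round(speedup(baseline_rtf, model_rtf), 2))
```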
