[Paper Review] DiffWave: A Versatile Diffusion Model for Audio Synthesis

feVeRin 2024. 2. 11. 11:34

DiffWave: A Versatile Diffusion Model for Audio Synthesis

Diffusion probabilistic models can be used for both conditional and unconditional waveform generation
DiffWave
- A non-autoregressive model that converts a white noise signal into a waveform through a Markov chain
  - Trained by optimizing a variational bound on the data likelihood
- Applicable to neural vocoding conditioned on mel-spectrograms, class-conditional generation, and unconditional generation
Paper (ICLR 2021) : Paper Link

1. Introduction

Most existing waveform models synthesize audio using an informative local conditioner (mel-spectrogram / aligned linguistic features)
- Without such a conditioner, i.e. in the unconditional setting, autoregressive models often produce made-up word-like sounds or low-quality samples
- A diffusion probabilistic model uses a Markov chain to gradually transform a simple distribution, such as an isotropic Gaussian, into a complex data distribution
  - Because the data likelihood is intractable, the diffusion model is trained by optimizing the variational lower bound (ELBO)
  - A parameterization connected to denoising score matching has also shown promising results
- In particular, a diffusion model obtains whitened latents from the training data through a diffusion process that has no learnable parameters
  - Therefore, unlike a VAE or GAN, no additional network (e.g., an encoder or a discriminator) is needed for training
  -> This avoids the posterior collapse and mode collapse issues that stem from jointly training two networks, which is favorable for high-fidelity synthesis

-> DiffWave is therefore proposed, applying a diffusion model to raw audio synthesis

DiffWave
- Its non-autoregressive structure allows high-dimensional waveforms to be synthesized in parallel
- Unlike flow-based models, which must maintain a bijection between latents and data, it imposes no architectural constraints and is therefore flexible
  - This makes it possible to build a small vocoder that still synthesizes high-fidelity speech
- Uses a single ELBO-based training objective without any auxiliary losses

< Overview of DiffWave >

Uses a feed-forward, bidirectional dilated convolution architecture based on WaveNet
Performs well in both conditional and unconditional waveform generation
The small DiffWave has 2.64M parameters and synthesizes speech 5x faster than real-time on a GPU

Diffusion / reverse process of the diffusion probabilistic model

2. Diffusion Probabilistic Models

Let $q_{data}(x_{0})$ be the data distribution on $\mathbb{R}^{L}$
- Let $x_{t} \in \mathbb{R}^{L}, \,\, t = 0,1,..., T$ be a sequence of variables with the same dimension, where
  - $t$ : diffusion step index
  - $L$ : data dimension
- A diffusion model of $T$ steps consists of a diffusion process and a reverse process
Diffusion Process
- A fixed Markov chain from the data $x_{0}$ to the latent variable $x_{T}$:
  (Eq. 1) $q(x_{1},..., x_{T}|x_{0})= \prod_{t=1}^{T}q(x_{t}|x_{t-1})$
  - Each $q(x_{t}|x_{t-1})$ is fixed to $\mathcal{N}(x_{t}; \sqrt{1-\beta_{t}}x_{t-1},\beta_{t}I)$ for a small positive constant $\beta_{t}$
- In effect, $q(x_{t}|x_{t-1})$ adds a small amount of Gaussian noise to the distribution of $x_{t-1}$
- Overall, the diffusion process gradually converts the data $x_{0}$ into the whitened latent $x_{T}$ according to the variance schedule $\beta_{1}, ..., \beta_{T}$
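
The diffusion process of (Eq. 1) can be sketched in a few lines of plain Python. This is only an illustration: the sine-wave signal, the number of steps and the linear schedule are assumptions made for the example, and the helper names `diffusion_step` and `run_diffusion` are hypothetical.

```python
import math
import random

def diffusion_step(x_prev, beta_t, rng):
    """Sample x_t ~ N(sqrt(1 - beta_t) * x_prev, beta_t * I) for one step."""
    scale = math.sqrt(1.0 - beta_t)
    std = math.sqrt(beta_t)
    return [scale * x + std * rng.gauss(0.0, 1.0) for x in x_prev]

def run_diffusion(x0, betas, rng):
    """Apply the fixed Markov chain of (Eq. 1) step by step and return x_T."""
    x = list(x0)
    for beta_t in betas:
        x = diffusion_step(x, beta_t, rng)
    return x

rng = random.Random(0)
# Assumed toy signal and linear variance schedule, chosen only for illustration.
x0 = [math.sin(2 * math.pi * i / 50) for i in range(2000)]
T = 200
betas = [1e-4 + (0.05 - 1e-4) * t / (T - 1) for t in range(T)]
xT = run_diffusion(x0, betas, rng)

# Empirical statistics of the latent; they should be close to those of N(0, 1).
mean = sum(xT) / len(xT)
var = sum((v - mean) ** 2 for v in xT) / len(xT)
```

With this assumed schedule the product of all $1-\beta_{t}$ is close to zero, so almost nothing of $x_{0}$ survives in $x_{T}$ and the latent is approximately white.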
Reverse Process
- A Markov chain from $x_{T}$ to $x_{0}$ parameterized by $\theta$:
  (Eq. 2) $p_{latent}(x_{T}) =\mathcal{N}(0, I), \,\, p_{\theta}(x_{0},...,x_{T-1}|x_{T})=\prod_{t=1}^{T}p_{\theta}(x_{t-1}|x_{t}) $
  - $p_{latent}(x_{T})$ : isotropic Gaussian
- The transition probability $p_{\theta}(x_{t-1}|x_{t})$ is parameterized as $\mathcal{N}(x_{t-1};\mu_{\theta}(x_{t},t),\sigma_{\theta}(x_{t},t)^{2}I)$ with shared parameters $\theta$
  - Both $\mu_{\theta}$ and $\sigma_{\theta}$ take two inputs: the diffusion step $t\in \mathbb{N}$ and the variable $x_{t}\in \mathbb{R}^{L}$
  - $\mu_{\theta}$ outputs an $L$-dimensional vector as the mean, and $\sigma_{\theta}$ outputs a real number as the standard deviation
- The goal of $p_{\theta}(x_{t-1}|x_{t})$ is to remove the Gaussian noise added in the diffusion process
Sampling
- Given the reverse process, generation proceeds by
  1. first sampling $x_{T} \sim \mathcal{N}(0,I)$, and then
  2. sampling $x_{t-1} \sim p_{\theta}(x_{t-1}|x_{t})$ for $t = T, T-1, ..., 1$
- The output $x_{0}$ is the sampled data
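
The two sampling steps above amount to a simple loop. The sketch below assumes the mean and standard deviation are available as callables; `toy_mu` and `toy_sigma` are hypothetical stand-ins for a trained model and are not taken from the paper.

```python
import random

def sample(mu_fn, sigma_fn, T, L, rng):
    """Draw x_T ~ N(0, I), then x_{t-1} ~ p_theta(x_{t-1} | x_t) for t = T, ..., 1."""
    x = [rng.gauss(0.0, 1.0) for _ in range(L)]
    for t in range(T, 0, -1):
        mu = mu_fn(x, t)        # L-dimensional mean
        sigma = sigma_fn(x, t)  # scalar standard deviation
        x = [m + sigma * rng.gauss(0.0, 1.0) for m in mu]
    return x

# Hypothetical stand-ins for a trained network, for illustration only.
def toy_mu(x, t):
    return [0.9 * v for v in x]

def toy_sigma(x, t):
    return 0.1

x0 = sample(toy_mu, toy_sigma, T=50, L=16, rng=random.Random(0))
```

The loop runs $T$ times regardless of $L$, and each iteration produces all $L$ samples at once.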
Training
- The likelihood $p_{\theta}(x_{0}) = \int p_{\theta}(x_{0},...,x_{T-1}|x_{T})\cdot p_{latent}(x_{T}) dx_{1:T}$ is intractable
- The diffusion model is therefore trained by maximizing the variational lower bound (ELBO):
  (Eq. 3) $\mathbb{E}_{q_{data}(x_{0})}\log p_{\theta}(x_{0}) = \mathbb{E}_{q_{data}(x_{0})} \log \mathbb{E}_{q(x_{1},...,x_{T}|x_{0})} \left[ \frac{p_{\theta}(x_{0},...,x_{T-1}|x_{T})\times p_{latent}(x_{T})}{q(x_{1},..., x_{T}|x_{0})}\right]$
  $\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\, \geq \mathbb{E}_{q(x_{0},...,x_{T})} \log \frac{p_{\theta}(x_{0},...,x_{T-1}|x_{T})\times p_{latent}(x_{T})}{q(x_{1},...,x_{T}|x_{0})} := ELBO$
- Under a certain parameterization, the ELBO of the diffusion model can be computed in closed form
  - The parameterization is motivated by its connection to denoising score matching with Langevin dynamics
  - This speeds up computation and avoids high-variance Monte Carlo estimates
- For the parameterization,
  1. First define some constants based on the variance schedule $\{ \beta_{t} \}_{t=1}^{T}$ of the diffusion process:
    (Eq. 4) $\alpha_{t}=1-\beta_{t}, \bar{\alpha}_{t}=\prod_{s=1}^{t}\alpha_{s}, \tilde{\beta}_{t}=\frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_{t}}\beta_{t}\,\,\, \textup{for} \,\, t>1 \,\, \textup{and} \,\, \tilde{\beta}_{1}=\beta_{1}$
  2. The parameterization of $\mu_{\theta}, \sigma_{\theta}$ is then:
    (Eq. 5) $\mu_{\theta}(x_{t},t) = \frac{1}{\sqrt{\alpha_{t}}} \left( x_{t} - \frac{\beta_{t}}{\sqrt{1-\bar{\alpha}_{t}}}\epsilon_{\theta} (x_{t},t)\right), \,\, \textup{and} \,\, \sigma_{\theta}(x_{t},t) = \tilde{\beta}_{t}^{\frac{1}{2}}$
    - $\epsilon_{\theta} : \mathbb{R}^{L} \times \mathbb{N} \rightarrow \mathbb{R}^{L}$ : a neural network that takes $x_{t}$ and the diffusion step $t$ as inputs
    - $\sigma_{\theta}(x_{t}, t)$ : under this parameterization, fixed to the constant $\tilde{\beta}_{t}^{\frac{1}{2}}$ for every step $t$
  3. The following proposition then gives a closed-form expression for the ELBO:
    
    [Proposition. 1] Suppose a fixed schedule $\{ \beta_{t} \}_{t=1}^{T}$ is given, and let $\epsilon \sim \mathcal{N}(0,I)$ and $x_{0} \sim q_{data}$. Then, under the parameterization of (Eq. 5), for some constants $c, \kappa_{t}$:
    (Eq. 6) $-ELBO = c + \sum_{t=1}^{T}\kappa_{t}\mathbb{E}_{x_{0},\epsilon}|| \epsilon-\epsilon_{\theta}(\sqrt{\bar{\alpha}_{t}}x_{0} + \sqrt{1-\bar{\alpha}_{t}}\epsilon, t)||_{2}^{2}$
    where $\kappa_{t}= \frac{\beta_{t}}{2\alpha_{t}(1-\bar{\alpha}_{t-1})}$ for $t>1$, and $\kappa_{1} = \frac{1}{2\alpha_{1}}$.
    
    - Here $c$ is irrelevant to the optimization. The closed form follows from expanding the ELBO into a sum of KL divergences between tractable Gaussian distributions, each of which has a closed-form expression
  4. In addition, minimizing the following unweighted variant of the ELBO improves generation quality, and the paper uses it as the training objective:
    (Eq. 7) $\min_{\theta} L_{unweighted}(\theta) = \mathbb{E}_{x_{0},\epsilon,t}|| \epsilon - \epsilon_{\theta}(\sqrt{\bar{\alpha}_{t}}x_{0} + \sqrt{1-\bar{\alpha}_{t}}\epsilon, t)||_{2}^{2}$
    - $t$ is drawn uniformly from $1, ..., T$
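
To make (Eq. 4), (Eq. 5) and (Eq. 7) concrete, here is a minimal plain-Python sketch. It assumes a linear variance schedule and uses `zero_eps`, a hypothetical untrained predictor that always returns zeros, in place of the real network $\epsilon_{\theta}$. The function names are illustrative, not from the paper.

```python
import math
import random

def schedule_constants(betas):
    """Return alpha, alpha_bar and beta_tilde of (Eq. 4); list index 0 is t = 1."""
    alphas = [1.0 - b for b in betas]
    alpha_bars, running = [], 1.0
    for a in alphas:
        running *= a
        alpha_bars.append(running)
    beta_tildes = [betas[0]]  # beta_tilde_1 = beta_1
    for i in range(1, len(betas)):
        ratio = (1.0 - alpha_bars[i - 1]) / (1.0 - alpha_bars[i])
        beta_tildes.append(ratio * betas[i])
    return alphas, alpha_bars, beta_tildes

def mu_theta(eps_fn, x_t, t, betas, alphas, alpha_bars):
    """Mean of p_theta(x_{t-1} | x_t) under the parameterization of (Eq. 5)."""
    coef = betas[t - 1] / math.sqrt(1.0 - alpha_bars[t - 1])
    pred = eps_fn(x_t, t)
    return [(x - coef * p) / math.sqrt(alphas[t - 1]) for x, p in zip(x_t, pred)]

def unweighted_loss(eps_fn, x0, alpha_bars, rng):
    """One-sample Monte Carlo estimate of the objective in (Eq. 7)."""
    t = rng.randint(1, len(alpha_bars))  # t uniform on {1, ..., T}
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    a_bar = alpha_bars[t - 1]
    x_t = [math.sqrt(a_bar) * x + math.sqrt(1.0 - a_bar) * e
           for x, e in zip(x0, eps)]
    pred = eps_fn(x_t, t)
    return sum((e - p) ** 2 for e, p in zip(eps, pred))

# Assumed linear schedule and a hypothetical predictor, for illustration only.
T = 50
betas = [1e-4 + (0.05 - 1e-4) * t / (T - 1) for t in range(T)]
alphas, alpha_bars, beta_tildes = schedule_constants(betas)

def zero_eps(x_t, t):
    return [0.0] * len(x_t)

loss = unweighted_loss(zero_eps, [0.0] * 8, alpha_bars, random.Random(0))
mu = mu_theta(zero_eps, [1.0], 1, betas, alphas, alpha_bars)
```

Note that $x_{t}$ is drawn directly from $x_{0}$ in a single step via $\bar{\alpha}_{t}$, so a training iteration never has to simulate the whole chain.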
Fast Sampling
- Given a model trained with [Algorithm 1] below, the most effective denoising steps at sampling time occur near $t=0$
- This motivates a fast sampling algorithm that uses fewer denoising steps $T_{infer}$ than the $T$ steps used in training
  - The $T$-step reverse process is collapsed into a $T_{infer}$-step reverse process by means of a variance schedule

Training / sampling algorithms of the diffusion probabilistic model

3. DiffWave Architecture

The DiffWave architecture:
- With no autoregressive constraint, the network $\epsilon_{\theta} : \mathbb{R}^{L} \times \mathbb{N} \rightarrow \mathbb{R}^{L}$ in (Eq. 5) is built on a bidirectional dilated convolution architecture derived from WaveNet
  - Because the network is non-autoregressive, generating an audio $x_{0}$ of length $L$ from the latent $x_{T}$ takes $T$ rounds of forward propagation ($T$ is smaller than the waveform length $L$)
- The network is a stack of $N$ residual layers with $C$ residual channels
  - The layers are grouped into $m$ blocks, and each block has $n = \frac{N}{m}$ layers
- Each layer uses a bidirectional dilated convolution (Bi-DilConv) with kernel size 3
  - The dilation doubles at each layer within a block ($[1,2,4,...,2^{n-1}]$)
  - As in WaveNet, the skip connections from all residual layers are then summed
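
The dilation pattern and the "bidirectional" part of Bi-DilConv can be illustrated with a single-channel toy version. This is a sketch under assumptions: the values $N=30$ and $m=3$, the zero padding, and the function names are chosen for the example, and a real layer would operate on $C$ channels with learned weights.

```python
def dilation_schedule(num_layers, num_blocks):
    """Dilations for N layers grouped into m blocks: [1, 2, ..., 2^(n-1)] per block."""
    if num_layers % num_blocks != 0:
        raise ValueError("num_layers must be divisible by num_blocks")
    n = num_layers // num_blocks
    return [2 ** (i % n) for i in range(num_layers)]

def bi_dil_conv(x, weights, dilation):
    """Single-channel, kernel-size-3 dilated convolution with zero padding.

    Output i combines x[i - d], x[i] and x[i + d], so it sees both past and
    future samples, unlike the causal convolution of an autoregressive model.
    """
    w_left, w_mid, w_right = weights
    n = len(x)
    out = []
    for i in range(n):
        left = x[i - dilation] if i - dilation >= 0 else 0.0
        right = x[i + dilation] if i + dilation < n else 0.0
        out.append(w_left * left + w_mid * x[i] + w_right * right)
    return out

# Assumed example sizes: N = 30 layers in m = 3 blocks of n = 10 layers each.
dilations = dilation_schedule(num_layers=30, num_blocks=3)

# An impulse at index 2 spreads to indices 0 and 4 when the dilation is 2.
y = bi_dil_conv([0.0, 0.0, 1.0, 0.0, 0.0], (1.0, 2.0, 3.0), dilation=2)
# y == [3.0, 0.0, 2.0, 0.0, 1.0]
```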

- Diffusion-Step Embedding

Because the model must output a different $\epsilon_{\theta}(\cdot, t)$ for each $t$, it is important to include the diffusion step $t$ as an input
- A 128-dimensional encoding vector is therefore used for each $t$:
  (Eq. 8) $t_{embedding} = \left[ \sin \left( 10^{\frac{0\times4}{63}}t\right), ..., \sin \left(10^{\frac{63\times4}{63}}t\right), \cos \left(10^{\frac{0\times4}{63}}t\right), ..., \cos \left(10^{\frac{63\times4}{63}}t\right) \right]$
- Three fully-connected (FC) layers are then applied to the encoding
  - The first two FC layers share parameters across all residual layers
  - The last, residual-layer-specific FC maps the output of the second FC to a $C$-dimensional embedding vector
- Finally, this embedding vector is broadcast over the length and added to the input of every residual layer
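
The encoding in (Eq. 8) is easy to compute directly. The sketch below covers only the sinusoidal encoding, not the three FC layers that follow it. The function name `diffusion_step_encoding` is a hypothetical label.

```python
import math

def diffusion_step_encoding(t, dim=128):
    """Sinusoidal encoding of diffusion step t following (Eq. 8).

    The first half holds sin(10^(4j/63) * t) and the second half holds
    cos(10^(4j/63) * t), for j = 0, ..., 63 when dim = 128.
    """
    half = dim // 2
    freqs = [10.0 ** (j * 4.0 / (half - 1)) for j in range(half)]
    return [math.sin(f * t) for f in freqs] + [math.cos(f * t) for f in freqs]

enc = diffusion_step_encoding(t=3)
```

The frequencies are spaced geometrically from $10^{0}$ to $10^{4}$, so different entries of the vector vary with $t$ at very different rates.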

- Conditional Generation

Local Conditioner
- A neural vocoder can synthesize waveforms conditioned on aligned linguistic features, mel-spectrograms, hidden states, etc.
- The paper configures DiffWave as a mel-spectrogram-conditioned neural vocoder
  - The mel-spectrogram is first upsampled to the same length as the waveform through transposed 2D convolutions
  - A layer-specific Conv $1\times 1$ then maps the mel bands to $2C$ channels
  - The conditioner is added as a bias term to the dilated convolution in each residual layer
Global Conditioner
- The conditional information is given as a global discrete label (speaker ID, word ID, etc.)
- For this, DiffWave uses shared embeddings of dimension $d_{label} = 128$
  - In each residual layer, a layer-specific Conv $1\times1$ maps the $d_{label}$-dimensional embedding to $2C$ channels
  - The embedding is added as a bias term after the dilated convolution in each residual layer
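
The global conditioner path can be sketched with nested lists. This is a toy illustration: the weights are random stand-ins, the sizes $C$, $L$ and the number of classes are assumed values, and the helper names are hypothetical. It relies on the fact that a Conv $1\times1$ applied to a single embedding vector is just a matrix-vector product.

```python
import random

def conv1x1(weight, vec):
    """A 1x1 convolution on a length-1 input reduces to a matrix-vector product."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weight]

def add_global_bias(conv_out, bias):
    """Broadcast a per-channel bias over the time axis of a [2C][L] feature map."""
    return [[h + b for h in channel] for channel, b in zip(conv_out, bias)]

rng = random.Random(0)
C, L, d_label, num_classes = 4, 6, 128, 10  # C, L and num_classes are toy values

# Shared label embedding table and one layer-specific projection (random stand-ins).
embedding = [[rng.gauss(0.0, 1.0) for _ in range(d_label)] for _ in range(num_classes)]
proj = [[rng.gauss(0.0, 0.1) for _ in range(d_label)] for _ in range(2 * C)]

label = 7
bias = conv1x1(proj, embedding[label])        # length 2C
conv_out = [[0.0] * L for _ in range(2 * C)]  # placeholder for the Bi-DilConv output
h = add_global_bias(conv_out, bias)           # shape [2C][L]
```

The embedding table is shared by all residual layers, while each layer would hold its own `proj`.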

- Unconditional Generation

In the unconditional setting, the model must generate consistent utterances without any conditional information
- The output units of the network therefore need a receptive field $r$ larger than the utterance length $L$
  - In practice $r \geq 2L$ is required, so that even the left-most and right-most output units have receptive fields covering the whole $L$-dimensional input
- For a stack of dilated convolution layers, the receptive field size is at most $r = (k-1)\sum_{i} d_{i} +1$
  - $k$ : kernel size, $d_{i}$ : dilation of the $i$-th residual layer
  - The number of layers and the dilation cycle could be increased further, but quality degrades with deeper layers and larger dilation cycles
- DiffWave, however, has the advantage of enlarging the receptive field of the output $x_{0}$
  - Iterating from $x_{T}$ to $x_{0}$ in the reverse process increases the receptive field size up to $T \times r$, which makes DiffWave suitable for unconditional generation
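
A quick calculation shows why the single-pass receptive field is too small and how the reverse process helps. The configuration below (30 layers, kernel size 3, dilation cycle $[1, 2, ..., 512]$, 16 kHz audio, one-second utterances, $T = 50$) is an assumed example, not a setting quoted from this review.

```python
def receptive_field(kernel_size, dilations):
    """Upper bound r = (k - 1) * sum(d_i) + 1 for a stack of dilated convolutions."""
    return (kernel_size - 1) * sum(dilations) + 1

# Assumed example: 30 layers, dilation cycle [1, 2, ..., 512] repeated 3 times.
dilations = [2 ** (i % 10) for i in range(30)]
r = receptive_field(3, dilations)  # 2 * 3069 + 1 = 6139 samples

sample_rate = 16000                # assumed sampling rate in Hz
seconds = r / sample_rate          # roughly 0.38 s of audio

L = sample_rate                    # assumed one-second utterance
T = 50                             # assumed number of diffusion steps
single_pass_ok = r >= 2 * L        # False: 6139 < 32000
effective = T * r                  # 306950, the upper bound T x r
reverse_process_ok = effective >= 2 * L  # True
```

Under these assumptions a single forward pass covers well under half a second, whereas the $T \times r$ bound comfortably exceeds $2L$.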

4. Experiments

- Settings

Dataset :
- Neural vocoding : LJSpeech
- Unconditional/Class-conditional generation : Speech Commands Dataset
Comparisons :
- Neural vocoding : WaveGlow, WaveFlow, ClariNet, WaveNet
- Unconditional/Class-conditional generation : WaveGAN, WaveNet

- Results

Neural Vocoding
- DiffWave-Large, with 128 residual channels, achieves the best MOS
  - DiffWave-Base, with 64 residual channels, synthesizes high-quality speech with an MOS of 4.35 using only a small number of diffusion steps
  - Notably, DiffWave-Base is 1.1x faster than real-time even without engineering optimization
- In terms of inference speed, the engineering-optimized DiffWave-Base (fast) and DiffWave-Large (fast)
  - are 5.6x and 3.5x faster than real-time, respectively, while retaining good audio fidelity

Unconditional Generation
- Comparing 1000 generated audio samples, DiffWave achieves the best synthesis quality
- In particular, DiffWave reaches an MOS of 3.39, well above WaveNet (1.43) and WaveGAN (2.03)
- The other quantitative metrics also indicate that DiffWave is better in both diversity and quality

Class-Conditional Generation
- Comparing 100 generated audio samples for each digit 0-9, DiffWave again achieves the best performance
- DiffWave attains an MOS of 3.50, showing strong synthesis quality in class-conditional generation as well

Performance comparison on class-conditional generation

Additional Results
- DiffWave can also be extended to speech denoising
  - In experiments on the SC09 dataset, DiffWave performs denoising without any knowledge of the noise type
- In addition, the digit-conditioned DiffWave model can be used for interpolation
