[Paper Review] WaveGrad: Estimating Gradients for Waveform Generation

feVeRin 2024. 2. 17. 11:33

WaveGrad: Estimating Gradients for Waveform Generation

Score matching and diffusion probabilistic models can be applied to waveform generation
WaveGrad
- A conditional model for waveform generation that estimates the gradient of the data density
- Starts from Gaussian white noise and uses a gradient-based sampler conditioned on the mel-spectrogram
Paper (ICLR 2021) : Paper Link

1. Introduction

Autoregressive models can generate raw waveforms, but their sequential computation makes inference slow
- Non-autoregressive alternatives include normalizing flows and Generative Adversarial Networks (GAN)
  - Non-autoregressive methods reduce the number of sequential operations and so speed up inference, but their quality is lower than that of autoregressive models
- To address this, one can instead sample by estimating the gradient of the data log-density
  - Such a model is trained simply by optimizing a weighted variational lower bound on the log-likelihood

-> The paper therefore proposes WaveGrad, a conditional model that estimates the gradient of the data density

WaveGrad
- Builds on the class of generative models that arise from learning the gradient of the data log-density, i.e. the Stein score function
- At inference time, sampling is performed with a gradient-based sampler such as Langevin dynamics
- Diffusion probabilistic models are closely related to this approach
  - Instead of requiring a tractable likelihood, they optimize a variational lower bound on the log-likelihood
  - The training objective can be reparameterized as denoising score matching and interpreted as estimating the gradient of the data log-density
- As a result, inference is non-autoregressive: the output waveform is generated by starting from Gaussian noise and running a Langevin dynamics sampler

< Overview of WaveGrad >

Combines score matching and diffusion probabilistic models for conditional speech synthesis
Compares a WaveGrad variant conditioned on a discrete refinement step index with a variant conditioned on a continuous scalar indicating the noise level
- The continuous variant is shown to be more effective
As a result, WaveGrad synthesizes high-fidelity audio and outperforms existing non-autoregressive methods in quality

2. Estimating Gradients for Waveform Generation

The Stein score function, Langevin dynamics, and score matching are reviewed first:
- The Stein score function is the gradient of the data log-density $\log p(y)$ with respect to the datapoint $y$:
  (Eq. 1) $s(y)=\nabla_{y}\log p(y)$
- Given the Stein score function $s(\cdot)$, samples $\tilde{y} \sim p(y)$ can be drawn from the corresponding density via Langevin dynamics, which can be viewed as stochastic gradient ascent in data space:
  (Eq. 2) $\tilde{y}_{i+1} =\tilde{y}_{i}+\frac{\eta}{2}s(\tilde{y}_{i})+\sqrt{\eta}z_{i} $
  - $\eta >0$ : step size, $z_{i}\sim \mathcal{N}(0,I)$
  - $I$ is the identity matrix
- A generative model is obtained by training a neural network to learn the Stein score function directly and then using Langevin dynamics for inference
  - This approach is called score matching, and the denoising score matching objective is:
  (Eq. 3) $\mathbb{E}_{y\sim p(y)}\mathbb{E}_{\tilde{y}\sim q(\tilde{y}|y)}\left[|| s_{\theta}(\tilde{y})-\nabla_{\tilde{y}}\log q(\tilde{y}|y)||_{2}^{2}\right]$
  - $p(\cdot)$ : data distribution, $q(\cdot)$ : noise distribution
- A weighted denoising score matching objective can also be used, in which the data is perturbed with several levels of Gaussian noise and the score function $s_{\theta}(\tilde{y},\sigma)$ is conditioned on the standard deviation $\sigma$ of the noise that was used:
  (Eq. 4) $\sum_{\sigma \in S}\lambda(\sigma)\mathbb{E}_{y\sim p(y)}\mathbb{E}_{\tilde{y}\sim \mathcal{N}(y, \sigma)}\left[ \left|\left| s_{\theta}(\tilde{y},\sigma)+\frac{\tilde{y}-y}{\sigma^{2}}\right|\right|_{2}^{2} \right]$
  - $S$ : set of standard deviations used to perturb the data
  - $\lambda(\sigma)$ : weighting function over the different $\sigma$
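
The Langevin update in (Eq. 2) can be sketched in a few lines. This is only a toy illustration, not WaveGrad's sampler: the target density is assumed to be a 1-D Gaussian so that its score is known in closed form, and the names `langevin_sample` and `gaussian_score`, the step size, and the number of steps are all hypothetical choices.

```python
import numpy as np

def langevin_sample(score_fn, y_init, step_size, num_steps, rng):
    """Iterate (Eq. 2): y <- y + (eta / 2) * s(y) + sqrt(eta) * z."""
    y = np.array(y_init, dtype=np.float64)
    for _ in range(num_steps):
        z = rng.standard_normal(y.shape)  # z_i ~ N(0, I)
        y = y + 0.5 * step_size * score_fn(y) + np.sqrt(step_size) * z
    return y

# Toy target (assumption): p(y) = N(mu, sigma^2), whose score is -(y - mu) / sigma^2
mu, sigma = 2.0, 0.5

def gaussian_score(y):
    return -(y - mu) / sigma**2

rng = np.random.default_rng(0)
y_start = rng.standard_normal(5000)  # 5000 independent chains started from white noise
samples = langevin_sample(gaussian_score, y_start, step_size=1e-3,
                          num_steps=3000, rng=rng)
```

With a small step size the chains settle near the target mean and standard deviation; in a learned model, `gaussian_score` would be replaced by the network $s_{\theta}$.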

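WaveGrad adapts this framework to learn a conditional generative model of the form $p(y|x)$
- That is, WaveGrad learns the gradient of the data density and uses a Langevin dynamics sampler for inference
- The denoising score matching framework relies on a noise distribution, which provides the support for learning the gradient of the data log-density
  - This corresponds to $q$ in (Eq. 3) and $\mathcal{N}(\cdot, \sigma)$ in (Eq. 4)
  - The choice of noise distribution therefore plays an important role in obtaining high-quality samples
- For this purpose, WaveGrad draws on diffusion models to generate the noise distribution used to learn the score function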
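
As a short worked step, the regression target in (Eq. 4) follows from (Eq. 3) once the noise distribution is Gaussian. The sketch below assumes that $\mathcal{N}(y,\sigma)$ denotes an isotropic Gaussian centered at $y$ with standard deviation $\sigma$ (covariance $\sigma^{2}I$):

```latex
\begin{aligned}
\log q(\tilde{y}|y) &= -\frac{1}{2\sigma^{2}}||\tilde{y}-y||_{2}^{2} + \text{const} \\
\nabla_{\tilde{y}}\log q(\tilde{y}|y) &= -\frac{\tilde{y}-y}{\sigma^{2}} \\
|| s_{\theta}(\tilde{y},\sigma)-\nabla_{\tilde{y}}\log q(\tilde{y}|y)||_{2}^{2}
  &= \left|\left| s_{\theta}(\tilde{y},\sigma)+\frac{\tilde{y}-y}{\sigma^{2}}\right|\right|_{2}^{2}
\end{aligned}
```

The last line is the summand of (Eq. 4) before weighting by $\lambda(\sigma)$.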

- WaveGrad as a Diffusion Probabilistic Model

Diffusion probabilistic models and the score matching objective are closely related
- Let $y_{0}$ be the waveform and $x$ the conditioning features for $y_{0}$
  - Here $x$ can be linguistic features derived from text, a mel-spectrogram extracted from $y_{0}$, or acoustic features produced by a Text-to-Speech model
- Defined as a diffusion probabilistic model, WaveGrad models the conditional distribution $p_{\theta}(y_{0} |x)$:
  (Eq. 5) $p_{\theta}(y_{0}|x) := \int p_{\theta}(y_{0:N}|x)dy_{1:N}$
  - $y_{1}, ..., y_{N}$ : series of latent variables, each with the same dimension as the data $y_{0}$
  - $N$ : number of latent variables
- The posterior $q(y_{1:N} |y_{0})$ is called the diffusion (forward) process and is defined through a Markov chain:
  (Eq. 6) $q(y_{1:N}|y_{0}):=\prod_{n=1}^{N}q(y_{n}|y_{n-1})$
  - Each iteration adds Gaussian noise:
  (Eq. 7) $q(y_{n}|y_{n-1}):=\mathcal{N}\left( y_{n};\sqrt{(1-\beta_{n})}y_{n-1},\beta_{n}I\right)$
- Under any noise schedule $\beta_{1},...,\beta_{N}$, the diffusion process can be computed in closed form for any step $n$:
  (Eq. 8) $y_{n}=\sqrt{\bar{\alpha}_{n}}y_{0}+\sqrt{(1-\bar{\alpha}_{n})}\epsilon$
  - $\epsilon \sim \mathcal{N}(0,I)$, $\alpha_{n} := 1-\beta_{n}$, $\bar{\alpha}_{n}:= \prod_{s=1}^{n}\alpha_{s}$
- The gradient of the log-density of this noise distribution is then:
  (Eq. 9) $\nabla_{y_{n}} \log q(y_{n}|y_{0})= -\frac{\epsilon}{\sqrt{1-\bar{\alpha}_{n}}} $
- Training on pairs $(y_{0}, y_{n})$ and reparameterizing the neural network to predict $\epsilon_{\theta}$ yields a denoising score matching objective similar to (Eq. 3):
  (Eq. 10) $\mathbb{E}_{n,\epsilon}\left[ C_{n}||\epsilon_{\theta}(\sqrt{\bar{\alpha}_{n}}y_{0}+\sqrt{1-\bar{\alpha}_{n}}\epsilon, x, n)-\epsilon ||_{2}^{2}\right]$
  - $C_{n}$ : constant related to $\beta_{n}$
  - In practice it is beneficial to drop the $C_{n}$ term, which yields a weighted variational lower bound on the log-likelihood
  - In addition, replacing the original $L_{2}$ distance with $L_{1}$ can improve training stability
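
The closed-form forward step (Eq. 8) and the score identity (Eq. 9) can be checked numerically. The following is a minimal sketch under stated assumptions: the linear `betas` schedule, the sine-wave stand-in for a waveform, and the helper names `alpha_bar_from_betas`, `diffuse`, and `l1_noise_loss` are illustrative and not taken from the paper.

```python
import numpy as np

def alpha_bar_from_betas(betas):
    """alpha_bar_n = prod_{s=1}^{n} (1 - beta_s), as defined under (Eq. 8)."""
    return np.cumprod(1.0 - np.asarray(betas, dtype=np.float64))

def diffuse(y0, alpha_bar_n, rng):
    """Closed-form forward sample (Eq. 8); returns y_n and the noise used."""
    eps = rng.standard_normal(y0.shape)
    y_n = np.sqrt(alpha_bar_n) * y0 + np.sqrt(1.0 - alpha_bar_n) * eps
    return y_n, eps

def l1_noise_loss(eps_pred, eps):
    """L1 variant of (Eq. 10) with the C_n weight dropped."""
    return np.mean(np.abs(eps_pred - eps))

rng = np.random.default_rng(0)
betas = np.linspace(1e-4, 0.05, 50)         # illustrative schedule (assumption)
alpha_bar = alpha_bar_from_betas(betas)
y0 = np.sin(np.linspace(0.0, 20.0, 1000))   # stand-in "waveform" (assumption)
n = 30                                      # 1-based step index
ab = alpha_bar[n - 1]
y_n, eps = diffuse(y0, ab, rng)

# Score of q(y_n | y_0) = N(sqrt(ab) * y0, (1 - ab) * I), computed two ways
score_direct = -(y_n - np.sqrt(ab) * y0) / (1.0 - ab)
score_eq9 = -eps / np.sqrt(1.0 - ab)        # right-hand side of (Eq. 9)
```

Both expressions agree up to floating-point error, which is why a network that predicts $\epsilon$ is, up to scaling, a score estimator.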

- Noise Schedule and Conditioning on Noise Level

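In score matching, the noise distribution used during training provides the support for modeling the gradient distribution, so its choice is critical
- The diffusion framework can be viewed as a score matching approach whose noise schedule is parameterized by $\beta_{1},...,\beta_{N}$
  - This is usually set by a hyperparameter heuristic such as a linear decay schedule
- The choice of noise schedule becomes even more important when the number of inference iterations $N$ is minimized for efficiency
  - A schedule with superfluous noise may fail to recover the low-amplitude detail of the waveform, while a schedule with too little noise may lead to a model that does not converge
- The number of diffusion/denoising steps $N$ must also be chosen
  - A large $N$ improves sample quality but raises computational cost
  - A small $N$ speeds up inference but lowers sample quality
- Therefore, to obtain high-fidelity audio with a small $N$, the noise schedule and $N$ need to be tuned together
  - If these hyperparameters are not tuned properly, the sampling procedure has insufficient support for the distribution
  - As a result, the sampler may fail to converge when the sampling trajectory leaves the conditions seen during training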
To address this, WaveGrad reparameterizes the model to condition on the continuous noise level $\bar{\alpha}$ instead of the discrete iteration index $n$
- The loss becomes:
  (Eq. 11) $\mathbb{E}_{\bar{\alpha},\epsilon}\left[ ||\epsilon_{\theta}\left( \sqrt{\bar{\alpha}}y_{0}+\sqrt{1-\bar{\alpha}}\epsilon, x, \sqrt{\bar{\alpha}}\right)-\epsilon||_{1}\right]$
- A technical issue has to be handled before this can be applied
  1. When training the diffusion probabilistic model conditioned on the discrete iteration index as in (Eq. 10), one samples $n \sim Uniform (\{1, ..., N\})$ and then computes the corresponding $\alpha_{n}$
  2. Conditioning directly on the continuous noise level requires a sampling procedure for $\bar{\alpha}$
    - Since $\bar{\alpha}_{n} := \prod_{s=1}^{n}(1-\beta_{s})\in [0,1]$, one could sample from the uniform distribution $\bar{\alpha} \sim Uniform (0,1)$
    - However, this empirically gives poor synthesis quality
- A hierarchical sampling scheme is therefore adopted: a noise schedule with $S$ iterations is defined and all of the corresponding $\sqrt{\bar{\alpha}_{s}}$ are computed:
  (Eq. 12) $l_{0}=1, \,\, l_{s}=\sqrt{\prod_{i=1}^{s}(1-\beta_{i})}$
  - This allows inference over a wide range of trajectories without re-training the model
  - Once the model is trained, different numbers of iterations $N$ can be used at inference time
- WaveGrad's training algorithm (see [Algorithm. 1]):
  - First sample a segment index $s \sim U(\{ 1,...,S\})$, which gives the segment $(l_{s-1},l_{s})$,
  - then sample uniformly within that segment to obtain $\sqrt{\bar{\alpha}}$
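
The hierarchical sampling of the continuous noise level described above can be sketched as follows. This is a hedged illustration rather than the paper's implementation: the linear schedule, the sine-wave stand-in for $y_{0}$, and the function names `noise_levels`, `sample_sqrt_alpha_bar`, and `training_pair` are assumptions made for the example.

```python
import numpy as np

def noise_levels(betas):
    """l_0 = 1, l_s = sqrt(prod_{i=1}^{s} (1 - beta_i))  (Eq. 12)."""
    l = np.sqrt(np.cumprod(1.0 - np.asarray(betas, dtype=np.float64)))
    return np.concatenate(([1.0], l))

def sample_sqrt_alpha_bar(levels, rng):
    """Pick a segment s ~ U({1, ..., S}), then sample uniformly inside it."""
    s = rng.integers(1, len(levels))  # upper bound is exclusive, so s is in 1..S
    # levels decrease with s, so l_s is the lower end of the segment (l_{s-1}, l_s)
    return rng.uniform(levels[s], levels[s - 1])

def training_pair(y0, levels, rng):
    """Noisy input, conditioning scalar, and regression target for (Eq. 11)."""
    sqrt_ab = sample_sqrt_alpha_bar(levels, rng)
    eps = rng.standard_normal(y0.shape)
    y_noisy = sqrt_ab * y0 + np.sqrt(1.0 - sqrt_ab**2) * eps
    return y_noisy, sqrt_ab, eps

rng = np.random.default_rng(0)
levels = noise_levels(np.linspace(1e-4, 0.05, 50))  # illustrative schedule (assumption)
y0 = np.sin(np.linspace(0.0, 20.0, 1000))           # stand-in "waveform" (assumption)
y_noisy, sqrt_ab, eps = training_pair(y0, levels, rng)
```

A network $\epsilon_{\theta}$ would take `y_noisy`, the conditioning features $x$, and `sqrt_ab` as input and be trained with an $L_{1}$ loss against `eps`.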

3. Experiments

- Settings

Dataset : 385 hours of English speech (Internal dataset)
Comparisons : WaveRNN, Parallel WaveGAN, MelGAN, Multi-Band MelGAN, GAN-TTS

- Results

In terms of MOS, WaveGrad reaches synthesis quality close to the ground truth and outperforms the other non-autoregressive models
- In terms of Real Time Factor (RTF), WaveGrad-BASE achieves 0.2 RTF with 4.4 MOS
- WaveRNN reaches the same MOS but is measured at 20.1 RTF

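For an inference schedule to work effectively, two conditions must be satisfied
- The KL-divergence between $y_{N}$ and the standard Normal distribution $\mathcal{N}(0,I)$, $D_{KL} (q(y_{N}|y_{0}) || \mathcal{N}(0,I))$, must be small
- $\beta$ must start from a small value
As a result, conditioning on the noise level is found to generalize better when the number of iterations is small
- In that case, no large performance degradation is observed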
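
The first condition can be checked numerically for a candidate schedule. The sketch below uses the standard closed form for the KL-divergence between a diagonal Gaussian and $\mathcal{N}(0,I)$, applied to $q(y_{N}|y_{0})=\mathcal{N}(\sqrt{\bar{\alpha}_{N}}y_{0},(1-\bar{\alpha}_{N})I)$ from (Eq. 8). The two linear schedules and the sine-wave stand-in for $y_{0}$ are illustrative assumptions and are not the schedules used in the paper.

```python
import numpy as np

def kl_to_standard_normal(y0, alpha_bar_N):
    """D_KL( N(sqrt(ab) * y0, (1 - ab) * I) || N(0, I) ), summed over dimensions."""
    var = 1.0 - alpha_bar_N
    mean_sq = alpha_bar_N * np.asarray(y0, dtype=np.float64) ** 2
    return 0.5 * np.sum(var + mean_sq - 1.0 - np.log(var))

y0 = np.sin(np.linspace(0.0, 20.0, 1000))  # stand-in "waveform" (assumption)

# Both hypothetical schedules start from a small beta (the second condition)
schedules = {
    "too_little_noise": np.linspace(1e-4, 1e-3, 50),
    "more_noise": np.linspace(1e-4, 0.2, 50),
}

kl = {}
for name, betas in schedules.items():
    alpha_bar_N = np.prod(1.0 - betas)     # alpha_bar at the final step N
    kl[name] = kl_to_standard_normal(y0, alpha_bar_N)
```

The schedule that injects too little noise leaves $y_{N}$ far from $\mathcal{N}(0,I)$, so its KL value is much larger than that of the noisier schedule.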

Comparison of WaveGrad performance when conditioning on the discrete index vs. the noise level
