[Paper Review] Elucidating the Design Space of Diffusion-based Generative Models

feVeRin 2024. 4. 7. 14:21

Elucidating the Design Space of Diffusion-based Generative Models

Current diffusion-based generative models are unnecessarily complex
EDM
- Presents a clear design space that lays out the concrete design choices of a diffusion model
- To this end, identifies a range of changes to sampling, the training process, and the pre-conditioning of the score network
Paper (NeurIPS 2022) : Paper Link

1. Introduction

Diffusion-based generative models show excellent synthesis quality in both conditional and unconditional settings
- Improvements to these models have branched out in many directions, such as the sampling schedule, training dynamics, and noise level parameterization
- As a result, the available design space is hard to see, and an individual component cannot easily be modified without touching the whole system

-> The paper therefore analyzes the design of diffusion models from a practical standpoint

EDM
1. Focuses the analysis on the tangible objects and algorithms that appear during training and sampling
  - The goal is to gain insight into how the components are connected and how many degrees of freedom are available in the design of the whole system
  - The focus is on denoising score matching, where a neural network models the score of the noise-level-dependent marginal distribution of training data corrupted by Gaussian noise
2. Analyzes the sampling process used to synthesize images with a diffusion model
  - Identifies the best-performing discretization, and evaluates higher-order Runge-Kutta methods, different sampler schedules, and the usefulness of stochasticity in the sampling process
  - The sampler obtained from this analysis greatly reduces the number of sampling steps needed for synthesis
3. Analyzes several choices in the training of the score-modeling neural network
  - Building on the commonly used DDPM and NCSN networks, evaluates the pre-conditioning of the network input/output and the loss function to find ways of improving the training dynamics
  - Additionally proposes an improved distribution of noise levels during training and shows that non-leaking augmentation is useful for diffusion models

< Overall of EDM >

Performs a comprehensive analysis of the diffusion model design space to find the best ways of improving performance
Applying the resulting improvements yields better performance than existing diffusion models

2. Expressing Diffusion Models in a Common Framework

Let $p_{data}(x)$ denote the data distribution with standard deviation $\sigma_{data}$, and let $p(x;\sigma)$ be the mollified distribution obtained by adding i.i.d. Gaussian noise of standard deviation $\sigma$ to the data
- For $\sigma_{\max} \gg \sigma_{data}$, $p(x;\sigma_{\max})$ is practically indistinguishable from pure Gaussian noise
- The idea of a diffusion model is to randomly sample a noise image $x_{0} \sim \mathcal{N}(0, \sigma^{2}_{\max}I)$ and sequentially denoise it into images $x_{i}$ with noise levels $\sigma_{0}=\sigma_{\max}>\sigma_{1}>...>\sigma_{N}=0$, so that at each level $x_{i}\sim p(x_{i};\sigma_{i})$
  - The endpoint $x_{N}$ of this process is thus distributed according to the data
- Score-based modeling uses a Stochastic Differential Equation (SDE) that maintains the desired distribution $p$ as the sample $x$ evolves over time
  - This allows the diffusion process above to be implemented with a stochastic solver that both removes and adds noise at each iteration
- A probability flow Ordinary Differential Equation (ODE), in which the only source of randomness is the initial noise image $x_{0}$, can also be used
  - The paper starts from this ODE to examine the sampling trajectory and its discretization

- ODE Formulation

The probability flow ODE continuously increases or reduces the noise level of the image when moving forward or backward in time
- To specify the ODE, one first has to choose a schedule $\sigma(t)$ that defines the desired noise level at time $t$
  - e.g.) $\sigma(t)\propto \sqrt{t}$ is mathematically natural, as it corresponds to constant-speed heat diffusion
  - BUT, the choice of schedule has major practical implications and should not be made on the basis of theoretical convenience alone
- The defining characteristic of the probability flow ODE is that evolving a sample $x_{a}\sim p(x_{a};\sigma(t_{a}))$ from time $t_{a}$ to $t_{b}$ yields a sample $x_{b} \sim p(x_{b};\sigma(t_{b}))$ (both forward and backward in time)
  1. This requirement is satisfied by:
    (Eq. 1) $\mathrm{d}x=-\dot{\sigma}(t)\sigma(t)\nabla_{x}\log p(x;\sigma(t))\mathrm{d}t$
    - $\dot{\sigma}$ : time derivative of $\sigma(t)$
    - $\nabla_{x} \log p(x;\sigma)$ : the score function, a vector field that points towards higher data density at a given noise level
  2. An infinitesimal forward step of this ODE nudges the sample away from the data, at a rate that depends on the change in noise level
    - Conversely, a backward step moves the sample towards the data distribution
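
As a worked example of (Eq. 1), assume (purely for illustration; this case does not appear in the post) that the data is Gaussian, $p_{data}=\mathcal{N}(0,\sigma_{data}^{2}I)$, and that $\sigma(t)=t$. Every quantity is then available in closed form:
  $p(x;\sigma)=\mathcal{N}(0,(\sigma_{data}^{2}+\sigma^{2})I) \,\,\Rightarrow\,\, \nabla_{x}\log p(x;\sigma)=-x/(\sigma_{data}^{2}+\sigma^{2})$
  $\mathrm{d}x/\mathrm{d}t=-\dot{\sigma}(t)\sigma(t)\nabla_{x}\log p(x;\sigma(t))=t\,x/(\sigma_{data}^{2}+t^{2})$
  $x(t_{b})=x(t_{a})\sqrt{(\sigma_{data}^{2}+t_{b}^{2})/(\sigma_{data}^{2}+t_{a}^{2})}$
- If $x(t_{a})$ has variance $\sigma_{data}^{2}+t_{a}^{2}$, then $x(t_{b})$ has variance $\sigma_{data}^{2}+t_{b}^{2}$, which is exactly the defining characteristic stated above
- For $t_{b}>t_{a}$ the sample is scaled away from the data, and for $t_{b}<t_{a}$ it is contracted back towards it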

- Denoising Score Matching

The score function has the property that it does not depend on the generally intractable normalization constant of the density function $p(x;\sigma)$
- If $D(x;\sigma)$ is a denoiser function that minimizes the expected $L_{2}$ denoising error for samples drawn from $p_{data}$, separately for every $\sigma$:
  (Eq. 2) $\mathbb{E}_{y\sim p_{data}}\mathbb{E}_{n\sim\mathcal{N}(0,\sigma^{2}I)}|| D(y+n;\sigma)-y ||_{2}^{2}$
  (Eq. 3) $\mathrm{then}\,\,\, \nabla_{x}\log p(x;\sigma)= (D(x;\sigma)-x)/\sigma^{2}$
  - $y$ : training image, $n$ : noise
- The score function thus isolates the noise component from the signal in $x$, and (Eq. 1) amplifies or diminishes it over time
  - The behavior of the ideal $D$ is illustrated in the accompanying figure
- The key observation in diffusion models is that $D(x;\sigma)$ can be implemented as a neural network $D_{\theta}(x;\sigma)$ trained according to (Eq. 2)
  - $D_{\theta}$ may include pre-/post-processing steps, such as scaling $x$ to an appropriate dynamic range
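
To make (Eq. 2) and (Eq. 3) concrete, the following is a minimal sketch for the special case where $p_{data}$ is a finite set of 1D points. In that case the minimizer of (Eq. 2) is a weighted average of the training points, with weights proportional to $\mathcal{N}(x; y_{i}, \sigma^{2})$. The function names `ideal_denoiser` and `score` and the two-point dataset are assumptions made for illustration; they are not taken from the post.

```python
import math

def ideal_denoiser(x, sigma, data):
    """Minimizer of (Eq. 2) when p_data is a finite set of 1D points (toy case)."""
    # Log-weight of each training point y_i under N(x; y_i, sigma^2).
    logits = [-(x - y) ** 2 / (2.0 * sigma ** 2) for y in data]
    top = max(logits)  # subtract the max so exp() cannot overflow
    weights = [math.exp(v - top) for v in logits]
    return sum(w * y for w, y in zip(weights, data)) / sum(weights)

def score(x, sigma, data):
    """(Eq. 3): score of the mollified density, recovered from the denoiser."""
    return (ideal_denoiser(x, sigma, data) - x) / sigma ** 2

data = [-1.0, 1.0]  # hypothetical dataset: two points at x = -1 and x = +1
print("low noise :", ideal_denoiser(0.9, 0.1, data))   # snaps to the nearest point
print("high noise:", ideal_denoiser(0.9, 50.0, data))  # approaches the dataset mean
print("score     :", score(0.3, 0.5, data))
```

At low noise the denoiser output is essentially the nearest training point, while at high noise it approaches the dataset mean, so the score changes character with $\sigma$ even though the data is fixed.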

- Time-dependent Signal Scaling

Introduce an additional scale schedule $s(t)$ and let $x=s(t)\hat{x}$ be a scaled version of the original, non-scaled variable $\hat{x}$
- This changes the time-dependent probability density, and consequently the ODE solution trajectories
- The resulting ODE is a generalization of (Eq. 1):
  (Eq. 4) $\mathrm{d}x =\left[ \frac{\dot{s}(t)}{s(t)}x-s(t)^{2}\dot{\sigma}(t)\sigma(t)\nabla_{x}\log p\left(\frac{x}{s(t)};\sigma(t)\right)\right]\mathrm{d}t$
  - To keep the definition of $p(x;\sigma)$ independent of $s(t)$, the scaling of $x$ is explicitly undone when evaluating the score function

- Solution by Discretization

To solve the ODE, (Eq. 3) is substituted into (Eq. 4) to define the point-wise gradient, and the solution is obtained by numerical integration
- i.e., by taking finite steps over discrete time intervals
- This requires an integration scheme such as the Euler method or a Runge-Kutta method, operating on discrete sampling times $\{t_{0}, t_{1},...,t_{N}\}$
  - A 2nd-order solver is found to offer a better computational trade-off than the commonly used Euler method
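
The substitution of (Eq. 3) into (Eq. 4) can be written out as follows. This is a sketch that reads the gradient in (Eq. 4) as taken with respect to the scaled variable $x$, so that the chain rule contributes a factor $1/s(t)$; with $\hat{x}=x/s(t)$:
  $\nabla_{x}\log p(x/s;\sigma)=\frac{1}{s}\nabla_{\hat{x}}\log p(\hat{x};\sigma)=\frac{1}{s}\cdot\frac{D(x/s;\sigma)-x/s}{\sigma^{2}}$
  $\frac{\mathrm{d}x}{\mathrm{d}t}=\frac{\dot{s}}{s}x-s^{2}\dot{\sigma}\sigma\cdot\frac{1}{s}\cdot\frac{D(x/s;\sigma)-x/s}{\sigma^{2}}=\left(\frac{\dot{\sigma}}{\sigma}+\frac{\dot{s}}{s}\right)x-\frac{\dot{\sigma}s}{\sigma}D\left(\frac{x}{s};\sigma\right)$
- Every term on the right-hand side is now computable from the schedules $\sigma(t), s(t)$ and one evaluation of the denoiser, which is what a numerical integrator needs at each step
- As a sanity check, $\sigma(t)=t, s(t)=1$ gives $\mathrm{d}x/\mathrm{d}t=(x-D(x;t))/t$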

- Putting It Together

Finally, the paper presents formulas for reproducing deterministic variants of three earlier diffusion models, as shown in [Table 1] below
- The purpose of this reframing is to bring out the independent components that are tangled together in earlier methods
- In the EDM framework there are no implicit dependencies between the components
  - i.e., changing one component does not require changes elsewhere in order to keep the model converging to the data
  - In practice, a well-chosen combination of components can work better than the earlier methods

Design Choices Applicable to Each Diffusion Model

3. Improvements to Deterministic Sampling

Improving output quality and reducing the computational cost of sampling are the main directions for improving diffusion models
- The paper hypothesizes that the sampling-related choices are largely independent of the other components, such as network architecture and training details
  - i.e., the training procedure of $D_{\theta}$ should not dictate $\sigma(t), s(t), \{t_{i}\}$
  - From the sampler's point of view, $D_{\theta}$ is simply a black box
- The paper therefore evaluates different samplers on three pre-trained diffusion models and analyzes the results
  1. Baseline results are first obtained for each model with its original sampler implementation; the samplers are then modified using the formulas in [Table 1] above
    - Evaluating the different choices in this way leads to general improvements to the sampling process of diffusion models
  2. Comparisons
    - DDPM++ const (VP) : based on DDPM, with variance preserving (VP)
    - NCSN++ const (VE) : based on SMLD, with variance exploding (VE)
    - DDIM : the ADM (dropout) model, which uses improved DDPM
  3. Results
    - Image quality is measured as Frechet Inception Distance (FID), plotted as a function of the number of Neural Function Evaluations (NFE), as in the figure below
    - Since the cost of sampling is dominated entirely by the cost of $D_{\theta}$, a reduction in NFE translates directly into faster sampling
    - The modified samplers consistently outperform the original deterministic samplers

Performance Comparison of Deterministic Samplers for Each Diffusion Model

- Discretization and High-order Integrators

Solving an ODE numerically is an approximation of following the true solution trajectory
- Each step of the solver introduces truncation error that accumulates over the $N$ steps; since the local error generally scales super-linearly with the step size, increasing $N$ improves the accuracy of the solution
  - The commonly used Euler method is a first-order ODE solver with $\mathcal{O}(h^{2})$ local error with respect to step size $h$
  - Higher-order Runge-Kutta methods scale more favorably, but require multiple evaluations of $D_{\theta}$ per step
- The paper finds that the 2nd-order Heun method provides the best trade-off between truncation error and NFE for diffusion models
  1. As in [Algorithm 1], it introduces an additional correction step for $x_{i+1}$ that accounts for the change in $\mathrm{d}x/\mathrm{d}t$ between $t_{i}$ and $t_{i+1}$
  2. This correction gives $\mathcal{O}(h^{3})$ local error at the cost of one additional evaluation of $D_{\theta}$ per step
    - Stepping to $\sigma=0$ would cause a division by zero, so the sampler reverts to the Euler method for that step
- The time steps $\{t_{i}\}$ determine how the step sizes, and thus the truncation errors, are distributed between different noise levels
  1. The analysis shows that the step size should decrease monotonically with decreasing $\sigma$
  2. The paper uses a parameterized scheme where the time steps are defined according to a sequence of noise levels $\{\sigma_{i}\}$
    - i.e.) $t_{i} =\sigma^{-1}(\sigma_{i})$
  3. Setting $\sigma_{i<N}=(Ai+B)^{\rho}$ and selecting the constants $A, B$ so that $\sigma_{0}=\sigma_{\max}, \sigma_{N-1}=\sigma_{\min}$ gives:
    (Eq. 5) $\sigma_{i<N}=\left({\sigma_{\max}}^{\frac{1}{\rho}}+\frac{i}{N-1}\left({\sigma_{\min}}^{\frac{1}{\rho}}-{\sigma_{\max}}^{\frac{1}{\rho}}\right)\right)^{\rho}, \,\, \mathrm{and}\,\, \sigma_{N}=0$
    - $\rho$ : controls how much the steps near $\sigma_{\min}$ are shortened at the expense of longer steps near $\sigma_{\max}$
    - $\rho=3$ nearly equalizes the truncation error at each step, but $\rho$ in the range of 5 to 10 gives better sampling performance
    - This suggests that errors near $\sigma_{\min}$ have a large impact; the paper uses $\rho=7$
- Using the Heun method with the time steps of (Eq. 5), the sampler reaches the same FID as the Euler method with considerably lower NFE
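
The schedule of (Eq. 5) is short enough to write down directly. The sketch below is one possible implementation; the function name `edm_sigma_schedule` and the values $\sigma_{\min}=0.002$, $\sigma_{\max}=80$, $N=18$ used in the demo are illustrative assumptions rather than settings quoted in the post.

```python
def edm_sigma_schedule(n_steps, sigma_min, sigma_max, rho=7.0):
    """Noise levels sigma_0 > ... > sigma_{N-1} from (Eq. 5), followed by sigma_N = 0."""
    if n_steps < 2:
        raise ValueError("n_steps must be at least 2")
    hi = sigma_max ** (1.0 / rho)
    lo = sigma_min ** (1.0 / rho)
    # Interpolate linearly in sigma^(1/rho), then raise back to the power rho.
    sigmas = [(hi + i / (n_steps - 1) * (lo - hi)) ** rho for i in range(n_steps)]
    return sigmas + [0.0]

# Illustrative values (assumptions): 18 steps between sigma_max = 80 and sigma_min = 0.002.
sigmas = edm_sigma_schedule(18, sigma_min=0.002, sigma_max=80.0)
steps = [a - b for a, b in zip(sigmas[:-1], sigmas[1:])]
print("first levels:", [round(s, 3) for s in sigmas[:4]])
print("last levels :", [round(s, 5) for s in sigmas[-4:]])
print("first/last step size:", round(steps[0], 3), round(steps[-1], 5))
```

Because the interpolation is linear in $\sigma^{1/\rho}$, a larger $\rho$ concentrates more of the $N$ steps at low noise levels, which is the behavior described for $\rho$ above.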

- Trajectory Curvature and Noise Schedule

The shape of the ODE solution trajectories is defined by the functions $\sigma(t), s(t)$; since the truncation error can be expected to scale in proportion to the curvature of $\mathrm{d}x/\mathrm{d}t$, the choice of these functions offers a way to reduce it
- The paper argues that the best choice is $\sigma(t)=t, s(t)=1$
  - With this choice (Eq. 4) simplifies to $\mathrm{d}x/\mathrm{d}t = (x-D(x;t))/t$, and $\sigma$ and $t$ become interchangeable
- At any $x, t$, a single Euler step to $t=0$ yields the denoised image $D_{\theta}(x;t)$, so the tangent of the solution trajectory always points towards the denoiser output
  1. The denoiser output can be expected to change only slowly with the noise level, which corresponds to largely linear solution trajectories; the 1D ODE in (c) of the figure below shows this effect
  2. The same result can be seen with real data in (b)
    - The change between different denoiser targets occurs in a relatively narrow $\sigma$ range
    - With the advocated schedule, this corresponds to the ODE curvature being limited to that same range

ODE Curvature When $p_{data}$ Has Two Dirac Peaks at $x=\pm 1$
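
The deterministic sampler described above (Heun steps on $\mathrm{d}x/\mathrm{d}t=(x-D(x;t))/t$ with the time steps of (Eq. 5)) can be sketched on the same 1D two-peak example. This is a toy illustration under stated assumptions, not the paper's reference code: a closed-form denoiser for data at $x=\pm 1$ stands in for $D_{\theta}$, and the names `heun_sample`, `sigma_schedule`, `ideal_denoiser` as well as the values $N=18$, $\sigma_{\min}=0.002$, $\sigma_{\max}=80$ are chosen here for illustration.

```python
import math
import random

def ideal_denoiser(x, sigma, data=(-1.0, 1.0)):
    """Closed-form L2-optimal denoiser for a finite 1D dataset (toy stand-in for D_theta)."""
    logits = [-(x - y) ** 2 / (2.0 * sigma ** 2) for y in data]
    top = max(logits)  # subtract the max so exp() cannot overflow
    weights = [math.exp(v - top) for v in logits]
    return sum(w * y for w, y in zip(weights, data)) / sum(weights)

def sigma_schedule(n_steps, sigma_min=0.002, sigma_max=80.0, rho=7.0):
    """Noise levels of (Eq. 5) with sigma_N = 0 appended (assumed default values)."""
    hi, lo = sigma_max ** (1.0 / rho), sigma_min ** (1.0 / rho)
    return [(hi + i / (n_steps - 1) * (lo - hi)) ** rho for i in range(n_steps)] + [0.0]

def heun_sample(denoise, sigmas, rng):
    """Deterministic 2nd-order sampling with sigma(t) = t and s(t) = 1."""
    x = rng.gauss(0.0, sigmas[0])  # x_0 ~ N(0, sigma_max^2)
    nfe = 0
    for t_cur, t_next in zip(sigmas[:-1], sigmas[1:]):
        d_cur = (x - denoise(x, t_cur)) / t_cur   # dx/dt at the current point
        x_next = x + (t_next - t_cur) * d_cur     # Euler step
        nfe += 1
        if t_next > 0.0:                          # no correction when stepping to sigma = 0
            d_next = (x_next - denoise(x_next, t_next)) / t_next
            x_next = x + (t_next - t_cur) * 0.5 * (d_cur + d_next)  # trapezoidal correction
            nfe += 1
        x = x_next
    return x, nfe

rng = random.Random(0)
sigmas = sigma_schedule(18)
samples = [heun_sample(ideal_denoiser, sigmas, rng) for _ in range(200)]
print("NFE per sample:", samples[0][1])
print("first samples :", [round(x, 4) for x, _ in samples[:5]])
```

With $N$ steps the sampler uses $2N-1$ denoiser evaluations, since the last step to $\sigma=0$ is a plain Euler step. In this toy setting every trajectory should end on one of the two data points.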

- Discussion

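The improvements to deterministic sampling proposed in the paper are summarized in the sampling part of [Table 1] above
- Applied together, they reduce the NFE needed to reach high-quality results by 7.3x for VP, 300x for VE, and 3.2x for DDIM
  - In practice, 26.3 CIFAR-10 images per second can be generated on a single NVIDIA V100
- These results are consistent with the paper's hypothesis that the sampling process is orthogonal to how each model was trained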

4. Stochastic Sampling

Deterministic sampling offers the ability to turn real images into their latent representations by inverting the ODE
- BUT, it tends to give worse output quality than stochastic sampling, which injects fresh noise into the image at each step
- Given that ODEs and SDEs recover the same distributions in theory, the exact role of stochasticity needs to be understood

- Background

The SDE can be generalized as the sum of the probability flow ODE of (Eq. 1) and a time-varying Langevin diffusion SDE:
(Eq. 6) $\mathrm{d}x_{\pm} = \underbrace{-\dot{\sigma}(t)\sigma(t)\nabla_{x}\log p(x;\sigma(t))\mathrm{d}t}_{\textrm{probability flow ODE} \,\, (\textbf{Eq. 1})} \pm \underbrace{\underbrace{\beta(t)\sigma(t)^{2}\nabla_{x}\log p(x;\sigma(t))\mathrm{d}t}_{\textrm{deterministic noise decay}} + \underbrace{\sqrt{2\beta(t)}\sigma(t)\mathrm{d}\omega_{t}}_{\textrm{noise injection}}}_{\textrm{Langevin diffusion SDE}}$
- $\omega_{t}$ : standard Wiener process; $\mathrm{d}x_{+}, \mathrm{d}x_{-}$ : separate SDEs for moving forward/backward in time, related by Anderson's time reversal formula
- The Langevin term is the sum of a deterministic score-based denoising term and a stochastic noise injection term
- $\beta(t)$ expresses the relative rate at which existing noise is replaced with new noise
- The original score-based SDEs are recovered with the choice $\beta(t)=\dot{\sigma}(t)/\sigma(t)$, whereby the score vanishes from the forward SDE
The implicit Langevin diffusion drives the sample towards the desired marginal distribution at a given time, correcting errors made in earlier sampling steps
- On the other hand, approximating the Langevin term with discrete SDE solver steps introduces error of its own
- Earlier results suggest that a non-zero $\beta(t)$ is helpful, but the implicit choice of $\beta(t)$ in score-based SDEs has no special properties
  - The optimal amount of stochasticity should therefore be determined empirically

- Stochastic Sampler

EDM proposes a stochastic sampler that combines the 2nd-order deterministic ODE integrator with explicit Langevin-like churn of adding and removing noise ([Algorithm 2])
- At each step $i$, given the sample $x_{i}$ at noise level $t_{i}$ ($=\sigma(t_{i})$), two sub-steps are performed
  1. First, noise is added to the sample according to a factor $\gamma_{i}\geq 0$ to reach a higher noise level $\hat{t}_{i}=t_{i}+\gamma_{i}t_{i}$
  2. Then, starting from the increased-noise sample $\hat{x}_{i}$, the ODE is solved backward from $\hat{t}_{i}$ to $t_{i+1}$ with a single step
    - This yields a sample $x_{i+1}$ at noise level $t_{i+1}$
- The main differences from Euler-Maruyama are:
  1. When discretizing (Eq. 6), Euler-Maruyama does not start from the intermediate state after noise injection, but assumes that $x, \sigma$ remain at the initial state from the beginning of the iteration step
    - It can thus be interpreted as first adding noise and then performing an ODE step evaluated at the pre-noise state
  2. In the EDM sampler, by contrast, the parameters used to evaluate $D_{\theta}$ on line 7 of [Algorithm 2] correspond to the state after noise injection
    - An Euler-Maruyama-like method would use $x_{i};t_{i}$ instead of $\hat{x}_{i};\hat{t}_{i}$
  3. In the limit of $\Delta t$ approaching zero the two choices may not differ, but the distinction becomes significant when pursuing low NFE with large steps

- Practical Considerations

Increasing the amount of stochasticity is effective in correcting errors made in earlier sampling steps
- BUT, it has several drawbacks:
  1. Excessive Langevin-like addition and removal of noise gradually destroys detail in the generated images, for all datasets and denoiser networks
  2. There is also a drift toward oversaturated colors at very low and very high noise levels
    - The suspected cause is that a practical denoiser induces a slightly non-conservative vector field in (Eq. 3), violating the premises of Langevin diffusion
  3. Indeed, this degradation does not appear with an analytical denoiser
- If the degradation is caused by flaws in $D_{\theta}(x;\sigma)$, it can only be remedied with heuristics during sampling
  - Stochasticity is therefore enabled only within a specific range of noise levels $t_{i}\in [S_{\mathrm{tmin}},S_{\mathrm{tmax}}]$ to prevent the drift toward oversaturated colors
- For these noise levels, define $\gamma_{i}=S_{churn}/N$
  1. $S_{churn}$ controls the overall amount of stochasticity, and $\gamma_{i}$ is clamped so that no more new noise is introduced than is already present in the image
  2. Setting $S_{noise}$ slightly above 1 inflates the standard deviation of the newly added noise, which partially counteracts the loss of detail
    - This suggests that a major component of the hypothesized non-conservativity of $D_{\theta}(x;\sigma)$ is a tendency to remove slightly too much noise
    - The most likely cause is the regression toward the mean that can be expected with any $L_{2}$-trained denoiser
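
The churn step and the restricted noise range can be combined into a compact sketch of the stochastic sampler. The clamp is written here as $\gamma_{i}=\min(S_{churn}/N, \sqrt{2}-1)$, which is one way to guarantee that the added noise variance $\hat{t}_{i}^{2}-t_{i}^{2}$ never exceeds the existing $t_{i}^{2}$. The toy denoiser for data at $x=\pm 1$, the function names, and all parameter values below are placeholders chosen for illustration, not tuned settings from the paper.

```python
import math
import random

def ideal_denoiser(x, sigma, data=(-1.0, 1.0)):
    """Closed-form L2-optimal denoiser for a finite 1D dataset (toy stand-in for D_theta)."""
    logits = [-(x - y) ** 2 / (2.0 * sigma ** 2) for y in data]
    top = max(logits)
    weights = [math.exp(v - top) for v in logits]
    return sum(w * y for w, y in zip(weights, data)) / sum(weights)

def sigma_schedule(n_steps, sigma_min=0.002, sigma_max=80.0, rho=7.0):
    """Noise levels of (Eq. 5) with sigma_N = 0 appended (assumed default values)."""
    hi, lo = sigma_max ** (1.0 / rho), sigma_min ** (1.0 / rho)
    return [(hi + i / (n_steps - 1) * (lo - hi)) ** rho for i in range(n_steps)] + [0.0]

def churn_gamma(t_cur, n_steps, s_churn, s_tmin, s_tmax):
    """gamma_i: non-zero only inside [S_tmin, S_tmax], clamped to sqrt(2) - 1."""
    if s_tmin <= t_cur <= s_tmax:
        return min(s_churn / n_steps, math.sqrt(2.0) - 1.0)
    return 0.0

def stochastic_sample(denoise, sigmas, rng,
                      s_churn=40.0, s_tmin=0.05, s_tmax=50.0, s_noise=1.003):
    """Noise churn followed by a Heun step from the increased noise level."""
    n_steps = len(sigmas) - 1
    x = rng.gauss(0.0, sigmas[0])
    for t_cur, t_next in zip(sigmas[:-1], sigmas[1:]):
        # Sub-step 1: raise the noise level from t_cur to t_hat by adding fresh noise.
        gamma = churn_gamma(t_cur, n_steps, s_churn, s_tmin, s_tmax)
        t_hat = t_cur + gamma * t_cur
        eps = rng.gauss(0.0, s_noise)
        x_hat = x + math.sqrt(t_hat ** 2 - t_cur ** 2) * eps
        # Sub-step 2: one backward ODE step from t_hat to t_next,
        # with the denoiser evaluated at the post-noise state (x_hat, t_hat).
        d_cur = (x_hat - denoise(x_hat, t_hat)) / t_hat
        x_next = x_hat + (t_next - t_hat) * d_cur
        if t_next > 0.0:
            d_next = (x_next - denoise(x_next, t_next)) / t_next
            x_next = x_hat + (t_next - t_hat) * 0.5 * (d_cur + d_next)
        x = x_next
    return x

rng = random.Random(0)
sigmas = sigma_schedule(18)
outputs = [stochastic_sample(ideal_denoiser, sigmas, rng) for _ in range(200)]
print("first samples:", [round(x, 4) for x in outputs[:5]])
```

Setting `s_churn=0.0` makes every $\gamma_{i}$ zero, in which case the loop reduces to a deterministic Heun step per iteration.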

- Evaluation

As the accompanying figure shows, the proposed stochastic sampler outperforms earlier samplers at low step counts
- With these sampler changes alone, the FID of a pre-trained ImageNet-64 model improves from 2.07 to 1.55
- On the other hand, stochasticity has the drawback that the best results always require implicit or explicit heuristic choices
  - The paper therefore uses a grid search to find the best values of $\{S_{churn}, S_{\mathrm{tmin}}, S_{\mathrm{tmax}}, S_{noise}\}$ on a case-by-case basis

5. Pre-conditioning and Training

Training a neural network to model $D$ directly is not ideal
- The input $x= y+n$ is a combination of the clean signal $y$ and noise $n\sim \mathcal{N}(0,\sigma^{2}I)$, so its magnitude varies greatly with the noise level $\sigma$
  - For this reason, $D_{\theta}$ is not represented directly as a neural network; instead a different network $F_{\theta}$ is trained, from which $D_{\theta}$ is derived
- Earlier methods handle the input scaling with a $\sigma$-dependent normalization factor and pre-condition the output by training $F_{\theta}$ to predict $n$ scaled to unit variance
  - The signal is then reconstructed via $D_{\theta}(x;\sigma)=x-\sigma F_{\theta}(\cdot)$
- The downside is that at large $\sigma$ the network has to fine-tune its output carefully to cancel out the noise $n$ exactly and give the output at the correct scale
  1. Any error made by the network is amplified by a factor of $\sigma$, so in this regime it can be easier to predict the expected output $D(x;\sigma)$ directly
  2. In the same spirit as earlier parameterizations, the paper pre-conditions the neural network with a $\sigma$-dependent skip connection that allows it to estimate either $y$ or $n$
  3. $D_{\theta}$ is thus written as:
    (Eq. 7) $D_{\theta}(x;\sigma)=c_{skip}(\sigma)x+c_{out}(\sigma)F_{\theta}(c_{in}(\sigma)x; c_{noise}(\sigma))$
    - $F_{\theta}$ : the neural network to be trained, $c_{skip}(\sigma)$ : modulates the skip connection
    - $c_{in}(\sigma), c_{out}(\sigma)$ : scale the input/output magnitudes, $c_{noise}(\sigma)$ : maps the noise level $\sigma$ into a conditioning input for $F_{\theta}$
- Taking a weighted expectation of (Eq. 2) over the noise levels gives the overall training loss $\mathbb{E}_{\sigma, y,n}[\lambda(\sigma)||D(y+n;\sigma)-y||_{2}^{2}]$
  - $\sigma \sim p_{train}, y\sim p_{data}, n \sim \mathcal{N}(0,\sigma^{2}I)$
  - The probability of sampling a given noise level $\sigma$ is given by $p_{train}(\sigma)$, and the corresponding weight by $\lambda(\sigma)$
- With (Eq. 7), this loss can be expressed equivalently in terms of the raw network output $F_{\theta}$:
  (Eq. 8) $\mathbb{E}_{\sigma,y,n}[\underbrace{\lambda(\sigma)c_{out}(\sigma)^{2}}_{\textrm{effective weight}}||\underbrace{F_{\theta}(c_{in}(\sigma)\cdot(y+n);c_{noise}(\sigma))}_{\textrm{network output}}-\underbrace{\frac{1}{c_{out}(\sigma)}(y-c_{skip}(\sigma)\cdot(y+n))}_{\textrm{effective training target}}||_{2}^{2}]$
  - This form reveals the effective training target of $F_{\theta}$, from which suitable choices for the pre-conditioning functions can be determined
  - The network input and training target are required to have unit variance ($c_{in}, c_{out}$), and $c_{skip}$ is chosen to amplify the errors of $F_{\theta}$ as little as possible, which gives the formulas in [Table 1]
  - The formula for $c_{noise}$ is chosen empirically
- The table below shows the FID obtained with the deterministic sampler above
  1. Replacing the original choices of $\{c_{in}, c_{out}, c_{noise}, c_{skip}\}$ with the proposed pre-conditioning (config D) keeps FID largely unchanged, except for VE, which improves at $64\times 64$ resolution
  2. i.e., rather than improving FID directly, pre-conditioning makes training more robust, which helps in redesigning the loss function effectively

Performance Comparison by Training Configuration Using the Deterministic Sampler
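
The pre-conditioning of (Eq. 7) can be sketched as a thin wrapper around a raw network. The coefficient formulas below are the ones that follow from the unit-variance requirements described above, written as recalled from the paper's [Table 1] (which is not reproduced in this post), so they should be treated as an assumption: $c_{skip}=\sigma_{data}^{2}/(\sigma^{2}+\sigma_{data}^{2})$, $c_{out}=\sigma\sigma_{data}/\sqrt{\sigma^{2}+\sigma_{data}^{2}}$, $c_{in}=1/\sqrt{\sigma^{2}+\sigma_{data}^{2}}$, $c_{noise}=\frac{1}{4}\ln\sigma$. The value $\sigma_{data}=0.5$ and the names `precond_coeffs` and `denoiser` are likewise illustrative.

```python
import math

SIGMA_DATA = 0.5  # assumed standard deviation of the data (illustrative)

def precond_coeffs(sigma, sigma_data=SIGMA_DATA):
    """Return (c_skip, c_out, c_in, c_noise) for a given noise level (assumed formulas)."""
    denom = sigma ** 2 + sigma_data ** 2
    c_skip = sigma_data ** 2 / denom               # how much of the noisy input passes through
    c_out = sigma * sigma_data / math.sqrt(denom)  # scale of the network's contribution
    c_in = 1.0 / math.sqrt(denom)                  # normalizes the input to unit variance
    c_noise = 0.25 * math.log(sigma)               # conditioning value fed to the network
    return c_skip, c_out, c_in, c_noise

def denoiser(raw_net, x, sigma, sigma_data=SIGMA_DATA):
    """(Eq. 7): build D_theta from a raw network F_theta(x_scaled, c_noise)."""
    c_skip, c_out, c_in, c_noise = precond_coeffs(sigma, sigma_data)
    return c_skip * x + c_out * raw_net(c_in * x, c_noise)

for sigma in (0.01, 0.5, 80.0):
    c_skip, c_out, c_in, c_noise = precond_coeffs(sigma)
    print(f"sigma={sigma}: c_skip={c_skip:.4f} c_out={c_out:.4f} "
          f"c_in={c_in:.4f} c_noise={c_noise:.4f}")

# Hypothetical untrained network that always outputs zero.
print(denoiser(lambda x_scaled, cond: 0.0, x=2.0, sigma=0.5))
```

At small $\sigma$ the skip connection dominates and the network only has to predict a small correction, while at large $\sigma$ the skip weight vanishes and the network effectively predicts the signal itself.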

- Loss Weighting and Sampling

(Eq. 8) shows that training $F_{\theta}$ with the pre-conditioning of (Eq. 7) gives an effective per-sample loss weight of $\lambda(\sigma)c_{out}(\sigma)^{2}$
- To balance the effective loss weights, $\lambda(\sigma)=1/c_{out}(\sigma)^{2}$ is used
  - This also equalizes the initial training loss over the entire $\sigma$ range, as in (a) of the accompanying figure
- The distribution $p_{train}(\sigma)$ from which noise levels are drawn during training also has to be chosen
  - Inspecting the per-$\sigma$ loss after training shows a significant reduction only at intermediate noise levels
  - At very low levels the vanishingly small noise component is difficult to discern, whereas at very high levels the training targets are always dissimilar from the correct answer, which approaches the dataset average
- The paper therefore uses a simple log-normal distribution for $p_{train}(\sigma)$, as in [Table 1]
  - As the table above shows, using the proposed $p_{train}$ and $\lambda$ (config E) together with the pre-conditioning (config D) leads to a substantial improvement in FID in all cases
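
The loss weighting and the noise-level distribution can be checked numerically on a toy problem. The sketch below assumes $c_{out}(\sigma)=\sigma\sigma_{data}/\sqrt{\sigma^{2}+\sigma_{data}^{2}}$ and $c_{skip}(\sigma)=\sigma_{data}^{2}/(\sigma^{2}+\sigma_{data}^{2})$, a log-normal $p_{train}$ with $\ln\sigma\sim\mathcal{N}(P_{mean}, P_{std}^{2})$ where $P_{mean}=-1.2$ and $P_{std}=1.2$ (values recalled from the paper's [Table 1], so treat them as assumptions), and 1D Gaussian "images" with standard deviation $\sigma_{data}=0.5$. It estimates the weighted loss of an untrained network that outputs zero.

```python
import math
import random

SIGMA_DATA = 0.5            # assumed data standard deviation
P_MEAN, P_STD = -1.2, 1.2   # assumed log-normal parameters of p_train

def sample_sigma(rng):
    """Draw a training noise level with ln(sigma) ~ N(P_MEAN, P_STD^2)."""
    return math.exp(rng.gauss(P_MEAN, P_STD))

def loss_weight(sigma):
    """lambda(sigma) = 1 / c_out(sigma)^2 for the assumed c_out."""
    return (sigma ** 2 + SIGMA_DATA ** 2) / (sigma * SIGMA_DATA) ** 2

def c_skip(sigma):
    return SIGMA_DATA ** 2 / (sigma ** 2 + SIGMA_DATA ** 2)

def initial_loss(sigma, rng, n_samples=20000):
    """Monte Carlo estimate of the weighted loss when F_theta outputs zero."""
    total = 0.0
    for _ in range(n_samples):
        y = rng.gauss(0.0, SIGMA_DATA)      # toy 1D "image"
        n = rng.gauss(0.0, sigma)           # noise at level sigma
        denoised = c_skip(sigma) * (y + n)  # D_theta of (Eq. 7) with F_theta = 0
        total += loss_weight(sigma) * (denoised - y) ** 2
    return total / n_samples

rng = random.Random(0)
for sigma in (0.05, 0.5, 5.0):
    print(f"sigma={sigma}: initial weighted loss ~ {initial_loss(sigma, rng):.3f}")
print("example training sigmas:", [round(sample_sigma(rng), 3) for _ in range(5)])
```

Under these assumptions the estimate should come out close to 1 at every noise level, which is the equalized initial loss described above; the log-normal sampler then decides how often each noise level is visited during training.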

- Augmentation Regularization

To prevent potential overfitting of diffusion models on small datasets, an augmentation pipeline is used
- The pipeline applies various geometric transformations to a training image before noise is added
- To keep the augmentations from leaking into the generated images, the augmentation parameters are provided as a conditioning input to $F_{\theta}$
  - During inference the parameters are set to zero so that only non-augmented images are generated
- As the table above shows, this achieves FIDs of 1.79 and 1.97 for conditional and unconditional CIFAR-10, respectively

- Stochastic Sampling Revisited

As in (b) and (c) of the accompanying figure, the relevance of stochastic sampling diminishes as the model itself improves
- With the proposed training setup, the best results are obtained with deterministic sampling, as in (b)
- For this CIFAR-10 model, stochastic sampling degrades the results instead

- ImageNet-64

Finally, a class-conditional ImageNet-64 model is trained from scratch using the proposed training improvements
- The model uses the ADM architecture and is trained with (config E)
- The resulting model achieves an FID of 1.36, improving on the previous best of 1.48
  - In contrast to the CIFAR-10 results above, stochastic sampling outperforms deterministic sampling on ImageNet
