[Paper Review] StyleGAN: A Style-Based Generator Architecture for Generative Adversarial Networks

feVeRin 2023. 9. 15. 16:01

StyleGAN: A Style-Based Generator Architecture for Generative Adversarial Networks

A generator architecture for Generative Adversarial Networks (GANs) that borrows ideas from style transfer
It learns an unsupervised separation of high-level attributes and stochastic variation, providing scale-specific control over image synthesis
StyleGAN
- Achieves state-of-the-art results on conventional distribution quality metrics
- Provides better interpolation properties and better disentanglement of the latent factors of variation
Paper (CVPR 2019) : Paper Link

1. Introduction

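Image synthesis with GANs is improving rapidly
- However, generators still operate as black boxes, and understanding of stochastic features is lacking
- The latent space is poorly understood, and there is no clearly established way to compare different generators quantitatively

-> Motivated by style transfer, the paper proposes StyleGAN, a generator architecture that makes the image synthesis process controllable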

StyleGAN
- Starts from a learned constant input and adjusts the image style at each convolution layer based on the latent code
  - This directly controls the strength of image features at different scales
- Combines this with noise injected directly into the network
  - Leads to an unsupervised separation of high-level attributes from stochastic variation
  - Enables intuitive scale-specific mixing and interpolation operations
- The generator embeds the input latent code into an intermediate latent space, which strongly affects how the factors of variation are represented in the network
  - The input latent space must follow the probability density of the training data, which leads to unavoidable entanglement
  - The intermediate latent space is free from that restriction and is therefore allowed to be disentangled
- Quantifying the degree of latent space disentanglement in a generator requires dedicated measures
  - Perceptual path length and linear separability are proposed as metrics

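< Overview of StyleGAN >

Modifies only the generator, leaving the discriminator and the loss function unchanged
Admits a more linear, less entangled representation of the factors of variation than a traditional generator
Introduces the Flickr-Faces-HQ (FFHQ) dataset, which offers higher quality and wider variation than existing high-resolution datasets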

2. Style-Based Generator

Traditionally, the latent code is fed to the generator through the input layer, i.e., the first layer of a feed-forward network
The style-based generator omits the input layer and starts from a learned constant instead
- Given a latent code $z$ in the input latent space $Z$,
  - A non-linear mapping network $f: Z \rightarrow W$ produces $w \in W$
- Learned affine transformations
  - Specialize $w$ into styles $y = (y_{s}, y_{b})$ that control the Adaptive Instance Normalization (AdaIN) operation after each convolution layer of the synthesis network $g$
- AdaIN operation
  - $AdaIN(x_{i}, y) = y_{s, i} \frac{x_{i} - \mu(x_{i})} {\sigma(x_{i})} + y_{b,i}$
  - Each feature map $x_{i}$ is normalized separately, then scaled and biased using the corresponding scalar components of the style $y$
  - The dimensionality of $y$ is therefore twice the number of feature maps on that layer
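The AdaIN formula above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function name `adain`, the `(C, H, W)` layout, and the small `eps` added to the denominator are choices made here, not details from the paper.

```python
import numpy as np

def adain(x, y_s, y_b, eps=1e-8):
    """AdaIN for one sample.

    x   : feature maps, shape (C, H, W)
    y_s : per-feature-map style scale, shape (C,)
    y_b : per-feature-map style bias, shape (C,)
    eps : numerical safeguard (an assumption of this sketch)
    """
    mu = x.mean(axis=(1, 2), keepdims=True)     # mean of each feature map
    sigma = x.std(axis=(1, 2), keepdims=True)   # std of each feature map
    x_norm = (x - mu) / (sigma + eps)           # normalize each map separately
    return y_s[:, None, None] * x_norm + y_b[:, None, None]

rng = np.random.default_rng(0)
channels, height, width = 4, 8, 8
x = rng.normal(loc=3.0, scale=5.0, size=(channels, height, width))
y = rng.normal(size=2 * channels)   # style vector: twice the number of feature maps
y_s, y_b = y[:channels], y[channels:]
out = adain(x, y_s, y_b)
```

After the call, each output feature map has mean `y_b[i]` and standard deviation `|y_s[i]|`, regardless of the statistics of `x`.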
Comparison with style transfer
- The proposed method computes the spatially invariant style $y$ from a vector $w$ rather than from an example image
- Why $y$ is called a "style": similar network architectures are already used for feed-forward style transfer, image-to-image translation, and domain mixtures
- Compared with more general feature transforms, AdaIN offers an efficient and compact representation
Explicit noise inputs
- Give the generator a direct means to produce stochastic detail
- Each noise input is a single-channel image of uncorrelated Gaussian noise
  - A dedicated noise image is fed to each layer of the synthesis network
  - The noise image is broadcast to all feature maps using learned per-feature scaling factors and then added to the output of the corresponding convolution
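The broadcast-and-add step for the noise input might look as follows. This is a minimal sketch: `inject_noise` and the toy shapes are hypothetical, and `noise_scale` stands in for the learned per-feature scaling factors, which are fixed by hand here.

```python
import numpy as np

def inject_noise(conv_out, noise_scale, rng):
    """Add one single-channel noise image to every feature map.

    conv_out    : convolution output, shape (C, H, W)
    noise_scale : per-feature scaling factors, shape (C,)
                  (learned in the real model, hand-set here)
    """
    _, height, width = conv_out.shape
    noise = rng.standard_normal((1, height, width))   # uncorrelated Gaussian noise
    # Broadcasting applies the same noise image to all C maps, each with its own scale.
    return conv_out + noise_scale[:, None, None] * noise, noise

rng = np.random.default_rng(0)
conv_out = rng.standard_normal((3, 4, 4))
noise_scale = np.array([0.0, 0.5, 2.0])
out, noise = inject_noise(conv_out, noise_scale, rng)
```

A scaling factor of zero leaves that feature map untouched, which is how a layer can ignore the noise for features where it is not useful.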
Structure of the style-based generator
- The input is mapped to the intermediate latent space $W$, which then controls the generator through AdaIN at each convolution layer
- Gaussian noise is added after each convolution, before the non-linearity is evaluated
  - $A$ : learned affine transform
  - $B$ : learned per-channel scaling factors applied to the noise input
- The mapping network $f$ consists of 8 layers, and the synthesis network $g$ consists of 18 layers
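The path from $z$ to a per-layer style can be sketched as an 8-layer fully connected network followed by one affine transform $A$. The layer width, the weight initialization, and the leaky ReLU activation below are assumptions for illustration; the notes above only fix the number of layers.

```python
import numpy as np

def leaky_relu(x, alpha=0.2):
    return np.where(x > 0, x, alpha * x)

def init_mapping(rng, dim, num_layers=8):
    # Hypothetical initialization: scaled Gaussian weights, zero biases.
    return [(rng.standard_normal((dim, dim)) * np.sqrt(2.0 / dim), np.zeros(dim))
            for _ in range(num_layers)]

def mapping_network(z, params):
    """f: Z -> W as a stack of fully connected layers."""
    h = z
    for weight, bias in params:
        h = leaky_relu(h @ weight + bias)
    return h

def affine_to_style(w, a_weight, a_bias):
    """Affine transform A for one layer: w -> (y_s, y_b)."""
    y = w @ a_weight + a_bias
    half = y.shape[-1] // 2
    return y[..., :half], y[..., half:]

rng = np.random.default_rng(0)
dim, channels = 64, 32                       # toy sizes chosen for this sketch
params = init_mapping(rng, dim)
z = rng.standard_normal(dim)
w = mapping_network(z, params)

a_weight = rng.standard_normal((dim, 2 * channels)) * 0.01
a_bias = np.concatenate([np.ones(channels), np.zeros(channels)])  # scales start near 1
y_s, y_b = affine_to_style(w, a_weight, a_bias)
```

In the full generator each of the 18 synthesis layers would have its own `a_weight` and `a_bias`, all fed by the same `w`.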

- Quality of Generated Images

To verify the quality improvement of the style-based generator, the Frechet Inception Distance (FID) is compared across generator configurations
- Datasets
  - CelebA-HQ / FFHQ
- Comparisons
  - A : Baseline configuration (Progressive GAN)
  - B : Adds bilinear up/downsampling
  - C : Adds the mapping network and AdaIN
  - D : Removes the input layer and starts synthesis from a learned constant input
  - E : Adds noise inputs
  - F : Introduces mixing regularization (decorrelates neighboring styles and enables more fine-grained control)
- Experiment Results
  - The synthesis network produces meaningful results even though it receives input only through the styles that control the AdaIN operations
  - The style-based generator (E) improves FID by almost 20% over the traditional generator (B)

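Images generated by the style-based generator trained on the FFHQ dataset
- Consistent with the FID results, the average synthesis quality is high, and accessories such as eyeglasses and hats are synthesized successfully
- The truncation trick is used to avoid sampling from the extreme regions of $W$
  - Truncation is applied selectively to the low resolutions only, so that high-resolution details are not affected

Images generated by the style-based generator from the FFHQ dataset ($\psi=0.7$)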

- Truncation Trick in $W$

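Given the distribution of the training data, low-density regions are poorly represented and are therefore likely to be difficult for the generator to learn
- Drawing latent vectors from a truncated or otherwise shrunk sampling space tends to improve average image quality, at the cost of some variation
Applying the truncation trick
- Compute the center of mass of $W$ as $\bar{w} = E_{z \sim P(z)} [f(z)]$
  - For FFHQ, this point corresponds to a kind of average face ($\psi = 0$)
  - As $\psi \rightarrow 0$, all faces converge to this mean face
- Scale the deviation of a given $w$ from the center as $w' = \bar{w}+\psi(w-\bar{w})$
  - Here, $\psi < 1$
- Brock et al. observed that only a subset of networks is amenable to such truncation, even when orthogonal regularization is used
  - Truncation in $W$ space, in contrast, appears to work reliably without any change to the loss function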
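The truncation rule $w' = \bar{w}+\psi(w-\bar{w})$ is simple enough to sketch directly. Everything around it here is hypothetical: `toy_f` is a stand-in for the mapping network, $\bar{w}$ is a Monte Carlo estimate, and the `cutoff` of 8 layers for "low resolutions only" is an assumption of this sketch.

```python
import numpy as np

def truncate(w, w_avg, psi=0.7):
    """Pull w toward the center of mass w_avg by a factor psi."""
    return w_avg + psi * (w - w_avg)

def truncate_low_res(w_layers, w_avg, psi=0.7, cutoff=8):
    """Truncate only the first `cutoff` layers (the coarse resolutions).

    w_layers : per-layer latents, shape (num_layers, dim)
    cutoff   : number of coarse layers to truncate (assumed value)
    """
    out = np.array(w_layers, dtype=float, copy=True)
    out[:cutoff] = truncate(out[:cutoff], w_avg, psi)
    return out

rng = np.random.default_rng(0)
dim = 16
proj = rng.standard_normal((dim, dim)) / np.sqrt(dim)
toy_f = lambda z: np.tanh(z @ proj)          # hypothetical stand-in for f

# Monte Carlo estimate of the center of mass E[f(z)].
w_avg = toy_f(rng.standard_normal((10000, dim))).mean(axis=0)

w = toy_f(rng.standard_normal(dim))
w_trunc = truncate(w, w_avg, psi=0.7)

w_layers = np.tile(w, (18, 1))               # same w broadcast to 18 layers
w_layers_trunc = truncate_low_res(w_layers, w_avg, psi=0.7, cutoff=8)
```

With `psi=0` every sample collapses to `w_avg`, and with `psi=1` the latent is returned unchanged.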

3. Properties of the Style-Based Generator

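The style-based generator controls image synthesis through scale-specific modifications to the styles
- The mapping network and affine transformations can be viewed as a way to draw samples for each style from a learned distribution
- The synthesis network then generates an image from that collection of styles, and the effect of each style is localized in the network
  - Modifying a specific subset of the styles can be expected to affect only certain aspects of the image
Why the effects are localized:
- AdaIN first normalizes each channel to zero mean and unit variance, and only then applies the scale and bias derived from the style
- The new per-channel statistics do not depend on the original ones, so each style controls only one convolution before being overridden by the next AdaIN operation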

- Style Mixing

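Mixing Regularization
- To further encourage the styles to localize, a given percentage of training images is generated from two random latent codes instead of one
- Style mixing : switching from one latent code to another at a randomly selected point in the synthesis network
  1. Run two latent codes $z_{1}, z_{2}$ through the mapping network
  2. The corresponding $w_{1}, w_{2}$ control the styles
  3. $w_{1}$ is applied before the crossover point and $w_{2}$ after it
- Mixing regularization prevents the network from assuming that adjacent styles are correlated
Experiment on how mixing regularization affects localization
- FIDs are compared for networks trained with different amounts of mixing regularization, tested with 1 to 4 randomly chosen latents and random crossover points between them
- When several latents are mixed at test time, networks trained with the regularization achieve better FID
  - Mixing regularization considerably improves tolerance to such crossover operations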
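The crossover used in style mixing can be sketched as building one latent per synthesis layer. The helper name `mix_styles`, the toy dimensionality, and the way the crossover point is drawn are assumptions of this sketch; only the 18-layer count and the "w1 before, w2 after" rule come from the notes above.

```python
import numpy as np

NUM_LAYERS = 18   # synthesis network depth stated in the review

def mix_styles(w1, w2, crossover, num_layers=NUM_LAYERS):
    """Return per-layer latents: w1 for layers [0, crossover), w2 afterwards."""
    layer_idx = np.arange(num_layers)[:, None]           # shape (num_layers, 1)
    return np.where(layer_idx < crossover, w1[None, :], w2[None, :])

rng = np.random.default_rng(0)
dim = 16                                                  # toy latent size
w1, w2 = rng.standard_normal(dim), rng.standard_normal(dim)

# Random crossover point in [1, NUM_LAYERS - 1], so both codes are used.
crossover = int(rng.integers(1, NUM_LAYERS))
w_layers = mix_styles(w1, w2, crossover)
```

Choosing a small `crossover` hands only the coarse styles to `w1`, while a large one leaves just the fine styles to `w2`.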

Each subset of styles controls meaningful high-level attributes of the synthesized image
- Copying the styles of the coarse spatial resolutions from source B
  - High-level aspects such as pose, general hair style, face shape, and eyeglasses come from B, while all colors and finer facial features resemble A
- Copying the styles of the middle resolutions from B
  - Smaller-scale facial features, hair style, and eyes open/closed are inherited from B, while pose, general face shape, and eyeglasses are preserved from A
- Copying the fine styles from B
  - Mainly the color scheme and microstructure come from B

Example images generated by mixing two latent codes (source A / B) at various scales

- Stochastic Variation

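Human portraits contain many stochastic aspects, such as the exact placement of hairs, stubble, freckles, and skin pores
- These can be randomized without affecting our perception of the image, as long as they follow the correct distribution
- In a traditional generator,
  - The network must invent a way to generate spatially varying pseudo-random numbers from earlier activations
  : The only input to the network is the one that passes through the input layer
  - This consumes network capacity, and hiding the periodicity of the generated signal is difficult
  : This is a cause of the repetitive patterns commonly seen in generated images
- In StyleGAN,
  - Per-pixel noise is added after each convolution, which sidesteps these issues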

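Comparison of stochastic variation obtained by applying different noise realizations to the same underlying image
- The overall appearance is almost identical across noise realizations, but individual hairs are placed very differently
- The per-pixel standard deviation shows which parts of the image are affected by the noise
  - The main affected areas are the hair, silhouettes, and parts of the background, with some stochastic variation in the eye reflections as well
  - Global aspects such as pose and identity are unaffected by the noise
- Noise affects only the stochastic aspects and leaves the overall composition and high-level attributes such as identity intact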

Comparison of applying noise to different subsets of generator layers
- Comparisons
  - (a) : Noise applied to all layers
  - (b) : No noise
  - (c) : Noise in the fine layers only
  - (d) : Noise in the coarse layers only
- Experiment Results
  - Without noise, the images take on a featureless, painterly look
  - Coarse noise causes large-scale curling of hair and the appearance of larger background features
  - Fine noise brings out finer curls of hair, finer background detail, and skin pores
The effect of noise appears tightly localized in the network
- The paper hypothesizes that the easiest way for the generator to create stochastic variation is to rely on the noise provided
- A fresh set of noise is available at every layer, so there is no incentive to generate stochastic effects from earlier activations

- Separation of Global Effects from Stochasticity

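Changes to the style have global effects, whereas the noise affects only inconsequential stochastic variation
- In style transfer,
  - This is consistent with the finding that spatially invariant statistics reliably encode the style of an image, while spatially varying features encode a specific instance
- In the style-based generator,
  - The style affects the entire image because complete feature maps are scaled and biased with the same values
  : Global effects such as pose, lighting, and background style can be controlled coherently
  - The noise is added independently to each pixel and is therefore suited to controlling stochastic variation
  : If the network tried to control a global effect such as pose through the noise, the resulting spatially inconsistent decisions would be penalized by the discriminator
- StyleGAN thus learns to use the global and local channels appropriately without explicit guidance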

4. Disentanglement Studies

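Disentanglement commonly refers to a latent space that consists of linear subspaces, each of which controls one factor of variation
- The sampling probability of each combination of factors in $Z$ must match the corresponding density in the training data
- This precludes the factors from being fully disentangled for typical datasets and input latent distributions
With the style-based generator, the intermediate latent space $W$ does not have to support sampling according to any fixed distribution
- Its sampling density is induced by the learned piecewise continuous mapping $f(z)$
  - This mapping can be adapted to "unwarp" $W$ so that the factors of variation become more linear
- The paper posits that generating realistic images should be easier from a disentangled representation than from an entangled one
  - Training is therefore expected to yield a less entangled $W$ in an unsupervised setting, i.e., when the factors of variation are not known in advance
- Existing metrics for quantifying disentanglement require an encoder network that maps input images to latent codes
  - They are ill-suited here because the GAN has no such encoder
- The paper therefore proposes two new ways of quantifying disentanglement that require neither an encoder nor known factors of variation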

- Perceptual Path Length

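Interpolating between latent-space vectors can yield surprisingly non-linear changes in the image
- This indicates that the latent space is entangled and the factors of variation are not properly separated
- For example,
  - Features that are absent at both endpoints may appear in the middle of a linear interpolation path
- The effect can be quantified by measuring how drastically the image changes during interpolation in latent space
  - A less curved latent space should yield a perceptually smoother transition than a highly curved one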
Perceptual Path Length
- Uses a perceptually based pairwise image distance, computed as a weighted difference between two VGG16 embeddings
  - The weights are fit so that the metric agrees with human perceptual similarity judgments
- If a latent space interpolation path is subdivided into linear segments, the total perceptual length of the segmented path is the sum of the perceptual differences over the segments
  - Formally, the perceptual path length is the limit of this sum under infinitely fine subdivision
  - In practice, it is approximated with a small subdivision $\epsilon = 10^{-4}$
- Average perceptual path length in latent space $Z$, over all possible endpoints:
  $l_{Z} = E[\frac{1}{\epsilon^{2}} d(G(slerp(z_{1}, z_{2}; t)), G(slerp(z_{1}, z_{2}; t+\epsilon)))]$
  - $z_{1}, z_{2} \sim P(z)$, $t \sim U(0,1)$
  - $G$ : generator ($g \circ f$ for the style-based network)
  - $d(\cdot, \cdot)$ : perceptual distance between two images
  - $slerp$ : spherical interpolation, the appropriate way to interpolate in the normalized input latent space
  - Because $d$ is quadratic, the distance is divided by $\epsilon^{2}$; the expectation is computed over 100,000 samples
- Similarly, the average perceptual path length in $W$, over all possible endpoints:
  $l_{W} = E[\frac{1}{\epsilon^{2}} d(g(lerp(f(z_{1}), f(z_{2}); t)), g(lerp(f(z_{1}), f(z_{2}); t+\epsilon)))]$
  - The only difference from $l_{Z}$ is that interpolation happens in $W$ space
  - Vectors in $W$ are not normalized in any fashion, so linear interpolation ($lerp$) is used
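The Monte Carlo estimate behind $l_{Z}$ and $l_{W}$ can be sketched with toy stand-ins. Nothing below reproduces the actual metric: the identity "generator", the squared Euclidean "perceptual" distance, the fixed endpoints, and the small sample count are all assumptions chosen so the result can be checked by hand.

```python
import numpy as np

def lerp(a, b, t):
    """Linear interpolation, as used in W."""
    return a + (b - a) * t

def slerp(a, b, t):
    """Spherical interpolation between the normalized versions of a and b."""
    a_n = a / np.linalg.norm(a)
    b_n = b / np.linalg.norm(b)
    omega = np.arccos(np.clip(np.dot(a_n, b_n), -1.0, 1.0))
    if omega < 1e-8:                 # nearly parallel: nothing to interpolate
        return a_n
    return (np.sin((1.0 - t) * omega) * a_n + np.sin(t * omega) * b_n) / np.sin(omega)

def path_length(generator, distance, interp, sample_endpoints,
                n_samples=1000, eps=1e-4, rng=None):
    """Estimate E[ d(G(interp(t)), G(interp(t + eps))) / eps^2 ]."""
    if rng is None:
        rng = np.random.default_rng(0)
    values = []
    for _ in range(n_samples):
        a, b = sample_endpoints(rng)
        t = rng.uniform(0.0, 1.0)
        img0 = generator(interp(a, b, t))
        img1 = generator(interp(a, b, t + eps))
        values.append(distance(img0, img1) / eps**2)
    return float(np.mean(values))

# Toy stand-ins (hypothetical): identity generator, squared Euclidean distance.
identity = lambda v: v
sq_dist = lambda u, v: float(np.sum((u - v) ** 2))
a = np.array([1.0, 0.0, 0.0])
b = np.array([0.0, 2.0, 0.0])
fixed_endpoints = lambda rng: (a, b)

l_lerp = path_length(identity, sq_dist, lerp, fixed_endpoints, n_samples=50)
l_slerp = path_length(identity, sq_dist, slerp, fixed_endpoints, n_samples=50)
```

With these stand-ins the `lerp` estimate equals the squared distance between the endpoints, and the `slerp` estimate approaches the squared angle between them, which shows why dividing a quadratic distance by $\epsilon^{2}$ gives a value independent of $\epsilon$.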
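Comparison of perceptual path lengths for the style-based generator with noise inputs
- The substantially shorter full-path length indicates that $W$ is perceptually more linear than $Z$
- This measurement is, however, slightly biased in favor of the input latent space $Z$
  - If $W$ is indeed a disentangled and flattened mapping of $Z$, it may contain regions that are not on the input manifold and are thus badly reconstructed by the generator, even between points mapped from the input manifold
  - Restricting the measure to path endpoints, $t \in \{0, 1\}$, should therefore give a smaller $l_{W}$ while leaving $l_{Z}$ unaffected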

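Effect of the mapping network on path length
- Both traditional and style-based generators benefit from a mapping network, which improves both FID and perceptual path length
- In the traditional generator, $l_{W}$ improves while $l_{Z}$ becomes considerably worse, which illustrates that the input latent space can be arbitrarily entangled in GANs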

- Linear Separability

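If a latent space is sufficiently disentangled, it should be possible to find direction vectors that consistently correspond to individual factors of variation
- The metric measures how well latent-space points can be separated into two distinct sets by a linear hyperplane
- Each set corresponds to a specific binary attribute of the image
  1. To label the generated images, train auxiliary classification networks for a number of binary attributes (e.g., male vs. female faces)
    - To measure the separability of one attribute, generate 200,000 images with $z \sim P(z)$ and classify them with the auxiliary network
    - Sort the samples by classifier confidence and remove the least confident half, which leaves 100,000 labeled latent-space vectors
  2. For each attribute, fit a linear SVM that predicts the label from the latent-space point, and classify the points by this hyperplane
    - Latent-space point : $z$ for a traditional generator, $w$ for the style-based generator
  3. Compute the conditional entropy $H(Y|X)$
    - $X$ : classes predicted by the SVM, $Y$ : classes determined by the pretrained classifier
- $H(Y|X)$ tells how much additional information is required to determine the true class of a sample, given which side of the hyperplane it lies on
  - A low value suggests consistent latent-space directions for the corresponding factor of variation
- Final separability score:
  $exp(\sum_{i} H(Y_{i}|X_{i}))$
  - $i$ : index over attributes
  - Exponentiation brings the values from the logarithmic to the linear domain, which makes them easier to compare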
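The last two steps, the conditional entropy and the final score, can be sketched as below. The SVM and the auxiliary classifier are not reproduced; the arrays are hand-made toy labels, the function names are hypothetical, and measuring entropy in bits (log base 2) is an assumption, since the notes above do not state the base.

```python
import numpy as np

def conditional_entropy(x, y):
    """H(Y|X) in bits for binary label arrays.

    x : class predicted by the linear SVM (0 or 1)
    y : class assigned by the auxiliary classifier (0 or 1)
    """
    x = np.asarray(x)
    y = np.asarray(y)
    h = 0.0
    for x_val in (0, 1):
        mask = x == x_val
        p_x = mask.mean()
        if p_x == 0:
            continue                      # this side of the hyperplane is empty
        for y_val in (0, 1):
            p_y_given_x = (y[mask] == y_val).mean()
            if p_y_given_x > 0:
                h -= p_x * p_y_given_x * np.log2(p_y_given_x)
    return float(h)

def separability_score(pairs):
    """exp of the summed conditional entropies, one (x, y) pair per attribute."""
    return float(np.exp(sum(conditional_entropy(x, y) for x, y in pairs)))

# Toy attributes: one perfectly separable, one where the SVM side carries no information.
perfect = (np.array([0, 0, 1, 1]), np.array([0, 0, 1, 1]))
uninformative = (np.array([0, 0, 1, 1]), np.array([0, 1, 0, 1]))

h_perfect = conditional_entropy(*perfect)
h_uninformative = conditional_entropy(*uninformative)
score = separability_score([perfect, uninformative])
```

A perfectly separable attribute contributes zero entropy, so a score of 1 is the best possible value and larger scores indicate more entanglement.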
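Comparison of separability scores across generators and mapping network depths
- $W$ is consistently better separable than $Z$, which suggests a less entangled representation
- Increasing the depth of the mapping network improves both image quality and separability in $W$
  - This is in line with the hypothesis that the synthesis network favors a disentangled input representation
- When a mapping network is added in front of a traditional generator,
  - Separability in $Z$ is severely lost, but separability in the intermediate latent space $W$ and FID both improve
  - This shows that even the traditional generator performs better with an intermediate latent space that does not have to follow the distribution of the training data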
