[Paper 리뷰] How Powerful Are Performance Predictors in Neural Architecture Search?

티스토리 뷰

Paper/NAS

[Paper 리뷰] How Powerful Are Performance Predictors in Neural Architecture Search?

feVeRin 2023. 5. 28. 11:47

How Powerful Are Performance Predictors in Neural Architecture Search?

Neural Architecture Search (NAS)의 계산 비용을 줄이기 위해서 neural architecture의 성능을 예측하는 방식이 제안됨
Performance predictor는 NAS에서 효과적이지만, 최적화된 Initialization time, Query time 설정과 합의된 평가 지표의 부족으로 다양한 Performance predictor들 간의 효과적인 비교가 어려움
여러 Performance predictor들에 대한 대규모 연구 결과 제시
- Learning curve extrapolation, Weight-sharing, Supervised learning, Zero-cost proxy 등 31개의 predictor
- NAS 속도 및 다양한 설정에서 상관관계 및 순위 기반 성능 테스트
논문 (NeurIPS 2021) : Paper Link

1. Introduction

NAS는 주어진 데이터셋에 대한 neural architecture 개발 프로세스를 자동화하는 것을 목표로 함
- 초기 NAS에서는 수천개의 architecutre를 완성하고 validation accuracy를 통해 성능을 평가
- 최근에는 architecture encoding을 사용해 validation accuracy를 예측하는 Performance prediction 방식을 사용
  
  1. Gaussian Process, Neural Network, Tree-based method
  : 수백개의 완전 훈련된 architecutre가 필요하므로 Initialization time이 오래 걸림
  
  2. Learning Curve Extrapolation
  : Initialization time은 거의 없지만, 개별적인 예측을 위해 부분적인 학습이 필요하므로 Query time이 오래 걸림
  
  3. Weight-sharing
  : NAS에서 인기 있는 방식이지만, ranking architecture에서도 효율적인지는 의문
여러 Performance predictor들이 사용 가능 하지만 어떤 방식이 더 효과적인지 서로 비교할 수 있는 기준이 없음

-> 그래서 Performance predictor에 대한 대규모 연구 수행

4개의 search space, 4개의 dataset, 31개의 performance predictor
다양한 Initialization time, Query time에 대한 Pearson 상관관계 및 순위 상관 비교
무작위로 균등하게 생성된 architecutre에서 최고 성능 architecutre를 mutate하여 실험 진행
Bayesian optimization, Evolution search를 가속화하는 predictor의 성능을 테스트

< Overall of this paper >

여러 Performance predictor에 대한 대규모 연구 결과 제시
31개의 predictor와 4개의 search space를 포함하는 종합적인 library 제공
다양한 predictor를 결합하여 단일 predictor 보다 더 나은 예측력을 달성할 수 있음을 보임

2. Performance Prediction Method for NAS

- NAS와 Performance Prediction

NAS의 목표 : search space $A$가 주어졌을 때 $a^{*} \in argmin_{a \in A}f(a)$를 찾는 것
- $f$ : 고정된 epoch $E$에 대한 고정된 데이터셋으로 학습된 architecutre $a$의 validation error
  -> 이때 $f(a)$를 평가하는 것은 오랜 시간이 걸리므로 이 시간을 줄이기 위해 Performance predictor가 사용됨
- $f^{'}$ : Performance predictor, 일반적으로 architecture를 완전히 학습하지 않고 architecutre의 최종 accuracy나 ranking을 예측하는 함수
  -> $f^{'}$를 평가하는 것은 $f$를 평가하는 것보다 시간이 적게 걸림
  -> 이상적으로 $\{f^{'}(a) | a \in A\}$는 $\{f(a) | a \in A\}$와 높은 상관관계 혹은 순위 상관을 가짐
- Performance prediction은 두 단계로 수행됨
  1. Initialization routine : 일반적인 pre-computation 수행
  2. Query routine : 최종 architecture의 specifiaction을 입력으로 받아서 예측 정확도를 계산
- For example,
  - Early Stopping : 모든 query(a)에 대해서 $E$ 대신 $E/2$ epoch으로 architecture $a$를 학습
  -> pre-computation이 없으므로 Initialization time은 0
  -> 각 입력 architecture에 대한 Query time은 $E/2$ epoch에 대한 학습 시간
Initialization time과 Query time의 실행 시간은 predictor에 따라 다양함
- Initialization은 NAS 알고리즘 시작시 한번만 수행됨
- Query는 NAS 알고리즘 전체에 걸쳐 여러번 수행됨
- 몇몇 predictor는 Initialization 단계의 계산을 업데이트 하는 update 단계가 있을 수 있음

- Performance Predictor의 종류

1. Model-based Method

Supervised learning 기반의 가장 일반적인 predictor 유형
- 입력 feature는 architecture의 adjacency matrix representation
- Initialization 단계
  - datapoint $\{a, f(a)\}$의 training set을 구축하기 위해 다량의 architecture를 완전 학습함
  - predictor model $f^{'}$는 주어진 architecture $a$를 통해 $f(a)$를 예측하도록 training 됨
- Query 단계
  - 학습된 predictor를 기반으로 architecture $a$에 대한 예측 결과 출력
Model-based predictor는 Initialization time이 높지만 Query time은 낮음
Tree-based method, Graph Neural Network, Gaussian Process, 특수 encoding이 적용된 Neural Network

2. Learning Curve-based Method

부분적으로 학습된 network에 learning curve를 extrapolation하여 최종 performance를 예측
- Partial learning curve를 parametric model의 앙상블에 fitting하는 방식
- 관측된 training loss의 합산하는 방식
- Early stopping
Learning Curve-based predictor는 Initialization time은 짧지만 Query time이 느림
Hyperband, BOHB 같은 multi-fidelty 알고리즘과 같이 사용됨

3. Hybrid Method

Model-based와 Learning Curve-based method를 조합한 방식
- initialization 단계에서 architecutre $a$와 $a$의 partial learning curve를 모두 사용해 모델을 학습
Hybrid predictor는 Query time, Initialization time 모두 길지만, 예측 성능이 좋음
SVR, Bayesian Neural Network

4. Zero-cost Method

데이터의 미니배치에서 한번의 forward/backward propagation 결과를 활용하는 방식
Zero-cost predictor는 Initialization time이 없고 Query time도 짧음
네트워크 내의 activation의 상관관계 계산, Pruning을 활용한 Saliency metric을 적용하는 방식

5. Weight Sharing Method

Search space의 모든 architecture가 결합된 single over-parameterized Supernet을 사용
- Supernet의 weight 집합을 활용해 search space의 architecture를 평가
- Supernet을 performance predictor로 사용 가능
Weight Sharing predictor는 Query time은 짧지만, Initialization time이 어느 정도 소요됨

- Initialization time 과 Query time 간의 Trade-off

Initialization time, Quey time은 NAS 알고리즘 유형 및 실행시간 설정 등에 따라 달라짐
- For example,
  - 성능 예측이 필요한 architecture가 많은 경우 Query time이 짧아야 함
  - 실행시간이 넉넉한 경우 Initialization time을 길게 잡을 수 있음
NAS 알고리즘을 실행하는 동안 실행시간 기준을 변경할 수도 있음
- For example,
  - NAS 알고리즘 시작할 때 많은 수의 architecture에 대한 추정치를 얻고 싶은 경우
  : Initialization time, Query time이 짧은 predictor가 유용
  - NAS 알고리즘이 진행됨에 따라 얻어진 architecture에 대한 높은 정확도가 필요한 경우
  : Model-based method, Hybrid method 같은 정확한 predictor가 유용

3. Experiments

- Settings

datasets : NAS-Bench-101, Nas-Bench-201, Nas-Bench-301, Nas-Bench-NLP
Search space 별로 각 performance predictor에 대한 random search 수행
최대 실행시간 15분, 5000회 반복

- Performance Predictor Evaluation

Initialization time, Query time, Performance를 기준으로 predictor 평가
Pearson 상관 및 순위 상관 계수 (Spearman, Kendal Tau, sparse Kendal Tau) 사용해 성능 측정
31개의 Performance predictor 중에서 7개만이 Kendall Tau, Initialization time, Query time에 대해 Parteto-optimal인 것으로 나타남

Zero-cost Predictor
- Jacobian Covariance, SynFlow가 낮은 Initialization time, Query time에서도 높은 Kendall Tau 값을 보임
- Jacobian Covariance, SynFlow는 NAS-Bench-101/201에서 예측 결과를 계산하는데 3초밖에 걸리지 않음
- DARTS search space에서는 Zero-cost predictor들이 제대로 동작하지 않음
Weight Sharing Method
- Zero-cost method와 비슷하게 낮은 Initialization time, Query time을 보였지만, 높은 Kendall Tau 값을 얻지 못함
Learning Curve Extrapolation
- 낮은 Initialization time, 높은 Query time 측면에서 Sum of Training Losses (SoTL-E)가 일관되게 높은 Kendall Tau 값을 보임
GCN / SemiNAS
- NAS 연구에서 가장 관심높은 영역은 높은 Initialization time, 낮은 Query time
  -> 수천개의 후보 architecutre를 빠르게 Query하는 것이 중요하기 때문
- SemiNAS가 Initialization time도 상대적으로 짧으면서 높은 Kendall Tau 값을 보임
- 다른 Model-based Method들
  - Dataset이 충분하다면 더 잘 동작할 가능성이 있지만, Initialization time이 30시간이 지났음에도 Zero-cost predictor인 Jacobian Covariance, SynFlow가 모든 Model-based method를 능가했음
  -> 초기 반복 단계에서 Model-based method 대신 Jacobian Covariance를 쓰면 성능 개선의 효과가 있을 것

각 Search Space 별 높은 Kendall Tau 값을 보인 Performance Predictor

- Omnipotent Predictor

서로 다른 유형의 predictor가 특정 Initialization time 및 Query time에 특화되어 있음을 발견
-> 그렇다면 상호보완적인 각각의 predictor를 결합하면 더 나은 성능을 달성할 수 있을까?
- SoTL-E : 최고 성능의 Learning curve method
- Jacobian Covariance : 최고 성능의 Zero-cost method
- SemiNAS : 최고 성능의 Model-based method
- OMNI (Omnipotent predictor) = SoTL-E + Jacobian Covariance + SemiNAS

OMNI는 바로 다음으로 좋은 predictor보다 30% 높은 Kendall Tau 값을 보임
- 서로 다른 predictor들이 학습한 정보가 상호보완적인 것을 의미
- OMNI에 적용된 Learning Curve Extrapolation, Zero-cost Proxy, Architecture Encoding 모두 성능 향상에 영향을 줌
- SoTL-E는 학습 속도를 측정, Zero-cost predictor는 서로 다른 datapoint activation의 covariance를 측정, Model-based predictor는 architecture encoding과 validation accuracy 간의 패턴을 학습함

- Mutation-based Test

Neighborhood-based NAS 알고리즘은 architecture의 loacl perturbation을 고려해야 함
이때 사용되는 Predictor는 작은 seed architecture의 local mutation을 구별할 수 있어야 함
Zero-cost, Learning Curve method는 비슷한 성능을 보였지만, Model-based method는 낮은 성능을 보임
-> test set의 architecture간 average edit distance가 작기 때문에 구별이 어렵기 때문
Model-based method는 $10^{6}$초 이후 부터 성능 저하가 발생
-> SemiNAS는 앞선 결과들과 달리 특히 나쁜 성능을 보임
-> Boosting Tree는 비교적 강력한 성능을 보임

- Predictor-based NAS Experiments

Predictor-based NAS 알고리즘을 기반으로 NAS 속도를 높일 수 있는 Model-based predictor의 성능 평가
- Predictor-guided Evolution Framework
  - 모집단에서 최적의 architecture가 mutate되면서 후보 architecture를 생성
  - predictor는 전체 모집단에 대해 학습되고 $k$개의 architecture를 선택한 다음 성능을 평가
- Bayesian optimization Framework
  - 각 예측에 대한 uncertainty 추정치를 계산하기 위해서 3개의 performance predictor의 앙상블 사용
  - 각 반복에서 acquisition function을 최대화하는 후보 architecture에 대한 예측을 평가
짧은 Query time에서 높은 Kendall Tau 순위 상관이 있는 Model-based predictor가 NAS에 적용될 때 최적 성능 달성
OMNI를 사용했을 때도 좋은 성능을 보임
-> Model-based method와 Zero-cost method를 함께 사용하는 것이 유용

각 Model-based predictor에 대한 NAS framework 속도

- How Powerful are Performance Predictor?

실행시간이 긴 Model-based, Learning Curve predictor에 비해서 Zero-cost predictor의 사용이 효율적임
다양한 predictor들의 예측 정보는 상호 보완적임
Performance predictor 선택시 Initialization time과 Query time을 고려해 최상의 조합을 결정하는 것을 권장
- For example,
  - (가정) NAS-Bench 201, DARTS와 유사한 Search space
  - (가정) 중간정도의 Initialization time, 짧은 실행시간 제약
  -> Performance predictor = Jacobian Covariance + SynFlow + NBoost로 구성

전체 조합별 각 Performance Predictor들의 Kendall Tau

'Paper > NAS' 카테고리의 다른 글

[Paper 리뷰] Fast Hardware-aware Neural Architecture Search (0)	2023.06.27
[Paper 리뷰] Zen-NAS: A Zero-Shot NAS for High-Performance Image Recognition (0)	2023.06.19
[Paper 리뷰] HELP: Hardware-Adaptive Efficient Latency Prediction for NAS via Meta-Learning (0)	2023.06.17
[Paper 리뷰] MAE-DET: Revisiting Maximum Entropy Principle in Zero-Shot NAS for Efficient Object Detection (0)	2023.05.31
[Paper 리뷰] Once-For-All: Train One Network and Specialize it for Efficient Deployment (0)	2023.04.27

최근에 올라온 글

최근에 달린 댓글

« 2025/11 »
일	월	화	수	목	금	토
						1
2	3	4	5	6	7	8
9	10	11	12	13	14	15
16	17	18	19	20	21	22
23	24	25	26	27	28	29
30

Total

Today

Yesterday

Let IT Begin

티스토리 뷰

[Paper 리뷰] How Powerful Are Performance Predictors in Neural Architecture Search?

How Powerful Are Performance Predictors in Neural Architecture Search?

1. Introduction

2. Performance Prediction Method for NAS

- NAS와 Performance Prediction

- Performance Predictor의 종류

- Initialization time 과 Query time 간의 Trade-off

3. Experiments

- Settings

- Performance Predictor Evaluation

- Omnipotent Predictor

- Mutation-based Test

- Predictor-based NAS Experiments

- How Powerful are Performance Predictor?

'Paper > NAS' 카테고리의 다른 글

티스토리툴바