[Paper Review] HELP: Hardware-Adaptive Efficient Latency Prediction for NAS via Meta-Learning

feVeRin 2023. 6. 17. 12:27

HELP: Hardware-Adaptive Efficient Latency Prediction for NAS via Meta-Learning

To satisfy deployment and per-device constraints, Neural Architecture Search (NAS) must take hardware into account
Existing approaches collect latency samples on the target device through a lookup table or a latency estimator
-> This is impractical because it requires a large number of hardware devices with different specifications
Hardware-adaptive Efficient Latency Predictor (HELP)
- Formulates per-device latency estimation as a meta-learning problem
- Treats each device as a black box that outputs latency, and proposes a hardware embedding that can embed any device
- Meta-learns a hardware-adaptive latency predictor using the hardware embedding
Paper (NeurIPS 2021) : Paper Link

1. Introduction

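For NAS to be applicable in real-world scenarios, it must take hardware into account
It has to find architectures that satisfy various device constraints such as memory, latency, and energy consumption
Latency is the most commonly considered efficiency constraint
- Measuring latency is costly because architecture-latency pairs must be collected on the target device
- OFA reduces model training time by using a high-performing supernet, but still needs a latency predictor built for each device
- BigNAS measures efficiency in FLOPs, which is not an accurate efficiency criterion
- Depending on the device type and task, this takes several hours, so it becomes the bottleneck of latency-constrained NAS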

-> HELP, an efficient latency predictor, is proposed to address this

Knowledge learned from known devices is transferred to new devices, which makes latency estimation on unseen devices highly efficient
- Latency prediction is treated as a few-shot regression problem that estimates the latency for a given architecture-device pair
- A new hardware embedding method is proposed
  - Every device is embedded using the latencies of reference architectures measured on that device
- A latency predictor is learned across multiple devices using the proposed hardware embedding
  - Combines amortized meta-learning and gradient-based meta-learning
Because the hardware embedding is device-agnostic, HELP can be applied to any hardware platform and architecture search space
Combining HELP with a NAS framework reduces the computational bottleneck of latency-constrained NAS
- It is effective when combined with MetaD2A, OFA, and HAT

< Overview of HELP (Hardware-adaptive Efficient Latency Predictor) >

Formulates latency estimation as few-shot regression that outputs the latency for a given architecture-device pair
Proposes a device-agnostic hardware embedding that embeds a device according to the latencies of reference architectures
Proposes HELP, which meta-learns a few-shot regression model to estimate latency on new devices
Combines HELP with existing NAS frameworks to remove the bottleneck of latency-constrained NAS

2. Method

- Problem Definition

(Goal) Design a model that predicts the latency of an architecture-device pair using only a small number of samples from the device
Given a task specification $\tau = \{ h^{\tau}, \textbf{X}^{\tau}, \textbf{Y}^{\tau} \}$, learn a regression model $f(x; \theta) : \textbf{X} \rightarrow \mathbb{R}$ that estimates the latency $y$ of a neural architecture $x$ on the given hardware device $h$
- $h^{\tau} \in \textbf{H}$ : Hardware device
- $\textbf{X}^{\tau} \subset \textbf{X}$ : Set of neural architectures
- $\textbf{Y}^{\tau} \subset \textbf{Y}$ : Set of latencies of $\textbf{X}^{\tau}$
- Minimize the loss $L$ between the predictions $f(\textbf{X}^{\tau}; \theta)$ and the measured values $\textbf{Y}^{\tau}$ : $\min_{\theta} L(f(\textbf{X}^{\tau}; \theta), \textbf{Y}^{\tau})$
BUT, learning this regression model is not straightforward
1. It cannot generalize across devices, so for $N$ devices, $N$ predictors $\{ f^{\tau}(\cdot; \theta^{\tau}) \}^{N}_{\tau=1}$ must be trained separately, each with its own collected samples
2. Obtaining reliable predictions without overfitting the regression model requires a large number of architecture-latency pairs for each device
3. Because generalization over devices and architectures is lacking, a NAS framework has to repeat the sample collection every time a new device is given, which is time-consuming
Using a single predictor $f(\cdot; \theta)$
- Can quickly generalize to new target devices and architectures from only a small number of collected architecture-latency pairs ( $|X^{\tau}| \ll |\textbf{X}^{\tau}|, |Y^{\tau}| \ll |\textbf{Y}^{\tau}|$, where $X^{\tau}, Y^{\tau}$ are the few-shot subsets )
- Transfers the knowledge obtained from the device and architecture pool $p(\tau)$ through a meta-learning framework

- Hardware-adaptive Latency Prediction with Device Embedding

The measured latency $y \leftarrow (x,h)$ depends on both the device type $h$ and the architecture $x$
- Existing latency predictors are trained separately for each device, so they ignore the device condition and take the form $f(x;\theta)$
  - Training such a single latency predictor in a setting that includes new devices degrades performance
- A hardware-conditioned prediction model is needed : $f(x, h;\theta) : \textbf{X} \times \textbf{H} \rightarrow \mathbb{R}$
Hardware-conditioned prediction model
- Can predict different latencies for the same architecture $x$ depending on the device type
- It is important to represent the hardware device $h$ for any device, regardless of platform type
  - Because the physical architecture of hardware devices can differ (CPU, FPGA..)
- The hardware device is treated as a black-box function that outputs the latency measured for a given architecture
Hardware embedding $V_{h}$ : The device's latencies on a fixed set of reference neural architectures
- $E$ : Set of reference neural architectures $\{ x_{1}, x_{2}, ... , x_{d} \} \subset \textbf{X}$
  - Fixed and shared across all meta-training and meta-test tasks
- $d$ : Number of reference architectures
- $y^{*}_{i}(x_{i}, h) = \frac{ y_{i}(x_{i}, h) - \min(V^{(0)}_{h}) }{ \max(V^{(0)}_{h}) - \min(V^{(0)}_{h}) }$ : Min-max normalized latency (0~1)
  - Here, $V^{(0)}_{h} = \{ y_{1}(x_{1}, h), y_{2}(x_{2}, h), ... , y_{d}(x_{d}, h) \}$
- The set of reference devices should be representative, so diverse and heterogeneous devices are selected
  - Reference architectures are randomly sampled from the search space

By treating the hardware device as a black box and using the latency-based hardware embedding, a new device can be embedded without considering its detailed hardware specification
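The normalization above can be sketched in a few lines of Python. This is only an illustration: `hardware_embedding`, `fake_device`, the architecture names, and the latency values are all made up, and the handling of the degenerate case where every latency is equal is an assumption not covered in the post.

```python
from typing import Callable, List, Sequence


def hardware_embedding(
    measure_latency: Callable[[str], float],
    reference_archs: Sequence[str],
) -> List[float]:
    """Min-max normalize the latencies of the fixed reference architectures."""
    raw = [measure_latency(x) for x in reference_archs]  # V_h^(0)
    lo, hi = min(raw), max(raw)
    if hi == lo:
        # Assumption: map a flat latency profile to all zeros.
        return [0.0 for _ in raw]
    return [(y - lo) / (hi - lo) for y in raw]


# Hypothetical device treated as a black box: architecture -> latency in ms.
fake_device = {"arch_a": 4.0, "arch_b": 10.0, "arch_c": 7.0, "arch_d": 16.0}
reference_archs = list(fake_device)  # fixed reference set E, d = 4

v_h = hardware_embedding(fake_device.__getitem__, reference_archs)
print(v_h)  # [0.0, 0.5, 0.25, 1.0]
```

Only latency measurements are needed to build the vector, which is why nothing about the device's hardware specification appears in the function signature.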

- Meta-Learning the Hardware-adaptive Latency Predictor

Using the collected device and architecture pool $p(\tau)$, the few-shot regression problem must be solved across multiple devices
- Hardware-adaptive latency predictor via meta-learning
  - Meta-learn $f(x, h; \theta)$ over the task distribution $p(\tau)$
  - Given a new task specification $\tau = \{ h^{\tau}, V_{h}, X^{\tau}, Y^{\tau} \}$, the predictor $f(x, h; \theta^{\tau})$ is quickly adapted to the neural architectures $x$
The meta-training stage uses an episodic training strategy
- At each iteration, a task $\tau$ is randomly sampled from the device-architecture pool $p(\tau)$ and few-shot regression is performed
  - Training set $D^{\tau} = \{ h^{\tau}, X^{\tau}, Y^{\tau} \}$ / Test set $\tilde{D}^{\tau} = \{ h^{\tau}, \tilde{X}^{\tau}, \tilde{Y}^{\tau} \}$
  - $X^{\tau} \subset \textbf{X}$, $\tilde{X}^{\tau} \subset \textbf{X}$ : Sets of neural architecture samples
  - $X^{\tau}$ : Set of few-shot samples ($|X^{\tau}| \ll |\textbf{X}|$)
  -> The two sets are disjoint ($X^{\tau} \cap \tilde{X}^{\tau} = \emptyset$)
  - $Y^{\tau} \subset \textbf{Y}$, $\tilde{Y}^{\tau} \subset \textbf{Y}$ : Sets of latency values corresponding to the neural architectures $X^{\tau}$, $\tilde{X}^{\tau}$
- Hardware-adaptive prediction model : $f(X, V^{\tau}_{h}; \theta^{\tau})$
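A minimal sketch of how one episode might be drawn, assuming the pool is stored as a nested dictionary from device to architecture to latency. The names `sample_episode`, `pool`, `gpu_a`, `cpu_b`, and all latency values are hypothetical; the only properties taken from the text are the random task sampling and the disjoint training/test split.

```python
import random
from typing import Dict, List, Tuple

Pair = Tuple[str, float]  # (architecture id, measured latency)


def sample_episode(
    pool: Dict[str, Dict[str, float]],
    n_support: int,
    n_query: int,
    rng: random.Random,
) -> Tuple[str, List[Pair], List[Pair]]:
    """Pick one device, then split sampled architectures into disjoint sets."""
    device = rng.choice(sorted(pool))
    # Sampling without replacement guarantees the two sets never overlap.
    archs = rng.sample(sorted(pool[device]), n_support + n_query)
    support = [(a, pool[device][a]) for a in archs[:n_support]]  # training set D
    query = [(a, pool[device][a]) for a in archs[n_support:]]    # test set D-tilde
    return device, support, query


# Hypothetical pool: two devices, 30 architectures each, made-up latencies.
pool = {
    dev: {f"arch_{i}": scale * (1.0 + 0.1 * i) for i in range(30)}
    for dev, scale in [("gpu_a", 2.0), ("cpu_b", 9.0)]
}
rng = random.Random(0)
device, support, query = sample_episode(pool, n_support=10, n_query=5, rng=rng)
print(device, len(support), len(query))
```

The training (support) pairs drive the inner adaptation and the test (query) pairs are used for the test loss of that episode.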

Test loss $L(\cdot; \tilde{D}^{\tau})$ for meta-training the latency predictor
- A task-conditioned latency predictor is obtained by using the task embedding $V^{\tau}_{h}$
- Uses the amortized meta-learning framework, whose goal is to meta-learn a dataset-adaptive performance predictor and NAS framework

Few-shot adaptation is additionally performed using the latencies collected on the target device
- $t$ : The $t$-th inner gradient step
- $T$ : Total number of inner gradient steps
- $\alpha$ : Multi-dimensional global learning rate vector
- Knowledge about the devices used in meta-learning enables fast adaptation to a new device
- A completely new device that is unrelated to the meta-training device pool falls outside the meta-knowledge previously captured by $\theta^{\tau}_{(t)}$

Inner gradient update for few-shot adaptation
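As a rough illustration of such an inner update, the sketch below applies $\theta \leftarrow \theta - \alpha \circ \nabla_{\theta} L$ with one learning rate per parameter, in the style of Meta-SGD. The linear predictor, the mean squared error loss, and every number are assumptions chosen to keep the example small; they are not the predictor or the data from the paper.

```python
from typing import List, Sequence


def predict(theta: Sequence[float], feats: Sequence[float]) -> float:
    """Toy linear predictor: dot product of parameters and input features."""
    return sum(w * f for w, f in zip(theta, feats))


def mse(theta, X, Y) -> float:
    """Mean squared error over the few-shot set."""
    return sum((predict(theta, x) - y) ** 2 for x, y in zip(X, Y)) / len(X)


def mse_grad(theta, X, Y) -> List[float]:
    """Gradient of the mean squared error with respect to theta."""
    grad = [0.0] * len(theta)
    for feats, y in zip(X, Y):
        err = predict(theta, feats) - y
        for j, f in enumerate(feats):
            grad[j] += 2.0 * err * f / len(X)
    return grad


def inner_adapt(theta, alpha, X, Y, steps: int) -> List[float]:
    """Run `steps` updates; alpha holds one learning rate per parameter."""
    theta = list(theta)
    for _ in range(steps):
        grad = mse_grad(theta, X, Y)
        theta = [w - a * g for w, a, g in zip(theta, alpha, grad)]
    return theta


# Hypothetical few-shot set: two features per architecture.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
Y = [3.0, 1.0, 4.0, 7.0]
theta_init = [0.0, 0.0]
alpha = [0.1, 0.05]  # element-wise learning rates (fixed here, not learned)

theta_T = inner_adapt(theta_init, alpha, X, Y, steps=5)
print(mse(theta_init, X, Y), mse(theta_T, X, Y))
```

In a Meta-SGD-style setup the vector `alpha` would itself be learned during meta-training; here it is fixed by hand.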

Hardware-adaptive modulator : $z^{\tau} = g(V^{\tau}_{h}; \phi) : \mathbb{R}^{d} \rightarrow \mathbb{R}^{d_{\theta}}$
- Used to modulate the initial parameters as $\theta_{(0)} = \theta * z^{\tau}$, where $*$ stands for the weight/bias rule below
- $\theta_{(0)}$ : New initialization for hardware $h^{\tau}$
- For weights : $\theta_{(0)} \leftarrow \theta \circ z^{\tau}$ ($\circ$ : element-wise multiplication operator)
- For biases : $\theta_{(0)} \leftarrow \theta + z^{\tau}$
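The weight/bias rule can be sketched as follows. The linear form of `modulator` is an assumption made for brevity (the post only states that $g$ maps $\mathbb{R}^{d}$ to $\mathbb{R}^{d_{\theta}}$), and the flat parameter list with an `is_weight` mask is a hypothetical layout.

```python
from typing import List, Sequence


def modulator(
    v_h: Sequence[float],
    phi_w: Sequence[Sequence[float]],
    phi_b: Sequence[float],
) -> List[float]:
    """Toy linear g(V_h; phi) mapping R^d to R^{d_theta}."""
    return [sum(w * v for w, v in zip(row, v_h)) + b for row, b in zip(phi_w, phi_b)]


def modulate(
    theta: Sequence[float],
    z: Sequence[float],
    is_weight: Sequence[bool],
) -> List[float]:
    """Scale weight entries by z and shift bias entries by z."""
    return [p * s if w else p + s for p, s, w in zip(theta, z, is_weight)]


v_h = [0.0, 0.5, 1.0]  # hardware embedding, d = 3
phi_w = [[1.0, 0.0, 1.0], [0.0, 2.0, 0.0], [0.0, 0.0, 0.5]]  # d_theta = 3
phi_b = [1.0, 0.0, 0.0]

theta = [0.5, -1.0, 0.25]         # shared initial parameters
is_weight = [True, True, False]   # last entry is a bias

z = modulator(v_h, phi_w, phi_b)
theta_0 = modulate(theta, z, is_weight)
print(z)        # [2.0, 1.0, 0.5]
print(theta_0)  # [1.0, -1.0, 0.75]
```

Because `z` depends only on the hardware embedding, two devices with different latency profiles start the inner adaptation from different initializations.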

Inner gradient update rule with the hardware-adaptive modulator applied

Final meta-learning objective function
- Meta-learns both the hardware-adaptive latency predictor and the modulator for the shared initial parameters
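Putting the symbols of this section together, the objective can be written out roughly as below. This is a reconstruction from the definitions in this post rather than a copy of the paper's equations, so the exact indexing of the inner steps and the set of meta-learned variables (for instance, whether $\alpha$ is optimized together with $\theta$ and $\phi$) are assumptions.

$$\theta_{(0)} = \theta * z^{\tau}, \qquad z^{\tau} = g(V^{\tau}_{h}; \phi)$$

$$\theta^{\tau}_{(t+1)} = \theta^{\tau}_{(t)} - \alpha \circ \nabla_{\theta^{\tau}_{(t)}} L\big(f(X^{\tau}, V^{\tau}_{h}; \theta^{\tau}_{(t)}), Y^{\tau}\big)$$

$$\min_{\theta, \phi} \sum_{\tau \sim p(\tau)} L\big(f(\tilde{X}^{\tau}, V^{\tau}_{h}; \theta^{\tau}_{(T+1)}), \tilde{Y}^{\tau}\big)$$

The inner update uses only the few-shot training pairs $(X^{\tau}, Y^{\tau})$, while the outer loss is evaluated on the disjoint test pairs $(\tilde{X}^{\tau}, \tilde{Y}^{\tau})$.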

- Few-shot Adaptation to Unseen Devices (Meta-Test)

The meta-trained latency prediction model $f(\cdot;\theta)$ is used to estimate the latency of architectures on a new device $h^{\nu}$
- With device-conditioned meta-learning, the latency $y^{\nu}$ of a new architecture $\tilde{x}^{\nu}$ can be predicted from only a few latency measurements
- For example,
  1. (Given) Predict the latency of architecture $\tilde{x}^{\nu}$ on a new device $h^{\nu}$, with task $\nu = \{ h^{\nu}, X^{\nu} \}$
  2. Compute the hardware embedding $V_{h^{\nu}}$ : Measure the latencies of the fixed reference architecture set on the device
  3. Use the obtained $V_{h^{\nu}}$ to get the device-optimized parameters $\theta^{\nu}_{(T+1)}$ of the latency predictor
  4. Predict the latency of $\tilde{x}^{\nu}$ with the device-optimized latency predictor $f(\cdot, V_{h^{\nu}};\theta^{\nu}_{(T+1)})$
- Combining the meta latency predictor with a NAS method enables latency-constrained NAS on the new device
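The four steps can be walked through with a toy example. Everything here is a stand-in: architectures are reduced to a single normalized depth value, the predictor is linear and simply concatenates the hardware embedding to the input, `theta_meta` plays the role of the meta-learned initialization, and the modulator is left out for brevity.

```python
from typing import Callable, List, Sequence


def embed(measure: Callable[[float], float], reference_archs: Sequence[float]) -> List[float]:
    """Step 2: min-max normalized latencies of the fixed reference set."""
    raw = [measure(a) for a in reference_archs]
    lo, hi = min(raw), max(raw)
    return [(y - lo) / (hi - lo) for y in raw]


def predict(theta: Sequence[float], feats: Sequence[float]) -> float:
    """Toy linear stand-in for f(x, V_h; theta)."""
    return sum(w * f for w, f in zip(theta, feats))


def loss(theta, X, Y) -> float:
    """Mean squared error on a set of (features, latency) pairs."""
    return sum((predict(theta, x) - y) ** 2 for x, y in zip(X, Y)) / len(X)


def adapt(theta, alpha, X, Y, steps: int) -> List[float]:
    """Step 3: a few gradient steps with element-wise learning rates."""
    theta = list(theta)
    for _ in range(steps):
        grad = [0.0] * len(theta)
        for x, y in zip(X, Y):
            err = predict(theta, x) - y
            for j, f in enumerate(x):
                grad[j] += 2.0 * err * f / len(X)
        theta = [w - a * g for w, a, g in zip(theta, alpha, grad)]
    return theta


def measure(depth: float) -> float:
    """Hypothetical unseen device: latency (ms) grows with normalized depth."""
    return 2.0 * depth + 1.0


reference_archs = [0.1, 0.5, 0.9]            # fixed reference set E
few_shot_archs = [0.2, 0.4, 0.6, 0.8, 1.0]   # X^nu, the few measured samples
theta_meta = [1.0, 0.5, 0.0, 0.0, 0.0]       # stand-in for the meta-learned init
alpha = [0.1] * 5

v_h = embed(measure, reference_archs)                   # step 2
X = [[a, 1.0] + v_h for a in few_shot_archs]
Y = [measure(a) for a in few_shot_archs]
theta_nu = adapt(theta_meta, alpha, X, Y, steps=20)     # step 3

new_arch = 0.7
pred = predict(theta_nu, [new_arch, 1.0] + v_h)         # step 4
print(pred, measure(new_arch))
```

The only measurements taken on the new device are the reference latencies and the few-shot samples; the prediction for `new_arch` needs no further measurement.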

- Computational Complexity of HELP

Meta-training of the latency predictor is performed only once, and the resulting latency predictor is then adapted to each device
Existing methods must collect a large number of samples and train a latency predictor for each target device, whereas HELP only needs 10 samples per device
- Reduces the time complexity from $O(DN)$ to $O(N)$
- $D$ : Number of devices
- $N$ : Number of samples

3. Experiment

- Settings

Search space : NAS-Bench-201, FBNet, MobileNetV3, HAT
Meta-training / Meta-test pool : 18 heterogeneous devices (CPU, GPU, mobile..)
Baselines : MAML, Meta-SGD, ANP, BRP-NAS, MetaD2A

- Efficacy of HELP on Few-shot Latency Estimation for Novel Devices

Methods that use meta-knowledge (Meta-SGD, HELP) outperform training a latency predictor from scratch
- In particular, HELP performs much better than the other methods when the number of samples is small
- This is because it uses the hardware-adaptive initial parameters $\theta_{(0)}$

Latency Estimation Performance for Unseen Devices

HELP outperforms the other, hardware-independent methods
- This is because the heterogeneity of devices in the meta-training dataset makes task conditioning more important

Effect of the Hardware-adaptive Meta-learning

When the meta-training pool contains 10 or more devices, a correlation of 0.9 or higher is achieved on new target devices

With only 10 samples, HELP achieves a Spearman's rank correlation of 0.987 on GPU and 0.989 on CPU, showing the best sample efficiency
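For reference, Spearman's rank correlation compares the ordering of predicted and measured latencies rather than their absolute values. The sketch below computes it as the Pearson correlation of the ranks, with tied values sharing their average rank; the `measured` and `predicted` lists are made-up numbers, not results from the paper.

```python
import math
from typing import List, Sequence


def ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=values.__getitem__)
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            out[order[k]] = avg
        i = j + 1
    return out


def pearson(a: Sequence[float], b: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long sequences."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)


def spearman(a: Sequence[float], b: Sequence[float]) -> float:
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(a), ranks(b))


measured = [1.2, 2.0, 2.8, 3.5, 5.0]   # hypothetical latencies in ms
predicted = [1.0, 2.4, 2.1, 3.9, 4.4]  # hypothetical predictions

rho = spearman(measured, predicted)
print(round(rho, 3))  # 0.9
```

A rank-based metric suits NAS because the search mainly needs the predictor to order architectures correctly by latency.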

- End-to-end Latency-constrained NAS with HELP

HELP achieves the best performance on Pixel2 and Titan RTX, with 90x sample efficiency and 9.8x and 9.9x computational efficiency, respectively
- HELP+MetaD2A found the optimal latency-constrained architecture in 125 seconds
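The role a latency predictor plays inside latency-constrained NAS can be shown with a simple selection step: among candidate architectures, keep those whose predicted latency meets the constraint and return the one with the highest predicted accuracy. This is a generic sketch, not the search procedure of MetaD2A or OFA; the function name, candidate names, accuracies, and latencies are all hypothetical.

```python
from typing import Callable, Optional, Sequence, Tuple


def search_under_latency(
    candidates: Sequence[Tuple[str, float]],
    predict_latency: Callable[[str], float],
    max_latency_ms: float,
) -> Optional[str]:
    """Return the most accurate candidate whose predicted latency fits the budget."""
    feasible = [
        (arch, acc)
        for arch, acc in candidates
        if predict_latency(arch) <= max_latency_ms
    ]
    if not feasible:
        return None
    return max(feasible, key=lambda pair: pair[1])[0]


# Hypothetical candidates: (architecture, predicted accuracy).
candidates = [("arch_a", 0.71), ("arch_b", 0.74), ("arch_c", 0.76)]
# Hypothetical output of an adapted latency predictor, in ms.
predicted_latency = {"arch_a": 12.0, "arch_b": 18.0, "arch_c": 27.0}

best = search_under_latency(candidates, predicted_latency.__getitem__, 20.0)
print(best)  # arch_b
```

Since every feasibility check calls the predictor instead of the device, the cost of the search depends on how cheaply that predictor can be obtained for a new device.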

Latency-constrained NAS results combining various latency predictors with MetaD2A

HELP greatly reduces the total NAS cost for each target device

Latency-constrained NAS results on ImageNet and the MobileNet search space

OFA+HELP reduces the total NAS cost on the target device by up to 2140x

When performing NAS for Transformer models, HELP obtained competitive Transformer models while using 200x fewer samples than the original HAT

Hardware-aware Transformer Architecture Search
