[Paper Review] LightSpeech: Lightweight and Fast Text to Speech with Neural Architecture Search

feVeRin 2023. 7. 13. 11:24

LightSpeech: Lightweight and Fast Text to Speech with Neural Architecture Search

Deploying Text to Speech (TTS) models on a wide range of devices requires a small memory footprint and low inference latency
Non-autoregressive TTS models have achieved fast inference, but they are still hard to deploy on resource-constrained devices
LightSpeech
- Automatic network design based on FastSpeech, using Neural Architecture Search (NAS)
- A new search space that contains a variety of lightweight architectures
Paper (ICASSP 2021) : Paper Link

1. Introduction

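TTS models have typically relied on autoregressive generation, which has high inference latency
- Non-autoregressive TTS models can greatly accelerate inference
- They still have a large model size and high latency when deployed on devices with limited resources
Lightweight architecture design techniques such as Shrinking, Tensor decomposition, Quantization, and Pruning
- Compress a large model into a small one at low computational cost
  - Most of them focus on CNN design, so they are hard to extend to sequence tasks (natural language, speech processing)
- NAS is effective for designing lightweight models
  - However, constructing an appropriate search space and search algorithm is difficult

-> Hence, the paper proposes a NAS-based method for designing a lightweight TTS model with fast inference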

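Considerations when applying NAS
1. Search space
  : The search space determines the range of promising architectures that can be explored
2. Search algorithm
  : An algorithm whose characteristics suit the given task must be chosen
3. Evaluation metric
  : A suitable metric is needed that evaluates architecture performance and can be used within NAS

< Overview of LightSpeech >

Design of a lightweight TTS model with fast inference, using NAS
Search space construction based on an analysis of the bottlenecks of the TTS model
A NAS algorithm based on GBDT-NAS, suited to the search space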

2. Method

- Profiling the Model

FastSpeech 2 is adopted as the backbone of the model
- A Non-autoregressive TTS model capable of fast, high-quality speech synthesis
  - Non-autoregressive TTS models generate speech in parallel and have faster inference than autoregressive models
- It incurs high memory usage and inference latency on devices with limited computing resources
  - 100M memory, 27M parameters, and about 10x slower on CPU than on GPU
Components of FastSpeech 2
- Encoder, Decoder, Duration predictor, Pitch predictor, Energy predictor
  1. Encoder, Decoder : each consists of 4 feed-forward Transformer blocks
  2. Duration predictor : a 2-layer 1D CNN with kernel size 3
  3. Pitch predictor : a 5-layer 1D CNN with kernel size 5
  4. Energy predictor : a 5-layer 1D CNN with kernel size 5 (same structure as the Pitch predictor)
- The Encoder and Decoder account for most of the model size and inference time
  - NAS is applied to shrink the Encoder and Decoder and to discover efficient architectures
- The predictors account for about 1/3 of the total inference time and model size
  - Instead of applying NAS, the variance predictors are manually redesigned with lightweight operations

- Search Space Design

Both the Encoder and the Decoder consist of 4 feed-forward Transformer blocks
- Each feed-forward Transformer block contains a Multi-head self-attention (MHSA) layer and a Feed-forward network (FFN)
  - This Encoder-Decoder structure is used as the network backbone, with the number of layers set to 4
- The variance predictors for Duration, Pitch, and Energy are excluded from the search space because including them leads to degraded speech quality
Candidate operations for the Encoder-Decoder structure
1. LSTM is not considered because of its slow inference speed
2. MHSA and FFN from the original Transformer block are decoupled into separate operations
  - MHSA search space : $\{2, 4, 8\}$ attention heads
3. Separable convolution (SepConv) is used, which is more memory-efficient than vanilla convolution
  - Parameter count of vanilla convolution : $K \times I_{d} \times O_{d}$
  : $K$ : kernel size, $I_{d}$ : input dimension, $O_{d}$ : output dimension
  - Parameter count of SepConv : $K \times I_{d} + I_{d} \times O_{d}$
  - SepConv search space : kernel sizes $\{1, 5, 9, 13, 17, 21, 25\}$
4. Number of candidate operations : $3+7+1 = 11$
  - MHSA (3) + SepConv (7) + FFN (1)
  - Number of candidate architectures in the resulting search space : $11^{4+4} = 11^{8} = 214358881$
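
The operation count and search space size can be checked with a few lines of Python. This is a minimal sketch: the label strings (`mhsa_h2`, `sepconv_k5`, `ffn`) and function names are hypothetical, introduced here only for illustration.

```python
# Candidate operations per layer, as listed in the search space.
# The label strings are hypothetical names, not identifiers from the paper.
MHSA_HEADS = (2, 4, 8)
SEPCONV_KERNELS = (1, 5, 9, 13, 17, 21, 25)


def build_candidate_ops():
    """Return the list of per-layer candidate operations."""
    ops = [f"mhsa_h{h}" for h in MHSA_HEADS]
    ops += [f"sepconv_k{k}" for k in SEPCONV_KERNELS]
    ops.append("ffn")
    return ops


def search_space_size(num_ops, encoder_layers=4, decoder_layers=4):
    """Chain-structured space: every layer picks one operation independently."""
    return num_ops ** (encoder_layers + decoder_layers)


CANDIDATE_OPS = build_candidate_ops()
print(len(CANDIDATE_OPS))                     # 11
print(search_space_size(len(CANDIDATE_OPS)))  # 214358881
```

Because each of the 8 layers chooses independently, the size grows exponentially with depth, which is why exhaustive evaluation is not an option.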
In the variance predictors (Duration, Pitch), the convolutions are replaced with SepConv of the same kernel size
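
The two parameter-count formulas can be compared directly. The sketch below assumes equal input and output dimensions of 256 (the hidden size listed for the discovered architecture) and, like the formulas, ignores bias terms; the function names are made up for this example.

```python
def vanilla_conv_params(kernel_size, in_dim, out_dim):
    """K * I_d * O_d weights for a standard 1D convolution (no bias)."""
    return kernel_size * in_dim * out_dim


def sepconv_params(kernel_size, in_dim, out_dim):
    """Depthwise part (K * I_d) plus pointwise part (I_d * O_d), no bias."""
    return kernel_size * in_dim + in_dim * out_dim


HIDDEN = 256  # assumption: input and output dimensions are both 256
for k in (1, 5, 9, 13, 17, 21, 25):
    v = vanilla_conv_params(k, HIDDEN, HIDDEN)
    s = sepconv_params(k, HIDDEN, HIDDEN)
    print(f"K={k:2d}  vanilla={v:8d}  sepconv={s:6d}  ratio={v / s:.1f}x")
```

At $K=1$ SepConv is actually slightly larger than a vanilla convolution, since the pointwise part alone already costs $I_{d} \times O_{d}$; the saving grows roughly linearly with the kernel size.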

- Search Algorithm

An accuracy-prediction-based search algorithm is adopted, which suits the chain-structured search space
- A Gradient Boosting Decision Tree (GBDT), trained on a set of architecture-accuracy pairs, predicts the accuracy of candidate architectures
- The architectures with the highest predicted accuracy are further evaluated on a hold-out dev set, and the best-performing one is selected
Evaluating TTS speech quality requires human labor, so it is not practical to evaluate every candidate architecture that way during the search
- During the search, the validation loss on the dev set is used as a proxy for accuracy
- As a result, the architecture with the lowest validation loss is selected
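
The overall flow of a predictor-guided search can be sketched as below. This is not the paper's implementation: to stay within the standard library, a simple additive per-layer model stands in for the GBDT, and `toy_validation_loss` is a hypothetical placeholder with made-up numbers for "train the candidate and measure its dev-set loss". Since validation loss is the proxy, lower predicted values are better.

```python
import random

OPS = ([f"mhsa_h{h}" for h in (2, 4, 8)]
       + [f"sepconv_k{k}" for k in (1, 5, 9, 13, 17, 21, 25)]
       + ["ffn"])
NUM_LAYERS = 8  # 4 encoder + 4 decoder layers


def sample_arch(rng):
    """Pick one operation per layer (chain-structured architecture)."""
    return tuple(rng.choice(OPS) for _ in range(NUM_LAYERS))


def toy_validation_loss(arch):
    """HYPOTHETICAL proxy with made-up costs; a real run would train the
    candidate and measure its loss on the dev set."""
    cost = {"mhsa": 0.30, "sepconv": 0.10, "ffn": 0.20}
    return sum(cost[op.split("_")[0]] for op in arch)


class AdditivePredictor:
    """Stand-in for the GBDT: scores an architecture by summing, per layer,
    the mean training loss of architectures that used the same op there."""

    def fit(self, archs, losses):
        self.default = sum(losses) / len(losses)
        sums, counts = {}, {}
        for arch, loss in zip(archs, losses):
            for key in enumerate(arch):  # key = (layer index, op)
                sums[key] = sums.get(key, 0.0) + loss
                counts[key] = counts.get(key, 0) + 1
        self.means = {key: sums[key] / counts[key] for key in sums}
        return self

    def predict(self, arch):
        return sum(self.means.get(key, self.default) for key in enumerate(arch))


def predictor_guided_search(rng, n_train=50, n_candidates=500, top_k=5):
    # 1) Evaluate a small random sample to collect architecture-loss pairs
    train_archs = [sample_arch(rng) for _ in range(n_train)]
    train_losses = [toy_validation_loss(a) for a in train_archs]
    # 2) Fit the predictor on those pairs
    predictor = AdditivePredictor().fit(train_archs, train_losses)
    # 3) Rank a larger candidate pool by predicted loss (lower is better)
    pool = [sample_arch(rng) for _ in range(n_candidates)]
    shortlist = sorted(pool, key=predictor.predict)[:top_k]
    # 4) Evaluate only the shortlist for real and keep the best one
    return min(shortlist, key=toy_validation_loss)


best = predictor_guided_search(random.Random(0))
print(best, toy_validation_loss(best))
```

The point of the design is that only `n_train + top_k` architectures are ever evaluated for real; the other candidates are scored by the cheap predictor. In practice the stand-in would be replaced by an actual GBDT regressor fed with an encoding of each architecture.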

3. Experiments

- Settings

Dataset : LJSpeech
Search Configuration : GBDT-NAS
Training & Inference : NVIDIA P40 GPU

- Results

Audio Quality
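- A CMOS evaluation is conducted to assess speech quality
- The model found by LightSpeech achieves speech quality comparable to FastSpeech2, without degradation
  - LightSpeech has far fewer parameters at 1.8M, a 15x compression ratio compared with the 27M of FastSpeech2
- The manually designed FastSpeech2* is similar in model size to LightSpeech, but its CMOS is much lower
  - This shows the benefit of automatic design for lightweight TTS models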

Speedup and Complexity
- In terms of speed, comparing RTF, LightSpeech reaches $9.3 \times 10^{-3}$ on CPU, a 6.5x speedup
- In terms of computational complexity, comparing MACs, LightSpeech requires 16x fewer than FastSpeech2
  - This makes it feasible to deploy LightSpeech in resource-constrained environments (mobile, embedded devices)
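
To make the reported figures concrete, the sketch below uses the usual definition of the real-time factor (synthesis time divided by the duration of the generated audio) and recomputes the compression ratio from the parameter counts quoted in the results. The helper names and the 10-second utterance are assumptions chosen for illustration.

```python
def real_time_factor(synthesis_seconds, audio_seconds):
    """RTF = time spent synthesizing / duration of the generated audio."""
    return synthesis_seconds / audio_seconds


def compression_ratio(baseline_params, compact_params):
    """How many times smaller the compact model is than the baseline."""
    return baseline_params / compact_params


LIGHTSPEECH_CPU_RTF = 9.3e-3  # CPU RTF quoted in the results
AUDIO_SECONDS = 10.0          # hypothetical utterance length

synthesis_ms = LIGHTSPEECH_CPU_RTF * AUDIO_SECONDS * 1000
print(f"{synthesis_ms:.0f} ms to synthesize {AUDIO_SECONDS:.0f} s of audio")
print(f"{compression_ratio(27e6, 1.8e6):.0f}x fewer parameters")
```

An RTF well below 1 means synthesis runs much faster than playback, which is the property that matters for on-device use.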

- Study and Analysis

Shallowing
- FastSpeech2* is a shallow variant whose Encoder and Decoder each use only 2 feed-forward Transformer blocks
  - The model size drops to 1.8M, but speech quality falls sharply to -0.230 CMOS
- Simply making the Transformer stack shallower degrades performance
SepConv
- Replacing the convolutions in the variance predictors with SepConv does not significantly worsen the loss
- Using SepConv reduces the model size while preserving model capacity

Search Space and NAS
- An architecture sampled at random from the search space shows lower loss than the manually designed FastSpeech2*
  - The search space contains more promising architectures than manual design does
- The architecture found by NAS achieves a loss similar to FastSpeech2

- Discovered Architecture

Encoder : SepConv ($K=5$), SepConv ($K=25$), SepConv ($K=13$), SepConv ($K=9$)
Decoder : SepConv ($K=17$), SepConv ($K=21$), SepConv ($K=9$), SepConv ($K=13$)
- $K$ : kernel size
- Hidden size : 256
