[Paper Review] EfficientSpeech: An On-Device Text to Speech Model


feVeRin 2023. 7. 14. 11:29

EfficientSpeech: An On-Device Text to Speech Model

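Recent Text to Speech (TTS) models are designed with cloud deployment in mind and require a large memory footprint and heavy computation
Such TTS models are hard to deploy on edge devices with limited resources and internet access
EfficientSpeech
- Uses a U-Network built on a shallow non-autoregressive pyramid-structure transformer
- A lightweight speech synthesis model compressed to about 1% of the size of existing TTS models
Paper (ICASSP 2023) : Paper Link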

1. Introduction

Enabling standalone speech synthesis that does not depend on cloud services brings many benefits
- Fewer privacy concerns, robustness, high responsiveness, low latency, etc.
Speech synthesis models such as FastSpeech2, FastPitch, and Mixer-TTS have emerged
- Most are designed around AI accelerators such as GPUs and TPUs
- Running the latest TTS models standalone on-device is difficult
  - Autoregressive models such as Tacotron2, Deep Voice, and TransformerTTS are slow
  - Non-autoregressive models such as FastSpeech2 and Mixer-TTS are fast but take up a large memory footprint
On-device TTS models such as LightSpeech and Nix-TTS
- LightSpeech designs a lightweight architecture using Neural Architecture Search (NAS)
- Nix-TTS trains a text-to-latent encoder and a latent-to-waveform decoder separately and applies knowledge distillation
- Apart from Nix-TTS, none have been verified to run effectively on ARM CPUs

-> Therefore, EfficientSpeech is proposed as a lightweight TTS model suited to edge devices

< Overview of EfficientSpeech >

A shallow U-Network pyramid transformer serves as the phoneme encoder
A shallow transposed convolutional block serves as the Mel-spectrogram decoder
Achieves a competitive CMOS (comparative mean opinion score) with only 266,000 parameters

2. Model Architecture

$x_{phone} \in R^{N \times d}$ : embedding of the input text phonemes
- $N$ : variable phoneme sequence length
- $d=128$ : embedding size
Phoneme encoder : consists of 2 transformer blocks
- Each block consists of a depth-wise separable convolution for feature merging, Self-attention over the merged features, and a Mix-FFN for non-linear feature extraction
  - Mix-FFN is identical to a standard transformer FFN except for an additional convolution layer and a GeLU activation between layers
  - Layer Normalization (LN) is applied after Self-attention and after Mix-FFN
  - Residual connections are applied to Self-attention and Mix-FFN for faster convergence
- The first transformer block keeps the sequence length and reduces the feature dimension to $\frac{d}{4}$
- The second transformer block halves the sequence length and doubles the feature dimension
  - The output feature of each transformer block is upsampled through a linear layer and a transposed convolutional layer
  - An identity layer replaces the transposed convolution when the feature already has the target size $N \times \frac{d}{4}$
  - The two features are fused to form the final phoneme feature
- The U-Network style structure shrinks the feature dimension and sequence length, which lowers the FLOP and parameter counts
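The shape changes described for the phoneme encoder can be traced with a small bookkeeping sketch. It only tracks (sequence length, feature dimension) pairs and contains no real layers. The function name `encoder_shapes`, the rounding-up of odd lengths, and fusion by channel concatenation are assumptions made for illustration, not details confirmed by this review.

```python
import math


def encoder_shapes(n: int, d: int = 128) -> dict:
    """Trace (sequence length, feature dim) through the two encoder blocks."""
    # Block 1: sequence length kept, feature dimension reduced to d/4.
    block1 = (n, d // 4)
    # Block 2: sequence length halved, feature dimension doubled (d/2).
    # ASSUMPTION: odd lengths round up, as a padded stride-2 convolution would.
    block2 = (math.ceil(n / 2), 2 * block1[1])
    # Both outputs are brought to the common target shape N x d/4.
    target = (n, d // 4)
    up1 = target  # already N x d/4: an identity layer replaces the transposed conv
    up2 = target  # linear (d/2 -> d/4) plus transposed convolution (N/2 -> N)
    # ASSUMPTION: the two features are fused by concatenating channels.
    fused = (n, up1[1] + up2[1])
    return {"block1": block1, "block2": block2, "up1": up1, "up2": up2, "fused": fused}


shapes = encoder_shapes(n=50)
print(shapes["block1"], shapes["block2"], shapes["fused"])
```

With $N=50$ and $d=128$, the second block works on a 25-step sequence, which is where the attention cost saving of the pyramid structure comes from.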
Acoustic features & Decoder : adapts the variance adapter of FastSpeech2
- Separate networks predict energy ($y_{e}$), pitch ($y_{p}$), and duration ($y_{d}$)
  - The acoustic parameters are generated in parallel rather than in series, for faster inference
- The energy, pitch, and duration predictions are each produced by 2 Conv-LN-ReLU blocks followed by a final linear layer
  - The binned energy and pitch features are embedded at the last layer, producing $z_{e}$ and $z_{p}$ respectively
  - The duration feature is extracted before the ReLU activation, producing $z_{d}$
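As a rough illustration of what "binned and embedded" means for energy and pitch, the sketch below quantizes a scalar into one of a fixed number of bins and looks up a vector for that bin. The uniform bin spacing, the bin count, the embedding size, and all function names are assumptions for illustration; in a trained model the table would be learned rather than random.

```python
import random
from bisect import bisect_left


def make_bin_edges(lo: float, hi: float, n_bins: int) -> list:
    """Return the n_bins - 1 interior edges of uniform bins on [lo, hi]."""
    step = (hi - lo) / n_bins
    return [lo + step * i for i in range(1, n_bins)]


def bucketize(value: float, edges: list) -> int:
    """Map a scalar to a bin index in [0, len(edges)]; out-of-range values clamp."""
    return bisect_left(edges, value)


def make_embedding_table(n_bins: int, dim: int, seed: int = 0) -> list:
    """Random stand-in for a learned embedding table (one row per bin)."""
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(n_bins)]


def embed(value: float, edges: list, table: list) -> list:
    """Quantize the value, then look up the embedding row of its bin."""
    return table[bucketize(value, edges)]


# Hypothetical settings: 4 bins over a normalized pitch range, 8-dim embedding.
edges = make_bin_edges(0.0, 1.0, n_bins=4)
table = make_embedding_table(n_bins=4, dim=8)
z_p = embed(0.6, edges, table)
```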
Feature fuser & Upsampler : all acoustic features are fused with the phoneme feature
- The fused feature is upsampled to the Mel-sequence length $M$ using the predicted duration $y_{d}$
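Duration-based upsampling can be sketched as repeating each phoneme-level feature vector for as many Mel frames as its duration, so that $M$ equals the sum of the durations. This is the length-regulator idea from FastSpeech-style models; the function name and the rounding of predicted durations to integers are assumptions here, not details taken from the paper.

```python
def upsample_by_duration(features: list, durations: list) -> list:
    """Repeat each phoneme feature vector by its (rounded) frame duration."""
    if len(features) != len(durations):
        raise ValueError("need exactly one duration per phoneme feature")
    frames = []
    for vec, dur in zip(features, durations):
        # ASSUMPTION: durations are rounded to whole frames and clipped at 0.
        repeat = max(0, int(round(dur)))
        frames.extend(list(vec) for _ in range(repeat))
    return frames


# Three phonemes (N = 3) with 1-dim features; durations give M = 2 + 0 + 3 = 5.
phoneme_features = [[1.0], [2.0], [3.0]]
mel_level = upsample_by_duration(phoneme_features, [2, 0, 3])
print(len(mel_level))
```

A phoneme with duration 0 contributes no frames, which is why the output length depends only on the predicted durations and not on $N$.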
Mel-spectrogram Decoder : consists of 2 linear layers and 2 depth-wise separable convolutions
- Each layer uses a Tanh activation and LN
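Part of the small parameter budget comes from using depth-wise separable convolutions in both the encoder and the decoder. The sketch below compares the parameter count of a standard 1D convolution with its depth-wise separable counterpart (a per-channel depth-wise convolution followed by a 1x1 point-wise convolution). The channel width of 128 and kernel size of 5 are illustrative assumptions, not the paper's settings.

```python
def conv1d_params(c_in: int, c_out: int, k: int, bias: bool = True) -> int:
    """Parameters of a standard 1D convolution."""
    return c_in * c_out * k + (c_out if bias else 0)


def dw_separable_conv1d_params(c_in: int, c_out: int, k: int, bias: bool = True) -> int:
    """Parameters of a depth-wise conv (one filter per channel) plus a 1x1 point-wise conv."""
    depthwise = c_in * k + (c_in if bias else 0)
    pointwise = c_in * c_out + (c_out if bias else 0)
    return depthwise + pointwise


# Hypothetical layer: 128 -> 128 channels, kernel size 5.
standard = conv1d_params(128, 128, 5)
separable = dw_separable_conv1d_params(128, 128, 5)
print(standard, separable, round(standard / separable, 2))
```

Under these assumed settings the standard layer has 82,048 parameters and the separable one 17,280, a reduction of a bit under 5x for a single layer.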

- Model Training

Dataset : LJSpeech
- Target phoneme durations are obtained with the Montreal Forced Aligner (MFA)
- Ground-truth pitch and energy are computed using the STFT and the WORLD vocoder
Loss Function : $L = \alpha L_{mel} + \beta L_{p} + \gamma L_{e} + \lambda L_{d}$
- $L_{mel}$ : Mel-spectrogram loss function (L1 loss, weighted by $\alpha=10$)
- $L_{p}, L_{e}, L_{d}$ : MSE loss
- $\beta=2, \gamma=2, \lambda = 1$
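The weighted sum above can be written out directly. This is a minimal sketch on flattened Python lists, assuming mean-reduced L1 and MSE terms; the function names are made up for illustration, and details such as whether durations are compared in the log domain are not covered by this review and are left out.

```python
def l1_loss(pred: list, target: list) -> float:
    """Mean absolute error over flattened values."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


def mse_loss(pred: list, target: list) -> float:
    """Mean squared error over flattened values."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def total_loss(mel, pitch, energy, duration,
               alpha=10.0, beta=2.0, gamma=2.0, lam=1.0) -> float:
    """L = alpha*L_mel + beta*L_p + gamma*L_e + lam*L_d.

    Each argument is a (prediction, target) pair of equal-length lists.
    """
    return (alpha * l1_loss(*mel)
            + beta * mse_loss(*pitch)
            + gamma * mse_loss(*energy)
            + lam * mse_loss(*duration))


# Toy values: L_mel = 0.5, L_p = 4.0, L_e = 1.0, L_d = 0.0
loss = total_loss(
    mel=([1.0, 2.0], [0.0, 2.0]),
    pitch=([1.0], [3.0]),
    energy=([0.0], [1.0]),
    duration=([2.0], [2.0]),
)
print(loss)  # 10*0.5 + 2*4.0 + 2*1.0 + 1*0.0 = 15.0
```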

3. Experimental Results

EfficientSpeech uses only 266,000 parameters, so its FLOP count is low
- Thanks to the low FLOP count, it generates Mel-spectrograms quickly, reaching 953.3 mRTF (Mel real-time factor) on a V100 GPU
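For readers unfamiliar with the metric, a real-time factor can be computed as below. The sketch assumes the "higher is faster" convention (seconds of audio produced per second of compute), which is what a large value such as 953.3 being reported as fast suggests. The hop length of 256 and sample rate of 22,050 Hz are common LJSpeech settings and are assumptions here, as are the function names.

```python
def audio_seconds_from_frames(n_frames: int, hop_length: int = 256,
                              sample_rate: int = 22050) -> float:
    """Audio duration covered by a Mel-spectrogram with n_frames frames."""
    return n_frames * hop_length / sample_rate


def real_time_factor(audio_seconds: float, compute_seconds: float) -> float:
    """Seconds of audio generated per second of compute (higher is faster)."""
    if compute_seconds <= 0:
        raise ValueError("compute_seconds must be positive")
    return audio_seconds / compute_seconds


# Hypothetical measurement: 441 Mel frames generated in 5 ms.
duration = audio_seconds_from_frames(441)
print(real_time_factor(duration, compute_seconds=0.005))
```

Under this convention, a value above 1 means the model generates Mel frames faster than the audio plays back.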

Compared with FastSpeech2, EfficientSpeech is 20.1 times faster

When speech quality is evaluated in terms of CMOS, EfficientSpeech shows no large gap despite its small model size

Adding a vocoder slows down the RTF
- HiFiGAN adds an overhead of 5.0 GFLOPS, whereas EfficientSpeech itself takes 0.09 GFLOPS
  - Even with the overhead-inducing vocoder attached, EfficientSpeech leaves enough RAM available
- Applying EfficientSpeech on low-cost, low-power devices is effective
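A quick back-of-the-envelope calculation with the two figures quoted above shows why the vocoder dominates end-to-end speed. The helper name `compute_share` is made up for illustration, and the calculation assumes the two GFLOPS figures are directly comparable.

```python
def compute_share(gflops_by_stage: dict) -> dict:
    """Return each stage's fraction of the total compute."""
    total = sum(gflops_by_stage.values())
    if total <= 0:
        raise ValueError("total compute must be positive")
    return {name: value / total for name, value in gflops_by_stage.items()}


# Figures quoted in this review: 0.09 GFLOPS (EfficientSpeech), 5.0 GFLOPS (HiFiGAN).
shares = compute_share({"EfficientSpeech": 0.09, "HiFiGAN": 5.0})
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```

With these numbers the vocoder accounts for roughly 98% of the combined compute, so further shrinking the acoustic model alone would change end-to-end latency very little.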
