[Paper Review] StreamSpeech: Low-Latency Neural Architecture For High-Quality On-Device Speech Synthesis

feVeRin 2023. 12. 13. 11:16

StreamSpeech: Low-Latency Neural Architecture For High-Quality On-Device Speech Synthesis

The inference latency and real-time factor (RTF) of Text-to-Speech (TTS) models remain too high for deployment without specialized hardware such as GPUs
StreamSpeech
- A TTS architecture that enables high-quality, real-time synthesis in resource-constrained environments using a single CPU
- Introduces a lightweight convolutional acoustic decoder that enables streaming and low-latency generation
Paper (ICASSP 2023) : Paper Link

1. Introduction

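Neural TTS models can synthesize high-quality speech but remain computationally expensive
- FastPitch, FastSpeech2, LPCNet, etc. improved TTS synthesis efficiency, but still require specialized hardware such as GPUs or TPUs
- High inference latency and RTF make deployment difficult in resource-constrained environments and on devices without specialized hardware
To improve the real-time performance of TTS models, research has focused on optimizing the acoustic model and vocoder networks
- LightSpeech uses Neural Architecture Search (NAS) to find an optimal lightweight architecture
  - Introduces depth-wise separable convolution (SepConv), which improves computational efficiency
- LPCNet variants such as DurIAN and FeatherWave
  - Introduce multi-band processing to synthesize speech in parallel
- Research on low-latency on-device TTS models is even more limited

-> StreamSpeech is therefore proposed as a TTS architecture optimized for high-quality, real-time synthesis in resource-constrained environments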

StreamSpeech
- Splits the TTS pipeline into 3 stages by operating resolution and optimizes each one separately
  1. Character-level (Text analysis, FastSpeech2 encoder)
  2. Frame-level (FastSpeech2 decoder)
  3. Sample-level (LPCNet)
- Introduces depth-wise separable convolution to reduce the computational complexity of the FastSpeech2 encoder
- Replaces the non-autoregressive FastSpeech2 decoder with a lightweight streamable convolutional decoder
- Introduces multi-band processing and hierarchical softmax to improve LPCNet efficiency

< Overview of StreamSpeech >

An on-device TTS model based on FastSpeech2 and LPCNet
Improves vocoder efficiency by running in parallel across multiple threads, without using recurrent networks or attention

2. Baseline Architecture

Text Analysis
- The Text Analysis module performs text normalization and verbalization
  - Uses contextual rules and dictionaries
- Grapheme-to-phoneme conversion combines a Transformer with rules
- The rules and dictionaries are organized as a cascade of 26 finite-state transducers
Acoustic Model
- Based on the FastSpeech2 architecture
- Encoder: 4 Feed-Forward Transformer (FFT) blocks plus duration / pitch / energy predictors
- Decoder: 4 FFT blocks plus a spectrogram projection
Vocoder
- Based on the LPCNet architecture
- A stack of 5 $1 \times 3$ 1D convolutions with tanh activation

3. StreamSpeech Architecture

- Performance Analysis

Latency and RTF of the baseline are profiled
- Total latency is measured at 0.55 s on an x86 CPU and 2.2 s on an A76 CPU

The baseline's high latency comes from text analysis and the FastSpeech2 encoder/decoder
- They rely on a global attention mechanism, which is hard to stream
- In particular, 3/4 of the latency comes from the FastSpeech2 decoder, which operates at the frame level
  -> The goal is therefore to reduce latency by making the FastSpeech2 decoder streamable
In terms of inference speed, 3/4 of the computational load comes from the vocoder
-> Inference speed can therefore be improved by making the LPCNet vocoder more efficient

Latency and RTF profiling of the baseline architecture
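
For reference, the two metrics profiled here are commonly computed as sketched below: latency as the time until the first audio chunk is available, and RTF as total synthesis time divided by the duration of the generated audio. This is a minimal sketch rather than the paper's measurement code. The function name `measure_latency_and_rtf`, the chunk-yielding `stream` argument, and the injectable `clock` are all assumptions made for illustration.

```python
import time
from typing import Callable, Iterable, List, Tuple


def measure_latency_and_rtf(
    stream: Iterable[List[float]],
    sample_rate: int,
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[float, float]:
    """Return (latency_seconds, rtf) for an iterable that yields audio chunks.

    `stream` is a hypothetical stand-in for a streaming TTS system: each item
    is a list of audio samples. Pass a lazy generator so that synthesis work
    happens while it is being consumed and is therefore timed.
    """
    start = clock()
    first_chunk_time = None
    total_samples = 0
    for chunk in stream:
        if first_chunk_time is None:
            first_chunk_time = clock()  # first audio is available here
        total_samples += len(chunk)
    end = clock()

    if first_chunk_time is None or total_samples == 0:
        raise ValueError("stream produced no audio")

    latency = first_chunk_time - start
    audio_seconds = total_samples / sample_rate
    rtf = (end - start) / audio_seconds  # RTF < 1 means faster than real time
    return latency, rtf
```

With this definition, a non-streaming system has a latency close to its total synthesis time, while a streaming system can have a low latency even when its RTF is unchanged.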

- Acoustic Model Optimization

Encoder
- The FastSpeech2 encoder uses multi-head self-attention to model speech prosody and capture long-term dependencies
  - The non-streaming architecture is kept so that the prosody of long utterances can be modeled well
  - For efficiency, the FFT blocks and predictors are replaced with depth-wise separable convolutions
- When the encoder components are replaced with depth-wise separable convolutions,
  - The parameter count drops by 3.8x and RTF drops by 3.3x
- Because the replaced encoder operates at the low character-level resolution, no further optimization or streaming is needed
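
To see why depth-wise separable convolution cuts parameters, the sketch below counts weights for a single standard 1D convolution against its separable counterpart (a per-channel depth-wise convolution followed by a 1x1 point-wise convolution). The layer sizes (256 channels, kernel size 9) are hypothetical values chosen for illustration, not taken from the paper.

```python
def conv1d_params(c_in: int, c_out: int, kernel: int, bias: bool = True) -> int:
    """Parameter count of a standard 1D convolution."""
    return c_in * c_out * kernel + (c_out if bias else 0)


def sepconv1d_params(c_in: int, c_out: int, kernel: int, bias: bool = True) -> int:
    """Parameter count of a depth-wise separable 1D convolution."""
    # Depth-wise: one kernel per input channel, no mixing across channels.
    depthwise = c_in * kernel + (c_in if bias else 0)
    # Point-wise: 1x1 convolution that mixes channels.
    pointwise = c_in * c_out + (c_out if bias else 0)
    return depthwise + pointwise


# Hypothetical layer sizes, for illustration only.
c_in, c_out, kernel = 256, 256, 9
standard = conv1d_params(c_in, c_out, kernel)
separable = sepconv1d_params(c_in, c_out, kernel)
print(f"standard={standard}, separable={separable}, ratio={standard / separable:.2f}x")
```

The single-layer ratio should not be expected to match the 3.8x figure reported for the whole encoder, since that figure covers every component of the encoder and not just one convolution.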

Latency and RTF profiling of the full StreamSpeech architecture

Decoder
- When depth-wise separable convolutions are substituted in the same way as for the encoder,
  - Latency and RTF improve by more than 2x, but overall system latency on A76 is still above 1 s
  - The decoder therefore needs further architectural changes
- In FastSpeech2, the decoder models the coarticulation phenomena of speech
  - Coarticulation is a local phenomenon, so it does not require global information
  - The FFT blocks, which extract global information through attention, are replaced with convolution blocks that capture local context
- The StreamSpeech decoder uses the convolution module and feed-forward block from Conformer
  - These effectively capture relative offset-based local correlations in natural speech
- In addition, because the decoder consists only of feed-forward and convolution layers, it can run in a streaming fashion
  - Decoder latency is the time from receiving the encoder output to producing the first mel-spectrogram
  - The streaming rate is parameterized to balance latency against RTF
  -> With $rate=6$, the best RTF was achieved without a large latency increase over $rate=1$
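
The reason a convolution-only decoder can stream is that each output depends on a bounded window of inputs, so frames can be emitted chunk by chunk as long as the required context is cached between chunks. The sketch below illustrates this with a single-channel causal convolution and a `rate` parameter for the chunk size. It is a simplified illustration under assumptions: the causal padding, the scalar frames, and the names `causal_conv1d` and `stream_conv1d` are not from the paper, and the actual decoder may use look-ahead context.

```python
from typing import Iterator, List


def causal_conv1d(x: List[float], kernel: List[float], history: List[float]) -> List[float]:
    """y[t] = sum_k kernel[k] * x[t - k], where `history` supplies x[t] for t < 0.

    `history` must hold len(kernel) - 1 past inputs, oldest first.
    """
    padded = history + x
    offset = len(history)
    return [
        sum(kernel[k] * padded[offset + t - k] for k in range(len(kernel)))
        for t in range(len(x))
    ]


def stream_conv1d(frames: List[float], kernel: List[float], rate: int) -> Iterator[List[float]]:
    """Yield outputs `rate` frames at a time, caching only the needed context."""
    ctx = len(kernel) - 1
    history = [0.0] * ctx
    for start in range(0, len(frames), rate):
        chunk = frames[start:start + rate]
        yield causal_conv1d(chunk, kernel, history)
        # Keep just the last `ctx` inputs for the next chunk.
        history = (history + chunk)[-ctx:] if ctx else []


frames = [float(i) for i in range(20)]
kernel = [0.5, 0.3, 0.2]
full = causal_conv1d(frames, kernel, [0.0, 0.0])
streamed = [y for chunk in stream_conv1d(frames, kernel, rate=6) for y in chunk]
assert streamed == full  # chunked output matches full-sequence output
```

A smaller `rate` makes the first chunk available sooner, while a larger `rate` amortizes per-call overhead over more frames, which is the latency/RTF trade-off the streaming rate controls.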

Latency and RTF profiling of the StreamSpeech decoder

Vocoder Optimization
- The LPCNet vocoder predicts the audio signal sample by sample, so audio generation is slow
  - A multi-band parallel generation scheme is introduced, predicting the excitation of every subband through multiple dual fully-connected and softmax layers
- Profiling multi-band LPCNet shows that 45% of the computational load is concentrated in the dual fully-connected layers and softmax activations
  - A hierarchical softmax approach represents each frequency band's output distribution as an 8-level binary tree, with each branch probability computed by a sigmoid
  -> The number of dual fully-connected layers evaluated during sampling drops from 256 to 8
- With hierarchical softmax, RTF improves by 36% on x86 and 44% on A76
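
The 256-to-8 reduction can be sketched as follows: a 256-way distribution is represented as a depth-8 binary tree, and sampling walks from the root to a leaf, evaluating one sigmoid per level. This is a generic hierarchical-softmax sketch, not the paper's implementation. The heap-style node numbering and the `logit_fn` callable (standing in for the network output at a tree node) are assumptions.

```python
import math
import random
from typing import Callable


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def sample_hierarchical(logit_fn: Callable[[int], float], depth: int = 8, rng=random) -> int:
    """Sample a leaf in [0, 2**depth) using only `depth` logit evaluations.

    Internal nodes are numbered heap-style: root is 1, children of n are 2n and 2n+1.
    `logit_fn(node)` is a hypothetical stand-in for the network output at that node.
    """
    node = 1
    for _ in range(depth):
        p_right = sigmoid(logit_fn(node))  # probability of taking the "1" branch
        bit = 1 if rng.random() < p_right else 0
        node = 2 * node + bit
    return node - (1 << depth)


def leaf_probability(logit_fn: Callable[[int], float], leaf: int, depth: int = 8) -> float:
    """Probability of a leaf: the product of branch probabilities along its path."""
    node, prob = 1, 1.0
    for level in reversed(range(depth)):
        bit = (leaf >> level) & 1  # most significant bit first, matching sampling
        p_right = sigmoid(logit_fn(node))
        prob *= p_right if bit else 1.0 - p_right
        node = 2 * node + bit
    return prob


# Arbitrary logits for illustration; a real model would compute these from its state.
value = sample_hierarchical(lambda node: math.sin(node), rng=random.Random(0))
assert 0 <= value < 256
```

Because every internal node splits its probability mass between two children, the leaf probabilities form a valid distribution without ever computing a full 256-way softmax.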

4. Experiment

- Performance Evaluation

StreamSpeech achieves a latency of 79 ms (0.155 RTF) on x86 and 276 ms (0.289 RTF) on A76
Comparing latency against utterance duration, StreamSpeech maintains low latency even for long utterances

- Quality Evaluation

In terms of Mel Cepstral Distortion (MCD), StreamSpeech (labeled HSM in the results table) matches the baseline, while F0 improves
- This is because the vocoder's multi-band processing adds an extra loss term for the lowest band
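
For readers unfamiliar with the metric, the sketch below computes MCD using the commonly cited definition, $\frac{10}{\ln 10}\sqrt{2\sum_d (c_d - \hat{c}_d)^2}$ per frame, averaged over frames. The review does not state which MCD variant the paper uses (number of coefficients, frame alignment method), so the function assumes the frames are already time-aligned and that the caller has dropped the 0th (energy) coefficient. The name `mel_cepstral_distortion` is illustrative.

```python
import math
from typing import Sequence


def mel_cepstral_distortion(
    ref: Sequence[Sequence[float]], syn: Sequence[Sequence[float]]
) -> float:
    """Mean MCD in dB over time-aligned frames of mel-cepstral coefficients."""
    if len(ref) != len(syn) or len(ref) == 0:
        raise ValueError("ref and syn must have the same non-zero number of frames")
    const = 10.0 / math.log(10.0) * math.sqrt(2.0)
    total = 0.0
    for r, s in zip(ref, syn):
        if len(r) != len(s):
            raise ValueError("frames must have the same number of coefficients")
        # Euclidean distance between the two coefficient vectors of this frame.
        total += const * math.sqrt(sum((a - b) ** 2 for a, b in zip(r, s)))
    return total / len(ref)
```

Lower values indicate that the synthesized spectral envelope is closer to the reference; identical inputs give 0 dB.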
