[Paper Review] SpeedySpeech: Efficient Neural Speech Synthesis

feVeRin 2024. 1. 17. 12:33

SpeedySpeech: Efficient Neural Speech Synthesis

Neural Text-to-Speech has greatly improved speech synthesis quality, but training and inference remain slow
SpeedySpeech
- A student-teacher network with low compute requirements that enables fast spectrogram synthesis
- Builds on the observation that self-attention layers are not needed to generate high-quality audio
- Uses simple convolutions with residual connections, and applies an attention layer only in the teacher model
Paper (INTERSPEECH 2020) : Paper Link

1. Introduction

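Recent neural Text-to-Speech (TTS) models have substantially improved speech quality, but they generally require a lot of resources
- In particular, sequence-to-sequence models such as Tacotron2 are hard to run in real time
- A trade-off between fast training/inference and output quality still remains

-> SpeedySpeech is therefore proposed to improve the training/inference efficiency of TTS models while maintaining synthesis quality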

SpeedySpeech
- Adopts a teacher-student network similar to FastSpeech
  - Instead of a Transformer, uses a simpler convolutional teacher model with a single attention layer
- The teacher network is an autoregressive convolutional network that extracts accurate alignments between phonemes and the corresponding audio frames
- The student network is a non-autoregressive, fully convolutional network that encodes the input phonemes, predicts the duration of each one, and decodes a mel-scale spectrogram

< Overview of SpeedySpeech >

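Simplifies the teacher-student architecture of FastSpeech to support fast and stable training
Shows that self-attention layers are unnecessary in the student network
Introduces data augmentation that can be used to train the teacher network
As a result, achieves strong synthesis quality while keeping training and inference fast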

2. Method

SpeedySpeech takes phonemes as input and outputs a log mel-spectrogram

- Teacher Network - Duration Extraction

The teacher network builds on Deep Voice 3 and DCTTS to extract phoneme durations from the data
- It consists of four parts: Phoneme encoder, Spectrogram encoder, Attention, and Decoder
- The teacher model aims to predict the next spectrogram frame from the input phonemes and the previous frames
  - Attention is used to keep track of which phoneme is currently being generated,
  - and the attention values are used to align phonemes with the spectrogram and extract phoneme durations

Phoneme Encoder
- The phoneme encoder starts with an embedding, a fully-connected layer, and a ReLU activation,
- followed by gated residual blocks containing dilated non-causal convolutions
  - The skip connections of the gated residual blocks sum the outputs of all layers into the encoder output
- Here, WaveNet-style convolutional residual blocks are used instead of the highway blocks of DCTTS

Spectrogram Encoder
- The spectrogram encoder provides a contextual encoding of each spectrogram frame that takes previous frames into account
- Structurally,
  - a fully-connected layer and ReLU are applied to each frame of the input spectrogram,
  - then gated residual blocks with dilated causal convolutions, which see only previous frames, are applied,
  - and finally the skip connections are summed into the final output
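To make the "sees only previous frames" property concrete, here is a minimal single-channel sketch of a dilated causal convolution in plain Python. The function name, the single-channel simplification, and the implicit zero-padding at the start of the sequence are assumptions made for illustration; the actual teacher layers also include gating, residual connections, and multiple channels, none of which are shown here.

```python
def causal_dilated_conv1d(x, kernel, dilation=1):
    """Hypothetical helper: 1D causal convolution over a single channel.

    x:       list of floats, one value per time step
    kernel:  list of weights; the last weight is applied to the current step
    Output at step t depends only on x[t], x[t - dilation], x[t - 2*dilation], ...
    """
    k_size = len(kernel)
    out = []
    for t in range(len(x)):
        acc = 0.0
        for k in range(k_size):
            idx = t - (k_size - 1 - k) * dilation
            if idx >= 0:  # positions before the start are treated as zeros
                acc += kernel[k] * x[idx]
        out.append(acc)
    return out


signal = [1.0, 2.0, 3.0, 4.0]
print(causal_dilated_conv1d(signal, kernel=[1.0, 1.0], dilation=2))
```

Because no output position reads from a later time step, changing a future frame leaves all earlier outputs unchanged, which is what allows the teacher to be trained in parallel while still behaving autoregressively.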
Attention
- Dot-product attention is used
  - Key : output of the Phoneme encoder
  - Query : output of the Spectrogram encoder
  - Value : sum of the phoneme embedding and the Phoneme encoder output
- Keys and queries are pre-conditioned with positional encodings and passed through the same linear layer
  - This biases the attention toward monotonicity
- The attention score is a weighted average of the value vectors, weighted by how well each key matches the given query
  - This lets the model learn to select the phonemes relevant for predicting the next spectrogram frame
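The key/query/value roles above can be sketched with a small, dependency-free dot-product attention. This is only an illustration under simplifying assumptions: the function names are hypothetical, no scaling factor is applied, and the positional encoding and shared linear layer described above are omitted.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot_product_attention(queries, keys, values):
    """queries: T x d, keys: N x d, values: N x d_v.

    Returns (context, weights): context is T x d_v, weights is T x N.
    Each context row is a weighted average of the value vectors, with
    weights given by how well each key matches the query.
    """
    context, weights = [], []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        w = softmax(scores)
        weights.append(w)
        context.append(
            [sum(wn * v[j] for wn, v in zip(w, values)) for j in range(len(values[0]))]
        )
    return context, weights


# Two phonemes (keys/values) and one spectrogram frame (query).
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0]]
context, weights = dot_product_attention([[10.0, 0.0]], keys, values)
print(weights[0])
```

In this toy case the query points almost entirely at the first key, so nearly all of the weight lands on the first phoneme.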
Decoder
- The decoder sums the attention score with the encoder output for better gradient flow
- The sum is passed through gated residual blocks, convolution layers with ReLU activations, and a Sigmoid prediction layer
Training
- The target spectrogram is shifted one position to the left relative to the input,
  - so the model is trained to predict the next spectrogram frame from the input phonemes and the previous frames
  - Unlike Tacotron2, the network keeps no hidden state, so predictions for all time steps can be computed in parallel
  - The log mel-spectrogram is rescaled to the $[0,1]$ interval so that a Sigmoid activation can be applied in the last layer
- Training minimizes the sum of the Mean Absolute Error (MAE) between the target and predicted spectrograms and the guided attention loss, which encourages monotonic alignment
- For an attention matrix $A \in \mathbb{R}^{N \times T}$, the guided attention loss is:
  $GuidedAtt(A) = \frac{1}{NT} \sum_{n=1}^{N} \sum_{t=1}^{T} A_{n,t}W_{n,t}$
  - $W_{n,t} = 1- \exp \left ( - \frac{(n/N-t/T)^{2}}{2g^{2}} \right )$ : penalty matrix
  - $N$ : number of phonemes, $T$ : number of spectrogram frames
  - $g$ : controls the loss contribution of the off-diagonal entries $A_{n,t}$
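The guided attention loss above translates almost directly into code. The sketch below follows the formula term by term; the function name and the default value `g=0.2` are assumptions for illustration, since these notes do not state the value actually used.

```python
import math


def guided_attention_loss(A, g=0.2):
    """A: N x T attention matrix (phonemes x frames), as nested lists.

    Computes (1 / NT) * sum_n sum_t A[n][t] * W[n][t] with
    W[n][t] = 1 - exp(-((n/N - t/T)^2) / (2 g^2)), using 1-based n and t.
    g=0.2 is an assumed default, not a value taken from the paper notes.
    """
    N, T = len(A), len(A[0])
    total = 0.0
    for n in range(1, N + 1):
        for t in range(1, T + 1):
            penalty = 1.0 - math.exp(-((n / N - t / T) ** 2) / (2.0 * g * g))
            total += A[n - 1][t - 1] * penalty
    return total / (N * T)


diagonal = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
reversed_alignment = [[0.0, 0.0, 1.0], [0.0, 1.0, 0.0], [1.0, 0.0, 0.0]]
print(guided_attention_loss(diagonal), guided_attention_loss(reversed_alignment))
```

A perfectly diagonal alignment incurs no penalty, while attention mass far from the diagonal is penalized; a smaller $g$ makes the penalty rise more sharply away from the diagonal.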
Data Augmentation
- To improve robustness to error propagation, three data augmentations are applied to the input spectrogram
  1. A small amount of Gaussian noise is added to each spectrogram pixel
  2. Model outputs are simulated by feeding the input spectrogram through the network in parallel mode without gradient updates
    - The resulting spectrogram is degraded compared to the ground-truth spectrogram
    - The process is then repeated to approximate sequentially generated spectrograms
    - Since the model struggles to generate sequential frames correctly early in training, this improves the robustness of sequential generation
  3. Some frames are replaced with random frames
    - This encourages the model to use frames that are further back in time
    - As a result, it avoids overfitting to only the most recent frames and improves stability by also using older information
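Augmentations 1 and 3 are simple enough to sketch without a model. The helper names, the noise level `std`, the replacement probability `prob`, and the choice of uniform values in $[0,1)$ as "random frames" are all assumptions for illustration (the last one relies on the spectrogram being rescaled to $[0,1]$); the notes above do not give these details. Augmentation 2 requires the trained network itself and is not shown.

```python
import random


def add_gaussian_noise(spec, std=0.01, rng=random):
    """Augmentation 1: add small Gaussian noise to every spectrogram value.

    spec is a list of frames, each frame a list of floats. std is an assumed value.
    """
    return [[x + rng.gauss(0.0, std) for x in frame] for frame in spec]


def replace_random_frames(spec, prob=0.1, rng=random):
    """Augmentation 3: replace each frame with a random one with probability prob.

    Random frames are drawn uniformly from [0, 1), which is an assumption.
    """
    out = []
    for frame in spec:
        if rng.random() < prob:
            out.append([rng.random() for _ in frame])
        else:
            out.append(list(frame))
    return out


rng = random.Random(0)  # seeded generator for reproducibility
spec = [[0.5] * 4 for _ in range(10)]  # 10 frames, 4 mel bins
augmented = replace_random_frames(add_gaussian_noise(spec, rng=rng), rng=rng)
print(len(augmented), len(augmented[0]))
```

Both helpers return a new spectrogram of the same shape, so they can be chained on the teacher's input while the prediction target stays clean.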
Inference / Duration Extraction
- Location masking is applied to the attention positions to prevent phoneme skipping and enforce monotonic alignment
- Inference for duration extraction is run in teacher-forcing mode
  - The model is fed the ground truth to prevent error propagation and extract stable alignments
- From the resulting attention matrix, the index of the most likely phoneme is computed at each time step, and the number of occurrences of each index over time gives the duration of each phoneme
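The duration extraction step described above (argmax over phonemes per frame, then counting) can be sketched as follows. The function name and the tiny attention matrix are made up for illustration, and the location masking mentioned above is not included.

```python
def extract_durations(A):
    """A: N x T attention matrix (phonemes x frames), as nested lists.

    For each frame, pick the phoneme with the highest attention weight,
    then count how many frames were assigned to each phoneme.
    """
    num_phonemes, num_frames = len(A), len(A[0])
    durations = [0] * num_phonemes
    for t in range(num_frames):
        best = max(range(num_phonemes), key=lambda n: A[n][t])
        durations[best] += 1
    return durations


# Toy example: 3 phonemes, 5 frames.
attention = [
    [0.9, 0.8, 0.1, 0.0, 0.0],
    [0.1, 0.2, 0.7, 0.6, 0.2],
    [0.0, 0.0, 0.2, 0.4, 0.8],
]
print(extract_durations(attention))  # → [2, 2, 1]
```

The durations always sum to the number of frames, which is what lets the student expand phoneme encodings to exactly the spectrogram length.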

- Student Network - Spectrogram Synthesis

The student model uses the alignments and spectrograms predicted by the teacher model above
- Given the input phonemes, the student model aims to predict the duration of each phoneme and then predict the full mel-spectrogram based on those durations
- It consists of three parts: Phoneme encoder, Duration predictor, and Decoder
  - All three modules are built from dilated residual convolution blocks
  - Each residual block consists of a 1D convolution, a ReLU activation, and temporal batch normalization, with a residual connection
  - In effect, the student model replaces the attention of FastSpeech with convolutions and uses temporal batch normalization instead of layer normalization
- The phoneme encodings produced by the encoder are passed to the Duration predictor, which predicts the duration of each phoneme in the logarithmic domain through convolutions and a linear layer
- The phoneme encoding vectors are expanded according to the predicted durations so that they match the size of the output spectrogram
- As in FastSpeech, positional encodings are added, but they are reset for each phoneme
  - This is because it is more useful for the network to distinguish frame positions within the context of a single phoneme than within the full sentence
- The Decoder then converts the expanded phoneme encodings with positional encodings into the individual frames of the mel-spectrogram
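The expansion step with per-phoneme position reset can be sketched as below. The function names are hypothetical, and the sinusoidal form of the positional encoding is an assumption borrowed from the standard Transformer formulation; the notes above do not specify which encoding is used.

```python
import math


def positional_encoding(pos, dim):
    """Sinusoidal positional encoding for one position (assumed form)."""
    enc = []
    for i in range(dim):
        angle = pos / (10000 ** ((i - i % 2) / dim))
        enc.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return enc


def expand_with_reset_positions(encodings, durations):
    """Repeat each phoneme encoding for its duration (in frames) and add a
    positional encoding whose position counter restarts at 0 for every phoneme."""
    frames = []
    for enc, dur in zip(encodings, durations):
        for pos in range(dur):  # pos restarts for each phoneme
            pe = positional_encoding(pos, len(enc))
            frames.append([e + p for e, p in zip(enc, pe)])
    return frames


encodings = [[1.0, 0.0], [0.0, 1.0]]  # two phonemes, 2-dim encodings
frames = expand_with_reset_positions(encodings, durations=[2, 3])
print(len(frames))  # → 5
```

With a sentence-level counter the first frame of the second phoneme would receive position 2; with the reset it receives position 0, so the decoder sees where a frame sits inside its own phoneme.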

Training
- The sum of MAE and structural similarity index (SSIM) losses is used for log mel-spectrogram value regression, and Huber loss is used for log-duration prediction
- For phoneme encoding expansion, the ground-truth durations extracted by the teacher model are used during training
  - The target log mel-spectrogram is normalized to zero mean and unit variance
- Unlike FastSpeech, SpeedySpeech detaches the gradient flow from the duration predictor to the encoder
  - This improves spectrogram prediction and prevents the duration predictor from overfitting
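As a small illustration of the duration objective, the sketch below computes a Huber loss between predicted log-durations and the log of teacher-extracted durations. The function names, the threshold `delta=1.0`, and the assumption that every duration is at least one frame are illustrative choices, not details taken from the paper notes; the gradient detachment between the duration predictor and the encoder needs an autograd framework and is not shown.

```python
import math


def huber(pred, target, delta=1.0):
    """Huber loss for one value: quadratic near zero, linear for large errors.

    delta=1.0 is an assumed threshold.
    """
    err = abs(pred - target)
    if err <= delta:
        return 0.5 * err * err
    return delta * (err - 0.5 * delta)


def log_duration_loss(pred_log_durations, durations):
    """Mean Huber loss between predicted log-durations and log(duration).

    Assumes every duration is >= 1 frame; a real pipeline needs some
    convention for phonemes that were assigned zero frames.
    """
    targets = [math.log(d) for d in durations]
    losses = [huber(p, t) for p, t in zip(pred_log_durations, targets)]
    return sum(losses) / len(losses)


print(huber(0.5, 0.0), huber(3.0, 0.0))  # → 0.125 2.5
```

Working in the log domain keeps long phonemes from dominating the loss, and the linear tail of the Huber loss limits the influence of occasional badly extracted durations.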

3. Experiments

- Settings

Dataset : LJSpeech
Comparisons : Deep Voice 3, Tacotron2

- Results

Voice Quality
- SpeedySpeech reaches a mean score of 75.24, showing better synthesis quality than the compared Tacotron2 and Deep Voice 3
- SpeedySpeech outputs show fewer pronunciation mistakes and more consistent intonation than the other models

Inference Speed
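- In terms of inference speed, SpeedySpeech can synthesize 9.72 seconds of audio within 197 ms on a GPU
  - In the same environment it is 8.8 times faster than Tacotron2, and spectrogram generation is 48.5 times faster
- With batching, $16 \times 9.72 = 155.52$ seconds of audio can be synthesized in 4.27 seconds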

Training Time
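- The teacher model is smaller than the student, but it needs a smaller learning rate to converge to good results, which slows down training
- The student model is larger, but its architecture is simpler and it contains no hard-to-train components such as attention, so it converges more easily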
