[Paper Review] DeviceTTS: A Small-Footprint, Fast, Stable Network for On-device Text-to-Speech

feVeRin 2024. 6. 17. 09:09

DeviceTTS: A Small-Footprint, Fast, Stable Network for On-device Text-to-Speech

Existing text-to-speech models rely on large, complex networks, so a model suited to on-device text-to-speech that can be deployed smoothly is needed
DeviceTTS
- Uses a duration predictor as the bridge between the encoder and decoder
- Adopts the Deep Feedforward Sequential Memory Network (DFSMN) to reduce model size
- Additionally adopts a mix-resolution decoder to speed up inference
Paper (ICASSP 2021) : Paper Link

1. Introduction

Neural Text-to-Speech (TTS) achieves excellent synthesis quality, but depends on complex networks that are hard to train and hard to run for real-time synthesis
- In particular, on-device TTS is becoming increasingly important for human-computer interaction
  1. TTS models such as Tacotron use an encoder-decoder architecture with an attention mechanism
  2. BUT, when the alignment is not learned well, word skipping or repeating can occur
    - As a result, compressing an existing TTS model for on-device use leaves too few parameters for alignment, so robustness degrades considerably
- FastSpeech, on the other hand, is a non-autoregressive model that uses Feed-Forward Transformer blocks in its encoder and decoder
  - To solve Tacotron's word skipping problem, a phoneme duration predictor is introduced as the bridge between the encoder and decoder
  - BUT, on-device, FastSpeech generalizes poorly to utterances longer than the maximum length in the training set

-> Hence the paper proposes DeviceTTS, a small-footprint, fast, stable network for on-device TTS

DeviceTTS
- Building on FastSpeech, uses a duration predictor as the bridge between the encoder and decoder to prevent word skipping and repeating
- Adopts the Deep Feedforward Sequential Memory Network (DFSMN) to build a small feedforward neural network
  - With DFSMN blocks, high-quality speech with satisfactory prosody can be synthesized even with limited parameters
- To improve inference speed, applies multi-frame prediction in the decoder so that each step synthesizes $r\,\,(r>1)$ acoustic feature frames
  1. The coarser-grained acoustic features of multi-frame prediction can produce unnatural speech when model parameters are limited
    - The paper therefore adopts a mix-resolution decoder to address this
  2. In the mix-resolution decoder, the multi-frame output is reshaped and passed to a refine network that performs single-frame prediction
    - The refine network models the acoustic features at a finer granularity
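
To make the speed-up from multi-frame prediction concrete, the number of autoregressive decoder steps can be counted as below. This is a minimal sketch: the function name `decoder_steps` and the 500-frame utterance are hypothetical, and only the frames-per-step idea comes from the paper.

```python
import math


def decoder_steps(num_frames: int, r: int) -> int:
    """Number of autoregressive steps when each step emits r frames."""
    if r < 1:
        raise ValueError("r must be at least 1")
    # The last step may be only partially used, hence the ceiling.
    return math.ceil(num_frames / r)


# Hypothetical 500-frame utterance: single-frame decoding vs. r = 3.
print(decoder_steps(500, 1))  # 500
print(decoder_steps(500, 3))  # 167
```

With $r=3$ the sequential part of decoding runs roughly three times fewer steps, at the cost of coarser per-step outputs.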

< Overview of DeviceTTS >

A duration predictor serves as the bridge between the encoder and decoder
DFSMN is adopted to reduce model size, and a mix-resolution decoder is adopted to speed up inference
As a result, DeviceTTS achieves synthesis quality on par with Tacotron and FastSpeech with only 0.099 GFLOPS

2. Method

DeviceTTS consists of an encoder, a duration predictor, a length regulator, and a decoder
- The encoder extracts a robust sequential representation of the text
  1. The input is obtained by representing each character/phoneme as a one-hot vector and embedding it into a continuous vector
  2. The encoder output is passed to the duration predictor, which gives the number of frames for each input character/phoneme
  3. The length regulator (LR) then expands the encoder output by the predicted number of frames
  4. Finally, the decoder generates acoustic features from the expanded representation
- DeviceTTS is trained by combining an acoustic feature loss and a phone duration loss:
  (Eq. 1) $\mathcal {L}=\mathcal {L}_{aco}+\mathcal {L}_{dur}$
  - The paper uses the Mean Absolute Error (MAE) loss
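
The length regulator and the loss in (Eq. 1) can be sketched in a few lines of plain Python. This is an illustration under assumptions, not the paper's implementation: the function names are hypothetical, durations are taken to be non-negative integers, MAE is applied to both terms, and the acoustic features are passed as flat lists.

```python
from typing import List, Sequence

Vector = List[float]


def length_regulate(encoder_out: Sequence[Vector],
                    durations: Sequence[int]) -> List[Vector]:
    """Repeat the i-th encoder vector durations[i] times (frame-level expansion)."""
    if len(encoder_out) != len(durations):
        raise ValueError("need one duration per encoder vector")
    expanded: List[Vector] = []
    for vec, d in zip(encoder_out, durations):
        # Negative predictions are clamped to zero frames (assumption).
        expanded.extend(list(vec) for _ in range(max(d, 0)))
    return expanded


def mae(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean absolute error between two equal-length sequences."""
    if len(pred) != len(target) or not pred:
        raise ValueError("sequences must be non-empty and of equal length")
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


def total_loss(aco_pred, aco_target, dur_pred, dur_target) -> float:
    """Eq. 1: L = L_aco + L_dur, with MAE assumed for both terms."""
    return mae(aco_pred, aco_target) + mae(dur_pred, dur_target)


# Three phoneme vectors expanded to 2 + 1 + 3 = 6 frames.
enc = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
frames = length_regulate(enc, [2, 1, 3])
print(len(frames))  # 6
```

Because the expansion is driven by explicit durations rather than learned attention, every input symbol is covered exactly as many frames as predicted, which is what rules out skipping and repeating.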

- DFSMN Block

The DFSMN block is the core component of the encoder, duration predictor, and decoder
- A DFSMN block is a standard feedforward neural network with a memory block in its hidden layer
  1. The memory block encodes the output of the previous hidden layer and the past history of the current layer into a fixed-size representation
    - Through this memory block, DFSMN can learn long-term dependencies without recurrent feedback
  2. Skip connections are additionally adopted, so that during back-propagation the gradient of a higher layer is passed directly to lower layers, overcoming the gradient-vanishing problem
- The DFSMN block is formulated as follows:
  (Eq. 2) $p_{t}^{l}=f(V^{l}h_{t}^{l-1}+b_{v}^{l})$
  (Eq. 3) $\tilde{h}_{t}^{l}=U^{l}p_{t}^{l}+b_{u}^{l}$
  (Eq. 4) $\hat{h}_{t}^{l}=\tilde{h}_{t}^{l}+\sum_{i=0}^{N_{1}}a_{i}^{l}\odot \tilde{h}_{t-i}^{l}+\sum_{j=1}^{N_{2}}c_{j}^{l}\odot \tilde{h}_{t+j}^{l}$
  (Eq. 5) $h_{t}^{l}=h_{t}^{l-1}+\hat{h}_{t}^{l}$
  - $p_{t}^{l}$ : output of the affine transform at time $t$, $V^{l}, b_{v}^{l}$ : its weight and bias
  - $h_{t}^{l-1}, h_{t}^{l}$ : outputs of the $(l-1)$-th and $l$-th hidden layers, respectively
  - $\tilde{h}_{t}^{l}$ : projection output used for model compression, $U^{l}, b_{u}^{l}$ : its weight and bias
  - $\hat{h}_{t}^{l}$ : output of the current memory block, which carries context information
  - In (Eq. 4), $a_{i}^{l}, c_{j}^{l}$ are the look-back and look-ahead filters, and $N_{1}, N_{2}$ are their respective orders
- In DFSMN, the total latency depends on the look-ahead filter order $N_{2}$ of each memory block
  - The DFSMN block learns context within a latency control window and uses this local modeling to make the network stable
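
(Eq. 2) to (Eq. 5) can be traced step by step with a small plain-Python sketch. Several choices here are assumptions rather than details from the paper: $f$ is taken to be ReLU, frames outside the sequence are treated as zeros, the filters use stride 1, and the names `dfsmn_block` and `total_lookahead` are hypothetical.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def affine(W: Matrix, x: Vector, b: Vector) -> Vector:
    """Compute W x + b for a single frame."""
    return [sum(w * v for w, v in zip(row, x)) + b_k for row, b_k in zip(W, b)]


def relu(x: Vector) -> Vector:
    # Assumption: f in Eq. 2 is taken to be ReLU here.
    return [max(0.0, v) for v in x]


def dfsmn_block(h_prev: List[Vector], V: Matrix, b_v: Vector,
                U: Matrix, b_u: Vector,
                a: List[Vector], c: List[Vector]) -> List[Vector]:
    """One DFSMN block over a T x D sequence.

    a holds the look-back filters a_0..a_N1, c the look-ahead filters c_1..c_N2;
    U projects back to D dimensions so the skip connection in Eq. 5 is valid.
    """
    T = len(h_prev)
    # Eq. 2 (hidden affine + activation) followed by Eq. 3 (projection).
    h_tilde = [affine(U, relu(affine(V, h, b_v)), b_u) for h in h_prev]
    D = len(h_tilde[0])
    out: List[Vector] = []
    for t in range(T):
        h_hat = list(h_tilde[t])                 # first term of Eq. 4
        for i, a_i in enumerate(a):              # look-back, i = 0..N1
            if t - i >= 0:
                for d in range(D):
                    h_hat[d] += a_i[d] * h_tilde[t - i][d]
        for j, c_j in enumerate(c, start=1):     # look-ahead, j = 1..N2
            if t + j < T:
                for d in range(D):
                    h_hat[d] += c_j[d] * h_tilde[t + j][d]
        out.append([p + q for p, q in zip(h_prev[t], h_hat)])  # Eq. 5
    return out


def total_lookahead(n2_per_block: List[int]) -> int:
    """Future frames a stack of blocks depends on (stride 1 assumed)."""
    return sum(n2_per_block)


# Toy 1-D example with identity weights, a_0 = 0, a_1 = 1, c_1 = 1.
seq = [[1.0], [2.0], [3.0]]
print(dfsmn_block(seq, [[1.0]], [0.0], [[1.0]], [0.0], [[0.0], [1.0]], [[1.0]]))
# [[4.0], [8.0], [8.0]]
print(total_lookahead([2, 2, 2]))  # 6
```

The loop over `c` is the only place that touches future frames, so the sum of the look-ahead orders across stacked blocks bounds how far ahead the network must wait before emitting a frame.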

- Mix-Resolution Decoder

The Mix-Resolution (MR) decoder consists of an Autoregressive (AR) network and a refine network
- The AR network performs autoregressive prediction at multi-frame resolution ($r=3$)
  - Given the LR output, frames are selected by sampling at an interval equal to the multi-frame number $r$
  - The selected frames are then concatenated with the output of the Prenet that feeds the recurrent neural network
- The multi-frame output is then passed to the refine network, which performs single-frame prediction
  - The refine network models the acoustic features at a finer granularity
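
The two resolution changes around the AR network, sampling the LR output at interval $r$ and unfolding each multi-frame output back into single frames, can be sketched as follows. The AR network and refine network themselves are not modeled, and the function names, the choice of the first frame in each group, and the trimming of padded frames are assumptions made for illustration.

```python
from typing import List

Vector = List[float]


def select_frames(lr_out: List[Vector], r: int) -> List[Vector]:
    """Sample the length-regulated sequence at interval r (one vector per AR step).

    Assumption: the first frame of each group of r is the one kept.
    """
    if r < 1:
        raise ValueError("r must be at least 1")
    return lr_out[::r]


def unfold_multiframe(ar_out: List[Vector], r: int, num_frames: int) -> List[Vector]:
    """Split each AR output of size r * dim into r frames of size dim.

    The result is trimmed to num_frames, since the last step may overshoot.
    """
    frames: List[Vector] = []
    for step in ar_out:
        if len(step) % r != 0:
            raise ValueError("AR output size must be a multiple of r")
        dim = len(step) // r
        frames.extend(step[k * dim:(k + 1) * dim] for k in range(r))
    return frames[:num_frames]


# 7 length-regulated frames with r = 3 give 3 AR steps (indices 0, 3, 6).
lr_out = [[float(i)] for i in range(7)]
print(select_frames(lr_out, 3))  # [[0.0], [3.0], [6.0]]

# One AR step carrying 3 frames of dimension 2, unfolded for the refine network.
print(unfold_multiframe([[1.0, 2.0, 3.0, 4.0, 5.0, 6.0]], 3, 3))
# [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
```

The unfolded sequence has one entry per acoustic frame, which is the resolution at which the refine network operates.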

3. Experiments

- Settings

Dataset : Mandarin Speech Dataset
Comparisons : Tacotron, FastSpeech

- Results

In terms of MOS, DeviceTTS achieves synthesis quality comparable to the existing models

In terms of complexity, DeviceTTS has the lowest parameter count and GFLOPS

Ablation Study
- Removing the MR decoder causes a CMOS drop of 0.254
- Likewise, replacing the AR network with a non-AR network lowers CMOS and increases GFLOPS
