[Paper Review] FLY-TTS: Fast, Lightweight and High-Quality End-to-End Text-to-Speech Synthesis

feVeRin 2024. 7. 8. 09:37

FLY-TTS: Fast, Lightweight and High-Quality End-to-End Text-to-Speech Synthesis

Demand for fast, lightweight Text-to-Speech models is growing
FLY-TTS
- Replaces the decoder with ConvNeXt blocks that generate Fourier spectral coefficients, then applies the inverse STFT to synthesize the waveform
- Introduces grouped parameter-sharing in the text encoder and flow-based model to compress model size
- Additionally uses a large pre-trained WavLM for adversarial training to improve synthesis quality
Paper (INTERSPEECH 2024) : Paper Link

1. Introduction

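Text-to-Speech (TTS) aims to convert input text into a speech waveform
- Recent TTS models such as Glow-TTS, VITS, and NaturalSpeech have greatly improved the naturalness of synthesized speech
- BUT, these models have the following limitations in practical deployment:
  1. The model size is too large for deployment on edge/mobile devices
  2. Slow inference makes them hard to use in low-resource environments
  3. Larger models generally perform better, so there is a trade-off between model size and synthesis quality

-> FLY-TTS, a Fast, Lightweight, high-qualitY TTS model, is proposed to address these problems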

FLY-TTS
- Based on VITS, it aims to reduce inference time and model size while maintaining strong synthesis quality
  1. The HiFi-GAN decoder of VITS is the main bottleneck for inference speed, so ConvNeXt blocks are introduced to generate Fourier spectral coefficients, and the inverse STFT is applied to speed up raw waveform reconstruction
  2. Grouped parameter-sharing is applied to the text encoder and flow-based model to greatly reduce model size
- To mitigate the quality degradation caused by model compression, a large pre-trained WavLM is adopted as the discriminator for adversarial training
  - This provides the generator with self-supervised representations, which can improve speech quality

< Overview of FLY-TTS >

A lightweight TTS model that uses a ConvNeXt decoder generating Fourier coefficients and grouped parameter-sharing
As a result, it uses far fewer parameters and runs inference much faster while keeping synthesis quality on par with existing TTS models

2. Method

FLY-TTS is built on VITS, an end-to-end TTS model
- VITS is a conditional VAE that, given an input condition $c$, maximizes the variational lower bound of the log-likelihood $\log p_{\theta}(x|c)$ of the target data $x$:
  (Eq. 1) $\log p_{\theta}(x|c)\geq \mathbb{E}_{q_{\phi}(z|x)}\left[\log p_{\theta}(x|z)-\log \frac{q_{\phi}(z|x)}{p_{\theta}(z|c)}\right]$
  - $z$ : latent variable, $p_{\theta}(z|c)$ : prior distribution of $z$ given condition $c$
  - $p_{\theta}(x|z)$ : likelihood given $z$, $q_{\phi}(z|x)$ : approximate posterior distribution
- VITS consists of a posterior encoder, prior encoder, and decoder, corresponding to $q_{\phi}(z|x), p_{\theta}(z|c), p_{\theta}(x|z)$ respectively, and introduces a discriminator for adversarial training to improve synthesis quality
  1. Prior Encoder
    - The prior encoder $E_{prior}$ receives the input phonemes $c$ and predicts the prior distribution
    - Structurally, it consists of a text encoder for input processing and a normalizing flow $f_{\theta}$ that improves the flexibility of the prior distribution
  2. Posterior Encoder
    - The posterior encoder $E_{posterior}$ operates on the linear spectrogram and predicts the mean/variance of the approximate posterior distribution
    - This module is used only during training, so it does not affect inference speed
  3. Decoder
    - The decoder $E_{decoder}$ generates the waveform from the latent $z$
    - It is generally composed of the HiFi-GAN generator
  4. Discriminator
    - The discriminator $D$ follows HiFi-GAN
    - It includes a Multi-Period Discriminator and a Multi-Scale Discriminator for adversarial training
- FLY-TTS builds a lightweight TTS model by making several modifications to this VITS architecture
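
For reference, the bound in Eq. 1 can be split into a reconstruction term and a KL term. This is only a rewriting of Eq. 1 using the definition of the KL divergence, not an additional formula from the paper:
  $\mathbb{E}_{q_{\phi}(z|x)}\left[\log p_{\theta}(x|z)-\log \frac{q_{\phi}(z|x)}{p_{\theta}(z|c)}\right]=\mathbb{E}_{q_{\phi}(z|x)}\left[\log p_{\theta}(x|z)\right]-D_{KL}\left(q_{\phi}(z|x)\,\|\,p_{\theta}(z|c)\right)$
  - The first term rewards accurate reconstruction of $x$ by the decoder, and the second keeps the approximate posterior close to the prior conditioned on $c$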

(a) Overall (b) Text Encoder & Flow-based Model (c) ConvNeXt-based Decoder (d) Pre-trained WavLM

- Grouped Parameter-Sharing

Parameter-sharing is used to improve parameter efficiency
- To balance the trade-off between model size and expressive power, the paper applies grouped parameter-sharing to the text encoder and flow-based model of the prior encoder
- First, the original text encoder of VITS is a multi-layer transformer encoder
  1. These transformer layers contain redundancy, so parameter-sharing can reduce model size without greatly degrading performance
  2. Accordingly, a grouped parameter-sharing strategy assigns the same parameters to $m_{1}$ sequential layers, so that a total of $g_{1}\times m_{1}$ layers are used with only $g_{1}$ distinct parameter sets
    - Here $g_{1}$ is the number of groups, and when $g_{1}=1$ grouped parameter-sharing becomes complete parameter-sharing
- Meanwhile, the flow-based model also suffers from a large memory footprint
  1. So, as with the text encoder, the $K=g_{2}\times m_{2}$ steps of the flow $\mathbf{f}_{1}, \mathbf{f}_{2}, ..., \mathbf{f}_{K}$ are divided into $g_{2}$ groups
    - Each group contains $m_{2}$ flow steps
  2. In addition, unlike PortaSpeech, which shares the parameters of all modules in the affine coupling layer, not every module is shared
    - Instead, following NanoFlow, only the parameters of the WaveNet-based projection layer are shared, which keeps the remaining modules' parameters independent
- ConvNeXt-based Decoder

The VITS decoder is based on the HiFi-GAN vocoder and synthesizes the waveform from the representation $z$ through transposed convolutions
- The time-consuming upsampling process therefore makes the decoder the bottleneck for inference speed in VITS
- To address this, the paper follows Vocos and uses ConvNeXt blocks as the backbone to generate Fourier time-frequency coefficients at the same temporal resolution
  - The inverse STFT (iSTFT) is then applied to synthesize the raw waveform, greatly reducing computational cost
- Structurally, the ConvNeXt module consists of a $7\times 7$ depthwise convolution, two $1\times 1$ pointwise convolutions, and GELU activation
  1. Given the latent variable $z$, sampling is performed to obtain the feature sequence $S=[s_{1}, s_{2},...,s_{T}],\,\, s_{i}\in \mathbb{R}^{D}$
    - $D$ : hidden representation dimension, $T$ : number of acoustic frames
  2. The features then pass through an embedding layer so that they match the number of frequency bins $N$ of the iSTFT
  3. The stacked ConvNeXt blocks then generate the Fourier time-frequency coefficients $M=[m_{1},m_{2},...,m_{T}], P=[p_{1},p_{2},...,p_{T}]$:
    (Eq. 2) $[M,P]=\text{ConvNeXts}(\text{Embed}(S))$
    - $m_{i}\in \mathbb{R}^{N}$ : amplitude of the complex Fourier coefficients, $p_{i}\in \mathbb{R}^{N}$ : phase
- The iSTFT is used to obtain the waveform $\hat{y}$:
  (Eq. 3) $\hat{y}=\text{iSTFT}(M,P)$
- In practice, the iSTFT is implemented with the Fast Fourier Transform (FFT) algorithm
  - Because the temporal resolution $T$ of the Fourier coefficients is much smaller than the number of raw waveform samples, synthesis is accelerated
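
Eq. 3 can be illustrated with a small, dependency-free sketch. It makes several assumptions that are not from the paper: a naive DFT is used instead of an FFT for readability, the window is a periodic Hann window, and the amplitudes $M$ and phases $P$ come from a helper `stft` rather than from a ConvNeXt decoder, purely so that the round trip can be checked. All function names are hypothetical.

```python
import cmath
import math


def hann(n_fft):
    """Periodic Hann window."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * n / n_fft) for n in range(n_fft)]


def stft(signal, n_fft, hop):
    """Naive STFT returning per-frame (amplitude, phase) for the one-sided bins."""
    win = hann(n_fft)
    mags, phases = [], []
    for start in range(0, len(signal) - n_fft + 1, hop):
        frame = [signal[start + n] * win[n] for n in range(n_fft)]
        spec = [
            sum(frame[n] * cmath.exp(-2j * math.pi * k * n / n_fft) for n in range(n_fft))
            for k in range(n_fft // 2 + 1)
        ]
        mags.append([abs(c) for c in spec])
        phases.append([cmath.phase(c) for c in spec])
    return mags, phases


def istft(mags, phases, n_fft, hop):
    """Rebuild a waveform from amplitude M and phase P by windowed overlap-add."""
    win = hann(n_fft)
    length = n_fft + hop * (len(mags) - 1)
    out = [0.0] * length
    norm = [0.0] * length
    for t, (m_t, p_t) in enumerate(zip(mags, phases)):
        # Complex coefficient per bin: M * exp(jP)
        half = [m * cmath.exp(1j * p) for m, p in zip(m_t, p_t)]
        # Restore the negative-frequency bins by conjugate symmetry (real signal)
        full = half + [half[n_fft - k].conjugate() for k in range(n_fft // 2 + 1, n_fft)]
        for n in range(n_fft):
            x = sum(full[k] * cmath.exp(2j * math.pi * k * n / n_fft) for k in range(n_fft))
            out[t * hop + n] += (x.real / n_fft) * win[n]
            norm[t * hop + n] += win[n] ** 2
    # Divide by the summed squared window; samples with zero window weight stay at 0
    return [o / w if w > 1e-8 else 0.0 for o, w in zip(out, norm)]


signal = [math.sin(2 * math.pi * 3 * n / 32) for n in range(32)]
mags, phases = stft(signal, n_fft=8, hop=4)
recon = istft(mags, phases, n_fft=8, hop=4)
print(len(mags), len(recon))  # 7 32
```

Here 7 frames of coefficients describe 32 waveform samples, which is the property the decoder exploits: the network only has to run at the frame rate $T$, and the iSTFT expands the result to the sample rate.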

- Pre-trained Speech Model for Adversarial Training

As in VALL-E and AudioLM, large pre-trained speech models contain rich acoustic and semantic information and can therefore support high-quality synthesis
- BUT, applying a large pre-trained speech model in the generator generally incurs substantial computational overhead, which is unsuitable for fast synthesis
- FLY-TTS therefore avoids this problem by using the pre-trained WavLM as a discriminator for adversarial training
  - This allows the generator to be updated with the rich acoustic and semantic information learned by the self-supervised model, without affecting the generator's model size or inference speed
- Structurally, WavLM is a self-supervised model with wav2vec2 as its backbone, consisting of a convolutional feature encoder and a transformer encoder
  1. The speech waveform is resampled to 16kHz, and intermediate features are extracted by WavLM
  2. A prediction head then makes the discriminative prediction based on these features
    - Following StyleTTS2, the prediction head is a convolutional network with Leaky ReLU activation
- FLY-TTS uses the least-squares loss as an additional adversarial loss:
  (Eq. 4) $\mathcal{L}_{adv}(D_{w})=\mathbb{E}_{(y,z)}\left[(D_{w}(y)-1)^{2}+(D_{w}(\hat{y}))^{2}\right]$
  (Eq. 5) $\mathcal{L}_{adv}(G)=\mathbb{E}_{z}\left[(D_{w}(\hat{y})-1)^{2}\right]$
  - $D_{w}$ : WavLM discriminator, $G$ : FLY-TTS generator
  - $y$ : real speech, $\hat{y}=G(z)$ : synthesized speech
- To reduce the computational overhead of WavLM, its parameters are frozen and only the prediction head is updated, which also reduces the risk of overfitting
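
Eq. 4 and Eq. 5 can be written out as a short sketch. It assumes the discriminator outputs have already been reduced to one scalar score per utterance, and the expectations are replaced by batch means. The function names are hypothetical, and a real implementation would operate on tensors rather than Python lists.

```python
def mean(values):
    return sum(values) / len(values)


def discriminator_loss(d_real, d_fake):
    """Eq. 4: push D_w(y) toward 1 for real speech and D_w(y_hat) toward 0 for synthesized speech."""
    return mean([(r - 1.0) ** 2 + f ** 2 for r, f in zip(d_real, d_fake)])


def generator_loss(d_fake):
    """Eq. 5: push D_w(y_hat) toward 1 so that synthesized speech is scored as real."""
    return mean([(f - 1.0) ** 2 for f in d_fake])


# Hypothetical discriminator scores for a batch of two utterances
print(discriminator_loss([1.0, 1.0], [0.0, 0.0]))  # 0.0 (perfect discriminator)
print(generator_loss([0.0, 0.5]))                  # 0.625
```

Because WavLM itself is frozen, only the prediction head that produces these scores would receive gradients from `discriminator_loss`, while `generator_loss` updates the generator.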

3. Experiments

- Settings

Dataset : LJSpeech
Comparisons : VITS, MB-iSTFT-VITS

- Results

In terms of RTF and parameter count, FLY-TTS is far more efficient than the baselines
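
As a reminder of what RTF measures, the real-time factor is the synthesis time divided by the duration of the generated audio, so values below 1 mean faster than real time. The sketch below uses hypothetical timings, not results from the paper. The 22050 Hz sample rate matches LJSpeech.

```python
def real_time_factor(synthesis_seconds, num_samples, sample_rate):
    """RTF = time spent synthesizing / duration of the synthesized audio."""
    audio_seconds = num_samples / sample_rate
    return synthesis_seconds / audio_seconds


# Hypothetical: 0.5 s of compute to generate 10 s of 22050 Hz audio
rtf = real_time_factor(0.5, 10 * 22050, 22050)
```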

In terms of synthesis quality, FLY-TTS also achieves performance on par with VITS

In the ablation study, performance degrades when the ConvNeXt decoder or the WavLM discriminator is replaced
