[Paper Review] Nix-TTS: Lightweight and End-to-End Text-to-Speech via Module-wise Distillation

feVeRin 2023. 7. 15. 16:30

Nix-TTS: Lightweight and End-to-End Text-to-Speech via Module-wise Distillation

Text-to-Speech (TTS) models are often hard to optimize or expensive to train
Nix-TTS
- A lightweight, non-autoregressive, end-to-end TTS model built with knowledge distillation (vocoder-free!)
- Uses module-wise distillation, which allows the encoder and decoder modules to be distilled flexibly and independently
Paper (SLT 2022) : Paper Link

1. Introduction

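Recent TTS models are quite large and slow at CPU inference
- This is a major obstacle to deploying voice-based interfaces in low-cost, resource-constrained environments
- To run on low-cost, CPU-bound devices, a TTS model must be light and fast while still synthesizing natural speech
- Lightweight TTS models have been proposed steadily, but most of them focus on text-to-Mel
  - They need an additional vocoder for speech synthesis, so the total model size varies with the vocoder
  - A lightweight TTS model therefore needs an end-to-end design that does not use a vocoder
Neural compression can also be used to reduce model size
- Applying Neural Architecture Search : defining a suitable search space can be difficult
- Applying architecture pruning : the naturalness of the generated speech degrades

-> Hence Nix-TTS, a lightweight TTS model that keeps performance while lowering training cost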

Nix-TTS
- Applies knowledge distillation (KD) to a non-autoregressive end-to-end TTS teacher model
- Performs KD by distilling only the duration of the teacher network
- Module-wise distillation for the encoder and decoder

< Overview of Nix-TTS >

A TTS model that inherits the non-autoregressive, end-to-end property of needing no additional vocoder
Achieves a considerably smaller size and faster inference
Keeps the naturalness of the synthesized speech and reaches intelligibility close to the teacher model

2. Method

- Problem Formulation

$F(\cdot; w)$ : end-to-end neural TTS model
- End-to-End : the model generates speech data $x$ in raw waveform form $x_{w}$ directly from text $c$, without an external vocoder
Architecture of an end-to-end TTS model
- $E$ : Encoder, $D$ : Decoder
  - $E$ encodes $c$ into a latent representation $z$
  - $D$ decodes $z$ into $x_{w}$
- Depending on the model, $z$ can be deterministic or generative, following a distribution such as $z \sim N(\mu, \sigma)$

KD setting
- (Goal) Given the loss functions $L_{E}$, $L_{D}$, train $F_{s}$ so that its encoder $E_{s}$ and decoder $D_{s}$ minimize them
  - $F_{s}$ produces $\hat{z}$ and $\hat{x}_{w}$, through $E_{s}$ and $D_{s}$, that closely match the outputs of $F_{t}$
- $F_{t}$ : teacher model, $F_{s}$ : student model
- $\{z, x_{w} \}$, $\{\hat{z},\hat{x}_{w} \}$ : outputs generated by the teacher and student model, respectively
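
The KD goal can be written as a single objective. This is a sketch reconstructed from the symbols defined in this section, so the argument order of $L_{E}$ and $L_{D}$ is an assumption:

$$E_{s}, D_{s} = \underset{E_{s}, D_{s}}{\arg\min} \; L_{E}(z, \hat{z}) + L_{D}(x_{w}, \hat{x}_{w})$$

The two terms do not share inputs: $L_{E}$ compares latents and $L_{D}$ compares waveforms. That separation is what lets the encoder and decoder be distilled independently.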

- End-to-End TTS Teacher

VITS, an end-to-end non-autoregressive TTS model, is chosen as the teacher model $F_{t}$
- VITS can be formulated as a conditional Variational AutoEncoder (cVAE)
- Following the cVAE proposed in VITS, $q_{\theta}(z|x), p_{\phi}(x|z)$ are parameterized by $\theta, \phi$
  - $q_{\theta}(z|x)$ : posterior distribution, $p_{\phi}(x|z)$ : data distribution
  - $x$ : speech data variable, $z$ : latent variable
- The prior distribution of $z$ is defined as $p_{\psi}(z|c)$
  - The latent is conditioned on the input text $c$ and parameterized by $\psi$
- (Goal of VITS) Train to maximize the evidence lower bound (ELBO) on the log-likelihood of $x$ given $c$
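
For reference, the standard cVAE form of the ELBO, written with the symbols used above. The notation is adapted to this post and may differ from the paper's:

$$\log p(x|c) \geq \mathbb{E}_{q_{\theta}(z|x)}\left[\log p_{\phi}(x|z)\right] - D_{KL}\left(q_{\theta}(z|x) \,\|\, p_{\psi}(z|c)\right)$$

The first term is the reconstruction term. The second is the KL-divergence that pulls the posterior toward the text-conditioned prior.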

The reconstruction term uses the Mel-spectrogram $x_{m}$
- It is an L1 loss between the ground truth and the predicted speech
- $\hat{x}_{m} \sim p_{\phi}(x|z)$
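
Written out, the L1 reconstruction term described above would look like the following. The name $L_{recon}$ is borrowed from the decoder objective later in this post, and any scaling factor is omitted as an assumption:

$$L_{recon} = \| x_{m} - \hat{x}_{m} \|_{1}$$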

Architecturally, VITS can be split into 3 modules corresponding to the distributions $q_{\theta}(z|x), p_{\phi}(x|z), p_{\psi}(z|c)$
- Posterior Encoder
  - Composed of non-causal WaveNet residual blocks
  - Encodes $x$, given as the linear spectrogram $x_{s}$, into $\{ \mu_{q}, \sigma_{q} \}$, the parameters of $q_{\theta}(z|x) = N(\mu_{q}, \sigma_{q})$
  - A latent sample $z_{q} \sim N(\mu_{q}, \sigma_{q})$ is drawn and passed to the decoder, which reconstructs $x$ as the raw waveform $x_{w}$
- Prior Encoder
  - Composed of Transformer encoder blocks and a normalizing flow $f$ with affine coupling layers
  - Encodes $c$ into $\{ \mu_{p}, \sigma_{p} \}$, the parameters of $p_{\psi}(z|c) = N(\mu_{p}, \sigma_{p})$, and produces the prior latent sample $z_{p} = f(z_{q})$
  - $\{ \mu_{p}, \sigma_{p} \}$ and $z_{p}$ are aligned with Monotonic Alignment Search (MAS)
  - During inference, the network uses the aligned prior parameters $\{ \mu_{p}^{'}, \sigma_{p}^{'} \}$ to infer $z_{q}$ from $f^{-1}(\mu_{p}^{'}, \sigma_{p}^{'})$
- Decoder
  - Follows the generator architecture of HiFi-GAN v1
  - Reconstructs $x_{w}$ from $z_{q}$ adversarially, using a multi-period discriminator
Available Knowledge to be Distilled
- Assuming the teacher VITS is already trained, its encoder-decoder structure can be used
  - The Posterior Encoder plays the role of $E_{t}$, modeling the latent distribution $q_{\theta}(z|x)$
  - The Decoder plays the role of $D_{t}$, decoding $x_{w}$ from the latent sample $z_{q} \sim q_{\theta}(z|x)$
- The Prior Encoder and the Posterior Encoder both encode the same latent space, so either can be regarded as $E_{t}$
  - The student $E_{s}$ could also be distilled from the Prior Encoder, but that path is complicated by $f$, which provides only stochastic samples of $q_{\theta}(z|x)$
  - Because the Prior Encoder is complex, $q_{\theta}(z|x)$ is distilled from the simpler Posterior Encoder

- End-to-End TTS Student

Nix-TTS serves as the end-to-end TTS student model $F_{s}$
- It is obtained by distillation from VITS, the end-to-end TTS teacher model $F_{t}$
Encoder Architecture
- (Goal of the Nix-TTS encoder) Model $q_{\theta}(z|x) = N(\mu_{q}, \sigma_{q})$ by predicting its parameters $\{ \mu_{q}, \sigma_{q} \}$
  - This distribution is conditioned on $x_{s}$ rather than $c$
  - So $c$ must be encoded in a way that aligns meaningfully with $x_{s}$
- $E_{s}$ consists of 4 modules: Text Encoder, Text Aligner, Duration Predictor, and Latent Encoder

Text Encoder
- Encodes $c$ into the text hidden representation $c_{hidden}$
- $c$ passes through an embedding layer, absolute positional encoding, and stacked dilated residual 1D convolution blocks
- Each convolution block uses SiLU activation and layer normalization
Text Aligner
- Used to learn the alignment between $c$ and $x_{s}$
- Convolution layers encode $c_{hidden}, x_{s}$ into $c_{enc}, x_{enc}$
  1. Soft alignment ($A_{soft}$)
    - Obtained by taking the normalized pairwise affinity between the two
  2. Hard alignment ($A_{hard}$)
    - Non-autoregressive TTS requires a hard duration to be defined per token
    - Obtained by applying MAS to $A_{soft}$
  3. A batch matrix-matrix product between $c_{hidden}$ and $A_{hard}$ gives the aligned text representation $c_{aligned}$
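
The three aligner steps can be sketched in plain Python on a toy example. This is only an illustration, not the authors' implementation. It assumes the "normalized pairwise affinity" is a softmax over tokens of the negative squared L2 distance. It also assumes MAS is the usual dynamic program over monotonic paths, where each frame either stays on the current token or advances by one. The function names and toy values are hypothetical.

```python
import math

def soft_alignment(c_enc, x_enc):
    """A_soft[i][j]: softmax over tokens i of -||c_enc[i] - x_enc[j]||^2 (assumed affinity)."""
    n, t = len(c_enc), len(x_enc)
    a_soft = [[0.0] * t for _ in range(n)]
    for j, frame in enumerate(x_enc):
        scores = [-sum((c - x) ** 2 for c, x in zip(tok, frame)) for tok in c_enc]
        top = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        for i in range(n):
            a_soft[i][j] = exps[i] / total
    return a_soft

def monotonic_alignment_search(a_soft):
    """Most likely monotonic path through A_soft, returned as a 0/1 matrix A_hard."""
    n, t = len(a_soft), len(a_soft[0])
    neg_inf = float("-inf")
    log_p = [[math.log(max(v, 1e-12)) for v in row] for row in a_soft]
    q = [[neg_inf] * t for _ in range(n)]  # q[i][j]: best score ending at token i, frame j
    q[0][0] = log_p[0][0]
    for j in range(1, t):
        for i in range(min(n, j + 1)):
            stay = q[i][j - 1]
            move = q[i - 1][j - 1] if i > 0 else neg_inf
            q[i][j] = max(stay, move) + log_p[i][j]
    # Backtrack from the last token at the last frame.
    a_hard = [[0] * t for _ in range(n)]
    i = n - 1
    for j in range(t - 1, -1, -1):
        a_hard[i][j] = 1
        if j > 0 and i > 0 and q[i - 1][j - 1] >= q[i][j - 1]:
            i -= 1
    return a_hard

def align_text(c_hidden, a_hard):
    """c_aligned[j] = sum_i A_hard[i][j] * c_hidden[i], one vector per frame."""
    n, t, d = len(a_hard), len(a_hard[0]), len(c_hidden[0])
    return [[sum(a_hard[i][j] * c_hidden[i][k] for i in range(n)) for k in range(d)]
            for j in range(t)]

# Toy example: 3 tokens, 5 frames, 1-dimensional encodings (hypothetical values).
c_enc = [[0.0], [1.0], [2.0]]
x_enc = [[0.0], [0.1], [1.0], [1.9], [2.0]]
c_hidden = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

a_soft = soft_alignment(c_enc, x_enc)
a_hard = monotonic_alignment_search(a_soft)
c_aligned = align_text(c_hidden, a_hard)
```

In the toy example each frame is closest to exactly one token and those tokens are already in monotonic order, so MAS reproduces the per-frame argmax. The sketch assumes at least as many frames as tokens.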

Duration Predictor
- Predicts $A_{hard}$ during inference, when $x_{s}$ is not available
- Built from stacked 1D convolutions to predict the per-token duration $d_{hard}$ extracted from $A_{hard}$
- A regression model that predicts $d_{hard}$ given $c_{hidden}$
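
A small sketch of how $d_{hard}$ relates to $A_{hard}$: the duration of a token is the number of frames assigned to it, which is the row sum of the 0/1 alignment matrix. The second helper shows one common way predicted durations are used at inference, repeating each token vector by its duration. That usage is an assumption here and is not confirmed by the post. Both function names are hypothetical.

```python
def durations_from_alignment(a_hard):
    """Per-token duration d_hard: the number of frames assigned to each token."""
    return [sum(row) for row in a_hard]

def expand_by_duration(c_hidden, durations):
    """Repeat each token vector by its duration (assumed inference-time usage)."""
    return [vec for vec, d in zip(c_hidden, durations) for _ in range(d)]

# Hypothetical 3-token, 5-frame hard alignment.
a_hard = [
    [1, 1, 0, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 0, 1, 1],
]
d_hard = durations_from_alignment(a_hard)
expanded = expand_by_duration([[1.0], [2.0], [3.0]], d_hard)
```

With a 0/1 alignment matrix, expanding by duration gives the same result as the matrix product with $A_{hard}$, so the regression target $d_{hard}$ is enough to rebuild the alignment without $x_{s}$.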

Latent Encoder
- Same structure as the text encoder, without the embedding layer
- $\{ \mu_{q}, \sigma_{q} \}$ are produced by projecting the latent encoder output through a single perceptron layer
Decoder Architecture
- (Goal of the Nix-TTS decoder) The decoder $D_{s}$ models the distribution $p_{\phi}(x|z)$
  - Takes the latent variable $z_{q} \sim N(\mu_{q}, \sigma_{q})$ as input and decodes the associated raw waveform $x_{w}$
- $D_{s}$ follows almost the same architecture as $D_{t}$ but with fewer parameters
  - Follows the HiFi-GAN generator structure, made of transposed convolutions and multi-receptive field fusion modules
  - During training, the teacher model's multi-period discriminator $C_{s}$ is used for $D_{s}$
- To reduce the parameter count of $D_{s}$, the vanilla convolutions are replaced with depthwise-separable convolutions and the feature map dimension is halved
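
To get a feel for why these two changes shrink the decoder, the sketch below counts the weights of a single 1D convolution layer. The channel count (512) and kernel size (7) are hypothetical values chosen for illustration, not the paper's configuration, and biases are ignored.

```python
def conv1d_params(c_in, c_out, kernel):
    """Weights of a vanilla 1D convolution (bias ignored)."""
    return c_in * c_out * kernel

def depthwise_separable_params(c_in, c_out, kernel):
    """Depthwise conv (one kernel per input channel) plus 1x1 pointwise conv (bias ignored)."""
    depthwise = c_in * kernel
    pointwise = c_in * c_out
    return depthwise + pointwise

# Hypothetical layer: 512 channels in and out, kernel size 7.
vanilla = conv1d_params(512, 512, 7)
separable = depthwise_separable_params(512, 512, 7)
separable_halved = depthwise_separable_params(256, 256, 7)

print(vanilla, separable, separable_halved)
```

For this hypothetical layer, the separable form is roughly 7 times smaller than the vanilla one. Halving the channels cuts it by roughly another 4 times, because the pointwise term scales with the square of the channel count.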

- Module-wise Distillation

Encoder Distillation
- The main goal of the encoder $E_{s}$ is to learn the alignment between $c$ and $x_{s}$ and to model $q_{\theta}(z|x)$ by predicting the corresponding parameters
- Alignment objective
  - The forward-sum algorithm maximizes the likelihood of $c_{hidden}$ given $x_{s}$ as expressed in $A_{soft}$, and a KL-divergence term is minimized to make $A_{soft}$ and $A_{hard}$ agree
  - $L_{ForwardSum}, L_{bin}$
- To model $q_{\theta}(z|x)$, the KL-divergence ($L_{kl}$) between $N(\hat{\mu}_{q}, \hat{\sigma}_{q})$ and $q_{\theta}(z|x)$ is minimized
  - Both distributions are Gaussian, so the KL-divergence is minimized in closed form
- Final Encoder Objective : $L_{E} = L_{ForwardSum} + L_{bin} + L_{kl}$
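
The closed-form KL-divergence between two univariate Gaussians is short enough to write down directly. The sketch below treats $\sigma$ as a standard deviation and puts the student distribution first. Both choices are assumptions, since the post does not state the direction or whether $\sigma$ is a standard deviation or a variance. In practice the term would be applied per latent dimension and averaged.

```python
import math

def gaussian_kl(mu_s, sigma_s, mu_t, sigma_t):
    """KL( N(mu_s, sigma_s^2) || N(mu_t, sigma_t^2) ) for scalar Gaussians."""
    return (
        math.log(sigma_t / sigma_s)
        + (sigma_s ** 2 + (mu_s - mu_t) ** 2) / (2.0 * sigma_t ** 2)
        - 0.5
    )

# Identical distributions give zero divergence.
kl_same = gaussian_kl(0.0, 1.0, 0.0, 1.0)
# Shifting the student mean by 1 with unit variances gives 0.5.
kl_shifted = gaussian_kl(1.0, 1.0, 0.0, 1.0)
```

Because the expression is differentiable in the student's $\hat{\mu}_{q}, \hat{\sigma}_{q}$, no sampling from either distribution is needed to compute $L_{kl}$.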

Decoder Distillation
- The goal of the decoder $D_{s}$ is to generate $\hat{x}_{w}$ that sounds similar to $x_{w}$
  - Least-squares adversarial training ($L_{adv, disc}, L_{adv, gen}$), feature matching loss ($L_{fmatch}$), and Mel-spectrogram reconstruction loss ($L_{recon}$) can be used
  - $C^{l}_{s}$ : feature map of the $l$-th layer of the discriminator
  - $n_{l}$ : number of feature maps in the $l$-th layer
  - $L$ : number of layers in $C_{s}$
- To speed up convergence and improve speech quality, a Generalized Energy Distance (GED) loss $L_{ged}$ is added to the decoder objective
  - $d_{spec}(.)$ : multi-scale spectrogram distance
  - $\hat{x}^{a}_{w}$, $\hat{x}^{b}_{w}$ : audio generated by $D_{s}$ for noise samples drawn from $N(0,1)$
- Final Decoder Objective : $L_{D} = L_{adv, disc} + L_{adv, gen} + L_{fmatch} + L_{recon} + L_{ged}$
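
For reference, these are the usual forms of the individual decoder terms, written with the symbols defined above. They follow the standard least-squares GAN, feature matching, and generalized energy distance definitions. The paper's exact expressions and weights may differ, so treat them as a sketch:

$$L_{adv, disc} = \mathbb{E}\left[(C_{s}(x_{w}) - 1)^{2} + C_{s}(\hat{x}_{w})^{2}\right]$$

$$L_{adv, gen} = \mathbb{E}\left[(C_{s}(\hat{x}_{w}) - 1)^{2}\right]$$

$$L_{fmatch} = \mathbb{E}\left[\sum_{l=1}^{L} \frac{1}{n_{l}} \| C^{l}_{s}(x_{w}) - C^{l}_{s}(\hat{x}_{w}) \|_{1}\right]$$

$$L_{ged} = \mathbb{E}\left[2 d_{spec}(x_{w}, \hat{x}^{a}_{w}) - d_{spec}(\hat{x}^{a}_{w}, \hat{x}^{b}_{w})\right]$$

In $L_{ged}$, the first term pulls generated audio toward the target. The second term rewards two samples generated from different noise for being different from each other.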

3. Experiments

- Settings

Dataset : LJSpeech
Teacher Configuration : VITS
Comparisons : BVAE-TTS, SpeedySpeech with HiFi-GAN

- Speech Synthesis Quality

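Nix-TTS retains the speech quality of the teacher VITS well
- Measured by CMOS, Nix-TTS scores slightly lower than VITS while being compressed by 82% in parameter count

To evaluate the intelligibility of Nix-TTS, text is predicted from the generated audio samples and the Phoneme Error Rate (PER) is compared
- Nix-TTS differs from the teacher VITS by 0.5% while having a smaller model size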

- Model Speedup and Complexity

Model speed and complexity are compared on an Intel-i7 CPU and a Raspberry Pi Model 3B in terms of Real Time Factor (RTF) and parameter count
- On the Intel-i7 CPU, Nix-TTS is 3.04 times faster than the teacher VITS
  - Model size is reduced by 89.34%
- On the Raspberry Pi Model 3B, Nix-TTS is 8.36 times faster
  - Model size is reduced by 81.32%
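
The metrics in this comparison are simple ratios. The sketch below shows how RTF, speedup, and size reduction are typically computed. All numbers in it are made up for illustration and are not the paper's measurements.

```python
def real_time_factor(synthesis_seconds, audio_seconds):
    """RTF: time spent synthesizing divided by the duration of the audio produced."""
    return synthesis_seconds / audio_seconds

def speedup(rtf_teacher, rtf_student):
    """How many times faster the student is than the teacher."""
    return rtf_teacher / rtf_student

def size_reduction(params_teacher, params_student):
    """Fraction of the teacher's parameters removed in the student."""
    return 1.0 - params_student / params_teacher

# Hypothetical timings for synthesizing 8 seconds of audio.
rtf_teacher = real_time_factor(4.0, 8.0)
rtf_student = real_time_factor(1.0, 8.0)

print(rtf_teacher, rtf_student, speedup(rtf_teacher, rtf_student))
print(size_reduction(32_000_000, 8_000_000))
```

An RTF below 1 means the model synthesizes faster than real time, which is the practical threshold for interactive use on a given device.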
The speedup comes from using computationally efficient depthwise-separable convolutions instead of self-attention
-> This shows the efficiency of Nix-TTS in low-cost, resource-constrained environments (lightweight!)
