[Paper Review] Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech

feVeRin 2023. 7. 17. 13:36

Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech

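End-to-end text-to-speech (TTS) models that enable single-stage training have been proposed, but their speech quality is still lower than that of two-stage TTS models
A parallel end-to-end TTS model that generates more natural speech than two-stage TTS models is needed
VITS (Variational Inference with adversarial learning for end-to-end Text-to-Speech)
- Improves the expressive power of generative modeling through variational inference augmented with normalizing flows and adversarial training
- Stochastic duration predictor for synthesizing speech with diverse rhythms from the input text
- Expresses the one-to-many relationship in which a text can be spoken with different pitches and rhythms
Paper (ICML 2021) : Paper Link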

1. Introduction

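Apart from text preprocessing, TTS systems have typically used two-stage generative modeling
- Stage 1 : generate an intermediate speech representation, such as a Mel-spectrogram or linguistic features, from text
- Stage 2 : generate the raw waveform from the intermediate representation
Autoregressive TTS models can synthesize realistic speech, but their sequential structure makes them hard to parallelize
- Non-autoregressive TTS models have been proposed as an alternative
  - Methods that use a pretrained autoregressive teacher network in the text-to-spectrogram stage
  - Methods that estimate the alignment maximizing the likelihood of the target Mel-spectrogram
- Generative Adversarial Networks (GANs) can also be applied to TTS
  - Raw waveforms are synthesized using multiple discriminators, each distinguishing samples at a different scale or period
Two-stage TTS requires sequential training or fine-tuning, and its dependence on predefined intermediate features makes it hard to exploit learned hidden representations
- End-to-end TTS models such as FastSpeech 2s and EATS have emerged
  - A Mel-spectrogram decoder is used to aid text representation learning
  - A specialized spectrogram loss relaxes the length mismatch between target and generated speech
- However, end-to-end TTS still falls short of two-stage models in synthesis quality

-> VITS is therefore proposed as a parallel end-to-end TTS model that synthesizes more natural speech than two-stage models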

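< Overview of VITS >

To improve expressive power, normalizing flows are applied to the conditional prior distribution and adversarial training is applied in the waveform domain
A stochastic duration predictor is used to express the one-to-many relationship of text
Uncertainty modeling over latent variables and the stochastic duration predictor capture speech variations that text cannot represent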

2. Method

- Variational Inference

Overview
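- VITS can be expressed as a conditional VAE that maximizes the Evidence Lower BOund (ELBO) of the intractable marginal log-likelihood of the data, $\log p_{\theta}(x|c)$
  - $p_{\theta}(z|c)$ : prior distribution of the latent variable $z$ given the condition $c$
  - $p_{\theta}(x|z)$ : likelihood function of a datapoint $x$
  - $q_{\psi}(z|x)$ : approximate posterior distribution
  - ELBO : $\log p_{\theta}(x|c) \geq \mathbb{E}_{q_{\psi}(z|x)} \left[ \log p_{\theta}(x|z) - \log \frac{q_{\psi}(z|x)}{p_{\theta}(z|c)} \right]$
- The training loss is then the negative ELBO
  - Sum of the reconstruction loss $-\log p_{\theta}(x|z)$ and the KL divergence $\log q_{\psi}(z|x) - \log p_{\theta}(z|c)$
  - $z \sim q_{\psi}(z|x)$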

Reconstruction Loss
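- A Mel-spectrogram $x_{mel}$ is used as the target datapoint of the reconstruction loss instead of the raw waveform
  1. Upsample the latent variable $z$ to the waveform domain $\hat{y}$ through the decoder
  2. Transform $\hat{y}$ to the Mel-spectrogram domain $\hat{x}_{mel}$
  3. Use the L1 loss between the predicted and target Mel-spectrograms as the reconstruction loss : $L_{recon} = \| x_{mel} - \hat{x}_{mel} \|_{1}$
- This can be viewed as maximum likelihood estimation that assumes a Laplace distribution for the data and ignores constant terms
- The Mel-spectrogram is used to improve perceptual quality
  - The Mel-scale approximates the response of the human auditory system
  - Estimating the Mel-spectrogram from a waveform uses only the STFT and a linear projection onto the Mel-scale, so no trainable parameters are needed
- Mel-spectrogram estimation is used only during training
  - For efficient end-to-end training, only partial sequences of the latent variable $z$ are fed to the decoder instead of upsampling the whole sequence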
KL-Divergence
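- The input condition $c$ of the prior encoder consists of the phonemes $c_{text}$ extracted from text and the alignment $A$ between phonemes and latent variables
- The alignment is a hard monotonic attention matrix of dimensions $|c_{text}| \times |z|$
  - It represents how long each input phoneme expands to be time-aligned with the target speech
  - There is no ground truth for the alignment, so it must be estimated at each training iteration
- The goal is to provide higher-resolution information to the posterior encoder
  - The linear-scale spectrogram $x_{lin}$ of the target speech is used as its input instead of the Mel-spectrogram
  - $L_{kl} = \log q_{\psi}(z|x_{lin}) - \log p_{\theta}(z|c_{text}, A)$, with $z \sim q_{\psi}(z|x_{lin})$

Factorized normal distributions are used to parameterize the prior and posterior encoders
- Increasing the expressiveness of the prior distribution is important for generating realistic samples
- A normalizing flow $f_{\theta}$, which invertibly transforms a simple distribution into a more complex one, is applied on top of the factorized normal prior
  - $p_{\theta}(z|c) = N(f_{\theta}(z); \mu_{\theta}(c), \sigma_{\theta}(c)) \left| \det \frac{\partial f_{\theta}(z)}{\partial z} \right|$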
- Alignment Estimation

Monotonic Alignment Search
- To estimate the alignment $A$ between the input text and the target speech, Monotonic Alignment Search (MAS) is used, which finds the alignment that maximizes the likelihood of data parameterized by a normalizing flow $f$
  - Candidate alignments are restricted to be monotonic and non-skipping, just as a person reads text in order
  - Dynamic programming is used to find the optimal alignment

Applying MAS directly is difficult because the objective is the ELBO rather than the exact log-likelihood
- MAS is redefined to find the alignment that maximizes the ELBO, which reduces to maximizing the log-likelihood of the latent variable $z$
Duration Prediction from Text
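- The duration $d_{i}$ of each input token can be computed by summing all columns in each row of the estimated alignment : $d_{i} = \sum_{j} A_{i,j}$
- A deterministic duration predictor trained on these durations cannot express how a person speaks at a different rate each time
  - To generate human-like speech rhythms, a Stochastic Duration Predictor is used so that samples follow the duration distribution of the given phonemes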
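
A small worked example of the row-sum duration computation, using a made-up alignment for 3 phonemes and 6 latent frames (the matrix values are hypothetical):

```python
import numpy as np

# Hypothetical hard alignment: rows are phonemes, columns are latent frames.
A = np.array([
    [1, 1, 0, 0, 0, 0],
    [0, 0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0, 1],
])
durations = A.sum(axis=1)                       # d_i = sum_j A[i, j]
# Repeating each phoneme index by its duration recovers the frame-to-phoneme map.
frame_to_token = np.repeat(np.arange(A.shape[0]), durations)
```

The durations always sum to the number of latent frames, because every frame is assigned to exactly one phoneme.
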
Stochastic Duration Predictor
- A flow-based generative model trained via maximum likelihood estimation (MLE)
- Applying MLE directly is difficult
  - because the duration of each input phoneme is:
  1. a discrete integer, which must be dequantized before continuous normalizing flows can be used
  2. a scalar, which prevents high-dimensional transformations because of the invertibility constraint
- Variational dequantization and variational data augmentation are used so that MLE can be applied
  - Random variables $u$ and $v$ are introduced with the same time resolution and dimension as the duration sequence $d$
  - $u$ is restricted to $[0,1)$ so that $d-u$ is a positive real number
  - $v$ and $d$ are concatenated channel-wise to form a higher-dimensional latent representation
  - $u$ and $v$ are sampled from the approximate posterior distribution $q_{\psi}(u,v|d,c_{text})$
- Objective of the Stochastic Duration Predictor : variational lower bound of the phoneme duration log-likelihood
- Training loss $L_{dur}$ : negative variational lower bound
  - A stop gradient operator, which blocks backpropagation, is applied to the input conditions
  - This keeps duration predictor training from affecting the other modules
- Phoneme durations are sampled from random noise through the inverse transformation of the stochastic duration predictor and then converted to integers
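
The dequantization and augmentation steps can be illustrated with toy numbers. In this sketch, uniform and Gaussian noise stand in for samples from the learned posterior $q(u,v|d,c_{text})$, and rounding up is used as one possible integer conversion; both are assumptions, since the review does not spell out these details.

```python
import numpy as np

rng = np.random.default_rng(0)
d = np.array([2, 3, 1, 5])                 # hypothetical integer phoneme durations

# Variational dequantization: u in [0, 1) turns d into a positive real d - u.
u = rng.uniform(0.0, 1.0, size=d.shape)    # stand-in for a posterior sample of u
d_cont = d - u                             # lies in (d - 1, d]

# Variational data augmentation: an extra channel v lifts the scalar to 2 channels.
v = rng.standard_normal(d.shape)           # stand-in for a posterior sample of v
x = np.stack([d_cont, v], axis=0)          # shape (2, n_phonemes)

# Rounding up maps the dequantized values back to the original integers.
d_recovered = np.ceil(d_cont).astype(int)
```

Because $u$ stays below 1, every dequantized duration remains positive for $d \geq 1$, which is what allows a continuous flow to model it.
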

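Objective of the Stochastic Duration Predictor : $\log p_{\theta}(d|c_{text}) \geq \mathbb{E}_{q_{\psi}(u,v|d,c_{text})} \left[ \log \frac{p_{\theta}(d-u, v|c_{text})}{q_{\psi}(u,v|d,c_{text})} \right]$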

- Adversarial Training

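A discriminator $D$ is introduced to distinguish between the output generated by the decoder $G$ and the ground truth waveform $y$
- An adversarial least-squares loss and a feature matching loss are combined for speech synthesis
  - $L_{adv}(D) = \mathbb{E}_{(y,z)} \left[ (D(y)-1)^{2} + (D(G(z)))^{2} \right]$
  - $L_{adv}(G) = \mathbb{E}_{z} \left[ (D(G(z))-1)^{2} \right]$
  - $L_{fm}(G) = \mathbb{E}_{(y,z)} \left[ \sum_{l=1}^{T} \frac{1}{N_{l}} \| D^{l}(y) - D^{l}(G(z)) \|_{1} \right]$
  - $T$ : total number of layers in the discriminator
  - $D^{l}$ : feature map of the $l$-th layer of the discriminator, with $N_{l}$ features
- The feature matching loss can be seen as a reconstruction loss measured in the hidden layers of the discriminator, an alternative to the element-wise reconstruction loss of VAEs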
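
The least-squares adversarial loss and the feature matching loss described above can be sketched in NumPy as follows. The function names are hypothetical, the expectations are replaced by means over a toy batch, and the discriminator outputs and feature maps are passed in as plain arrays rather than computed by a real network.

```python
import numpy as np

def discriminator_loss(d_real, d_fake):
    """Least-squares loss for D: push D(y) toward 1 and D(G(z)) toward 0."""
    return float(np.mean((d_real - 1.0) ** 2 + d_fake ** 2))

def generator_adv_loss(d_fake):
    """Least-squares loss for G: push D(G(z)) toward 1."""
    return float(np.mean((d_fake - 1.0) ** 2))

def feature_matching_loss(feats_real, feats_fake):
    """Sum over layers l of (1 / N_l) * ||D^l(y) - D^l(G(z))||_1."""
    return float(sum(np.abs(fr - ff).sum() / fr.size
                     for fr, ff in zip(feats_real, feats_fake)))

# Toy feature maps for a 2-layer discriminator (shapes are arbitrary).
feats_real = [np.zeros((2, 3)), np.zeros(5)]
feats_fake = [np.ones((2, 3)), np.ones(5)]
fm = feature_matching_loss(feats_real, feats_fake)
```

With these toy inputs each layer contributes an average absolute difference of 1, so the feature matching loss equals the number of layers.
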

- Final Loss

Final loss obtained by combining VAE and GAN training : $L_{vae} = L_{recon} + L_{kl} + L_{dur} + L_{adv}(G) + L_{fm}(G)$

- Model Architecture

Consists of a posterior encoder, prior encoder, decoder, discriminator, and stochastic duration predictor
- The posterior encoder and discriminator are used only during training
Posterior encoder
- Uses the non-causal WaveNet residual blocks used in WaveGlow and Glow-TTS
  - A WaveNet residual block consists of dilated convolutions with a gated activation unit and a skip connection
  - A linear projection layer on top of the blocks produces the mean and variance of the normal posterior distribution
- For the multi-speaker case, speaker embeddings are added through global conditioning in the residual blocks
Prior encoder
- Consists of a text encoder that processes the input phonemes $c_{text}$ and a normalizing flow $f_{\theta}$ that improves the flexibility of the prior distribution
- The text encoder is a transformer encoder with relative positional encoding
  - The hidden representation $h_{text}$ is obtained from $c_{text}$ through the text encoder, and a linear projection layer on top produces the mean and variance of the prior distribution
- The normalizing flow is a stack of affine coupling layers built from WaveNet residual blocks
  - Designed as a volume-preserving transformation whose Jacobian determinant is 1
- For the multi-speaker case, speaker embeddings are added through global conditioning
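
One simple way to obtain a coupling layer with Jacobian determinant 1 is to drop the scale and keep only the shift. The sketch below shows that idea in NumPy. It is an assumption about how volume preservation could be achieved, not a description taken from the review, and `shift_net` is a hypothetical stand-in for the WaveNet-based network.

```python
import numpy as np

def shift_net(x_a):
    """Hypothetical stand-in for the network that predicts the shift."""
    return np.tanh(x_a) * 2.0

def coupling_forward(x):
    """Shift-only coupling: the first half passes through, the second half is shifted."""
    x_a, x_b = np.split(x, 2, axis=0)
    return np.concatenate([x_a, x_b + shift_net(x_a)], axis=0)

def coupling_inverse(y):
    """Exact inverse: recompute the shift from the untouched half and subtract it."""
    y_a, y_b = np.split(y, 2, axis=0)
    return np.concatenate([y_a, y_b - shift_net(y_a)], axis=0)

x = np.arange(8.0).reshape(4, 2)   # 4 channels, 2 frames
y = coupling_forward(x)
x_back = coupling_inverse(y)
```

The Jacobian of this transformation is triangular with ones on the diagonal, so its log-determinant is zero and no volume term needs to be tracked in the prior density.
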
Decoder
- Uses the HiFi-GAN V1 generator
  - A stack of transposed convolutions, each followed by a multi-receptive field fusion (MRF) module
  - The MRF output is the sum of the outputs of residual blocks with different receptive field sizes
- For the multi-speaker case, a linear layer that transforms the speaker embedding is added, and its output is added to the input latent variable $z$
Discriminator
- Uses the multi-period discriminator of HiFi-GAN
  - A mixture of Markovian window-based sub-discriminators
  - Each sub-discriminator operates on a different periodic pattern of the input waveform
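
To make "operating on a periodic pattern" concrete: in HiFi-GAN's multi-period discriminator, each sub-discriminator folds the 1D waveform into a 2D array whose width is its period, so that samples one period apart line up in the same column. The review does not detail this step, so the NumPy sketch below, including the function name and the reflect padding, should be read as an illustrative assumption.

```python
import numpy as np

def reshape_by_period(waveform, period):
    """Fold a 1D waveform into shape (ceil(T / period), period)."""
    remainder = len(waveform) % period
    if remainder:
        # Pad the tail so the length becomes a multiple of the period.
        waveform = np.pad(waveform, (0, period - remainder), mode="reflect")
    return waveform.reshape(-1, period)

folded = reshape_by_period(np.arange(10.0), period=3)
```

Each column of `folded` holds every third sample of the input, which is the view a period-3 sub-discriminator would process with 2D convolutions.
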
Stochastic Duration Predictor
- Estimates the distribution of phoneme durations from the conditional input $h_{text}$
- Built by stacking residual blocks with dilated and depth-separable convolution layers
- Neural spline flows are used as the coupling layers
  - They apply invertible nonlinear transformations using monotonic rational-quadratic splines
  - Neural spline flows are more expressive than affine coupling layers with a similar number of parameters
- For the multi-speaker case, a linear layer that processes the speaker embedding is added, and its output is added to $h_{text}$

3. Experiments

- Settings

Dataset : LJSpeech, VCTK
Comparisons : Tacotron2, Glow-TTS, HiFi-GAN

- Speech Synthesis Quality

VITS achieves the MOS closest to the ground truth among the compared TTS models
- The stochastic duration predictor generates more realistic phoneme durations
- The end-to-end approach can produce better samples than the other TTS models

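Ablation of the normalizing flow in the prior encoder and the linear-scale spectrogram input of the posterior encoder
- Removing the normalizing flow decreases the MOS by 1.52
- Replacing the linear-scale spectrogram with a Mel-spectrogram decreases the MOS by 0.19

Comparison of the effects of the normalizing flow and the linear-scale spectrogram in VITS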

- Generalization to Multi-Speaker Text-to-Speech

On the multi-speaker dataset VCTK, VITS also achieves a higher MOS than the other models
- VITS can express diverse speech characteristics
- Diverse speech characteristics can be learned in an end-to-end manner

- Speech Variation

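Comparison of the lengths of 100 utterances generated by each model
- Glow-TTS generates only fixed-length utterances because it uses a deterministic duration predictor
- VITS shows a length distribution similar to that of Tacotron 2
Comparison of the lengths of 100 utterances generated for each of 5 speakers
- VITS learns phoneme durations that depend on the speaker

Comparison of the F0 contours of 10 utterances, extracted with the YIN algorithm
- VITS can generate speech with diverse pitches and rhythms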
- Synthesis Speed

Because no module for generating predefined intermediate representations is needed, sampling efficiency and speed are greatly improved
