[Paper Review] PortaSpeech: Portable and High-Quality Generative Text-to-Speech

feVeRin 2024. 3. 2. 12:13

PortaSpeech: Portable and High-Quality Generative Text-to-Speech

Non-autoregressive Text-to-Speech models can synthesize high-quality speech, but they have several limitations
- A VAE can capture long-range semantic features even with a small model size, but it often produces unnatural results
- A Normalizing Flow is good at reconstructing frequency bin-wise details, but it requires a large number of parameters
PortaSpeech
- A TTS model that supports high-quality speech synthesis with a lightweight architecture
- Uses a lightweight VAE with an enhanced prior and a flow-based post-net with strong conditional inputs
- Introduces grouped parameter sharing in the affine coupling layers of the post-net to reduce the model size
- Adopts a mixture alignment that combines hard word-level alignment with soft phoneme-level alignment
Paper (NeurIPS 2021) : Paper Link

1. Introduction

In Text-to-Speech (TTS), non-autoregressive architectures can achieve both high quality and fast inference
- In particular, recent TTS models aim to satisfy the following requirements
  1. Fast : Inference must be fast to reduce computational resource cost and to extend to real-time applications
  2. Lightweight : The model size and runtime memory must be small to deploy the model on mobile/edge devices
  3. High-Quality : To improve synthesis quality, the model must capture natural details such as adjacent harmonics, unvoiced frames, and high-frequency components
  4. Expressive : To generate expressive speech, the model must predict fundamental frequency and duration accurately through proper prosody modeling
  5. Diverse : Given a single text input sequence, the model must be able to synthesize samples with varied intonation
- A lightweight TTS model that meets these goals is needed

-> PortaSpeech is therefore proposed, which performs high-quality speech synthesis with a lightweight architecture

PortaSpeech
- Builds on the observation that a VAE is good at capturing long-range semantic features while a Normalizing Flow is good at reconstructing frequency bin-wise details
  - Combines a VAE with an enhanced prior and a flow-based post-net to generate high-quality speech
- Because a VAE can capture prosody well even with a small model size, a lightweight VAE is introduced to reduce the overall size of PortaSpeech
  - Additionally, grouped parameter sharing is applied to the post-net to compress the model size
- To obtain expressive prosody, a linguistic encoder is presented that uses a mixture alignment combining hard word-level alignment and soft phoneme-level alignment
  - This reduces the dependence on fine-grained alignment and eases the burden on the speech-to-text aligner

< Overview of PortaSpeech >

Combines the strengths of VAE and Normalizing Flow to generate detailed and expressive mel-spectrograms
Introduces a mixture alignment that improves prosody
Achieves high-quality synthesis with fewer parameters and less memory through a lightweight VAE and grouped parameter sharing

2. Background

- VAE

A VAE is a generative model of the form $p_{\theta}(x,z) =p(z)p_{\theta}(x|z)$
- Here $p(z)$ is the prior distribution over the latent variable $z$, and $p_{\theta}(x|z)$ is the likelihood function that generates data $x$ given $z$; it plays the role of the decoder and is parameterized by a neural network $\theta$
- The true posterior $p_{\theta}(z|x)$ over the latent variable is generally intractable, so it is approximated by a variational distribution $q_{\phi}(z|x)$, which can be viewed as the encoder
- The parameters $\theta, \phi$ are therefore optimized by maximizing the Evidence Lower BOund (ELBO):
  $\log p_{\theta}(x) \geq \mathbb{E}_{q_{\phi}(z|x)}\left[ \log \frac{p_{\theta}(x,z)}{q_{\phi}(z|x)}\right] = \mathbb{E}_{z\sim q_{\phi}(z|x)}\left[ \log p_{\theta}(x|z)-\log \frac{q_{\phi}(z|x)}{p(z)}\right]$
  $= \mathbb{E}_{z\sim q_{\phi}(z|x)}\left[ \log p_{\theta}(x|z)\right] - KL(q_{\phi}(z|x)|| p(z)) \equiv \mathcal{L}(\theta,\phi)$
- BVAE-TTS, a representative model that applies a VAE to TTS, uses a bidirectional-inference variational autoencoder
  - This has the advantage of capturing the dynamism and variability of ground-truth prosody
  - BUT, the generated mel-spectrograms are very blurry and over-smoothed due to posterior collapse, which results in unnatural speech
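The two ELBO terms above can be sketched numerically. This is a minimal sketch, assuming a diagonal Gaussian posterior, a standard normal prior $p(z)$, and a fixed-variance Gaussian decoder; the function names and the identity decoder are illustrative choices, not part of the paper.

```python
import math
import random


def kl_to_standard_normal(mu, sigma):
    """Closed-form KL( N(mu, sigma^2) || N(0, 1) ), summed over dimensions."""
    return sum(0.5 * (m * m + s * s - 1.0) - math.log(s) for m, s in zip(mu, sigma))


def reparameterize(mu, sigma, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, 1)."""
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]


def gaussian_log_likelihood(x, x_mean, noise_std=1.0):
    """log p(x|z) for a decoder that outputs the mean of a fixed-variance Gaussian."""
    const = -0.5 * math.log(2.0 * math.pi * noise_std ** 2)
    return sum(const - 0.5 * ((a - b) / noise_std) ** 2 for a, b in zip(x, x_mean))


def elbo(x, mu, sigma, decoder, rng, n_samples=1):
    """Single-datapoint ELBO: E_q[log p(x|z)] - KL(q(z|x) || p(z))."""
    recon = 0.0
    for _ in range(n_samples):
        z = reparameterize(mu, sigma, rng)
        recon += gaussian_log_likelihood(x, decoder(z))
    return recon / n_samples - kl_to_standard_normal(mu, sigma)


# Hypothetical toy setup: 1-D data, identity decoder, posterior equal to the prior.
rng = random.Random(0)
print(elbo([0.0], [0.0], [1.0], lambda z: z, rng, n_samples=8))
```

The reconstruction term is estimated by sampling, while the KL term is closed-form here only because both distributions are Gaussian.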

- Normalizing Flow

A Normalizing Flow is a generative model that offers exact log-likelihood evaluation and fully-parallel sampling
- A Normalizing Flow transforms a datapoint $x$ into a latent variable $z$ through an invertible function $f = f_{K}\circ \cdots \circ f_{1}$, applied in the order $f_{1}, ..., f_{K}$
  - A tractable prior $p_{\theta}(z)$ is assumed for the latent $z$, i.e. a simple distribution such as a Gaussian that is easy to sample from
- During training, the log-likelihood of a datapoint $x$ is computed with the change-of-variables rule:
  (Eq. 1) $\log p_{\theta}(x)=\log p_{\theta}(z)+\sum_{i=1}^{K}\log | \det (\textrm{d}h_{i}/\textrm{d}h_{i-1})|$
  - $h_{0} =x, h_{i}=f_{i}(h_{i-1}), h_{K}=z$, and $| \det (\textrm{d}h_{i}/\textrm{d}h_{i-1})|$ is the absolute Jacobian determinant of step $i$
  - The parameters of $f_{1}...f_{K}$ are then learned by maximizing (Eq. 1) over the training data; given $g=f^{-1}$, a sample $\hat{x}$ is generated by sampling $z\sim p_{\theta}(z)$ and computing $\hat{x} = g(z)$
- Flow-TTS and Glow-TTS are representative models that apply normalizing flows to TTS
  - Normalizing flows can resolve the blurry mel-spectrogram problem of VAEs
  - BUT, they generally require a large number of parameters
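(Eq. 1) and the sampling rule $\hat{x} = g(z)$ can be checked on a toy flow. This is a minimal sketch, assuming scalar data, a standard normal prior, and affine steps $h_{i} = a_{i}h_{i-1} + b_{i}$; `AffineStep` and the example coefficients are hypothetical, chosen only because their log-determinants are easy to verify by hand.

```python
import math
import random


class AffineStep:
    """One invertible step h_i = scale * h_{i-1} + shift (scale must be non-zero)."""

    def __init__(self, scale, shift):
        self.scale, self.shift = scale, shift

    def forward(self, h):
        # Returns the transformed value and log|det(dh_i / dh_{i-1})|.
        return self.scale * h + self.shift, math.log(abs(self.scale))

    def inverse(self, h):
        return (h - self.shift) / self.scale


def standard_normal_logpdf(z):
    return -0.5 * (math.log(2.0 * math.pi) + z * z)


def log_likelihood(x, steps):
    """Eq. 1: log p(x) = log p(z) + sum_i log|det(dh_i / dh_{i-1})|."""
    h, total_logdet = x, 0.0
    for step in steps:  # h_0 = x, ..., h_K = z
        h, logdet = step.forward(h)
        total_logdet += logdet
    return standard_normal_logpdf(h) + total_logdet


def sample(steps, rng):
    """Draw z ~ N(0, 1) and apply g = f^{-1}: the step inverses in reverse order."""
    h = rng.gauss(0.0, 1.0)
    for step in reversed(steps):
        h = step.inverse(h)
    return h


# Two steps compose to z = 6x, so x follows N(0, 1/36).
steps = [AffineStep(2.0, 1.0), AffineStep(3.0, -3.0)]
print(log_likelihood(0.1, steps), sample(steps, random.Random(0)))
```

Real TTS flows use coupling layers instead of fixed scalars, but the bookkeeping of accumulating log-determinants in the forward pass and inverting the steps in reverse order is the same.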

3. PortaSpeech

PortaSpeech is built by combining the strengths of VAE and Normalizing Flow
- PortaSpeech consists of:
  - A linguistic encoder with mixture alignment
  - A variational generator with an enhanced prior
  - A flow-based post-net with a grouped parameter sharing mechanism
- PortaSpeech operates as follows during training:
  1. A text sequence with word-level boundaries is fed to the linguistic encoder to extract linguistic features at both the phoneme and word levels
  2. To model expressiveness with a lightweight architecture, the VAE-based generator maximizes the ELBO of the ground-truth mel-spectrogram conditioned on the linguistic features
    - The prior distribution is modeled by a small volume-preserving normalizing flow
  3. Finally, to refine the speech details of the generated mel-spectrogram, the post-net is trained by maximizing the likelihood of the ground-truth mel-spectrogram conditioned on the linguistic features and the output of the variational generator
- During PortaSpeech inference:
  - The text passes through the linguistic encoder, the decoder of the variational generator, and the reversed flow-based post-net in turn, and is converted into a mel-spectrogram

- Linguistic Encoder with Mixture Alignment

To expand the length of the linguistic features, existing non-autoregressive TTS models introduce a duration predictor that predicts the duration of each phoneme
- The ground-truth phoneme durations (hard alignment) are obtained from an external model or from monotonic alignment training
- BUT, phoneme-level hard alignment has several problems
  1. The boundary between two phonemes is naturally uncertain, so the alignment model struggles to find accurate phoneme-level boundaries, and noise/errors are inevitable
  2. Alignment noise/errors affect the training of the duration predictor and in turn degrade prosody at inference time
- PortaSpeech therefore introduces a mixture alignment into the linguistic encoder, which uses soft alignment at the phoneme level and hard alignment at the word level
The linguistic encoder of PortaSpeech consists of a phoneme encoder, a word encoder, a duration predictor, and a word-to-phoneme attention module
- Given an input sequence with word boundaries,
  1. The linguistic encoder first encodes the phoneme sequence into phoneme hidden states $\mathcal{H}_{p}$
  2. Word-level pooling is then applied to $\mathcal{H}_{p}$ to obtain the input representation of the word encoder
    - It is obtained by averaging the phoneme hidden states inside each word according to the word boundaries
  3. Next, the word encoder encodes the word-level hidden states, which are then expanded with the word-level durations and a length regulator to match the length of the target mel-spectrogram, giving $\mathcal{H}_{w}$
  4. Finally, to obtain fine-grained linguistic information, a word-to-phoneme attention module is applied with $\mathcal{H}_{w}$ as the query and $\mathcal{H}_{p}$ as the key/value
    - Additionally, to reflect the monotonic nature of text-spectrogram alignment, a word-level relative positional encoding is added to both $\mathcal{H}_{p}$ and $\mathcal{H}_{w}$ before the attention module
- To predict the word-level duration, a duration predictor takes $\mathcal{H}_{p}$ as input and sums the predicted phoneme durations within each word
- As a result, the mixture alignment mechanism avoids noisy phoneme-level alignment extraction and uncertain duration prediction while keeping a fine-grained, soft, close-to-diagonal text-to-spectrogram alignment
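The data flow of the mixture alignment described above can be sketched as follows. This is a minimal sketch, assuming plain Python lists for hidden states and contiguous word indices starting at 0 with at least one phoneme per word. The learned parts (phoneme encoder, word encoder, attention projections, relative positional encoding) are deliberately left out, and all function names are hypothetical.

```python
import math


def word_level_pooling(h_p, word_ids):
    """Average the phoneme hidden states that belong to the same word."""
    n_words = max(word_ids) + 1
    dim = len(h_p[0])
    sums = [[0.0] * dim for _ in range(n_words)]
    counts = [0] * n_words
    for vec, w in zip(h_p, word_ids):
        counts[w] += 1
        for d, v in enumerate(vec):
            sums[w][d] += v
    return [[v / counts[w] for v in sums[w]] for w in range(n_words)]


def word_durations(phoneme_durations, word_ids):
    """Sum the phoneme durations inside each word (hard word-level alignment)."""
    totals = [0] * (max(word_ids) + 1)
    for dur, w in zip(phoneme_durations, word_ids):
        totals[w] += dur
    return totals


def length_regulate(h_word, durations):
    """Repeat each word vector for its number of frames, giving H_w."""
    return [list(vec) for vec, dur in zip(h_word, durations) for _ in range(dur)]


def word_to_phoneme_attention(h_w, h_p):
    """Scaled dot-product attention with H_w as query and H_p as key/value."""
    dim = len(h_p[0])
    scale = math.sqrt(dim)
    output = []
    for query in h_w:
        scores = [sum(q * k for q, k in zip(query, key)) / scale for key in h_p]
        peak = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - peak) for s in scores]
        norm = sum(weights)
        output.append([
            sum(w / norm * value[d] for w, value in zip(weights, h_p))
            for d in range(dim)
        ])
    return output


# Toy input: three phonemes, the first two belonging to word 0.
h_p = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0]]
word_ids = [0, 0, 1]
pooled = word_level_pooling(h_p, word_ids)       # word encoder would run on this
frames = word_durations([2, 3, 4], word_ids)
h_w = length_regulate(pooled, frames)
h_l = word_to_phoneme_attention(h_w, h_p)
print(pooled, frames, len(h_l))
```

The hard part of the alignment is the frame count per word, while the soft part is the attention weights over phonemes, so no phoneme-level boundary is ever needed.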

- Variational Generator with Enhanced Prior

To generate expressive speech with a lightweight architecture, a VAE-based mel-spectrogram generator called the variational generator is built
- A standard VAE uses a simple prior such as a Gaussian, which places a strong constraint on the posterior
  - Optimizing against a Gaussian prior pushes the posterior distribution toward the mean, which harms synthesis diversity
- To enhance the prior distribution, a small volume-preserving normalizing flow is introduced that transforms a simple distribution into a complex one through $K$ invertible mappings, and the resulting complex distribution is used as the VAE prior
  1. The objective of the mel-spectrogram generator with the flow-based enhanced prior is:
    (Eq. 2) $\log p(x|c)\geq \mathbb{E}_{q_{\phi}(z|x,c)}\left[ \log p_{\theta}(x|z,c)\right]-KL(q_{\phi}(z|x,c)|| p_{\bar{\theta}}(z|c)) \equiv \mathcal{L}(\phi,\theta,\bar{\theta})$
    - $\phi, \theta, \bar{\theta}$ : parameters of the VAE encoder, the VAE decoder, and the normalizing flow-based enhanced prior, respectively
    - $c$ : linguistic encoder output
  2. Because of the normalizing flow, the $KL$ term in (Eq. 2) has no simple closed-form solution, so it is written as an expectation over $q_{\phi}(z|x,c)$ and estimated with the Monte Carlo method:
    (Eq. 3) $KL(q_{\phi}(z|x,c)|| p_{\bar{\theta}}(z|c))= \mathbb{E}_{q_{\phi}(z|x,c)}\left[\log q_{\phi}(z|x,c)-\log p_{\bar{\theta}}(z|c)\right] $
- During training, the posterior distribution $N(\mu_{q},\sigma_{q})$ is encoded by the encoder of the variational generator
  - $z_{q}$ is then sampled from the posterior distribution via reparameterization and passed to the decoder of the variational generator
  - Meanwhile, the posterior distribution is fed to the volume-preserving flow (VP-Flow) and converted to a standard normal distribution
- During inference, the VP-Flow transforms a sample from the standard normal distribution into a sample $z_{p}$ of the prior distribution of the variational generator, and $z_{p}$ is passed to the decoder of the variational generator
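The Monte Carlo estimate of (Eq. 3) and the two directions of the VP-Flow can be illustrated on a toy case. This is a minimal sketch, assuming a 2-D latent and a single additive coupling step as the volume-preserving flow; `AdditiveCoupling` and the shift function `m` are hypothetical stand-ins, and the conditioning on $c$ is omitted.

```python
import math
import random


def normal_logpdf(x, mean=0.0, std=1.0):
    return -0.5 * math.log(2.0 * math.pi * std * std) - 0.5 * ((x - mean) / std) ** 2


class AdditiveCoupling:
    """Volume-preserving step on 2-D z: (z1, z2) -> (z1, z2 + m(z1)), log|det| = 0."""

    def __init__(self, m):
        self.m = m

    def forward(self, z):   # prior sample space -> standard normal space (training)
        return (z[0], z[1] + self.m(z[0]))

    def inverse(self, e):   # standard normal space -> prior sample space (inference)
        return (e[0], e[1] - self.m(e[0]))


def prior_logpdf(z, flow):
    """log p(z) = log N(f(z); 0, I); no log-det term because volume is preserved."""
    e = flow.forward(z)
    return normal_logpdf(e[0]) + normal_logpdf(e[1])


def monte_carlo_kl(mu_q, sigma_q, flow, rng, n_samples=1):
    """Eq. 3: average of log q(z) - log p(z) over reparameterized samples z ~ q."""
    total = 0.0
    for _ in range(n_samples):
        z = tuple(m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu_q, sigma_q))
        log_q = sum(normal_logpdf(v, m, s) for v, m, s in zip(z, mu_q, sigma_q))
        total += log_q - prior_logpdf(z, flow)
    return total / n_samples


def sample_prior(flow, rng):
    """Inference: draw e ~ N(0, I) and map it to a prior sample z_p = f^{-1}(e)."""
    return flow.inverse((rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)))


# A constant shift of 1 makes the prior N((0, -1), I), so KL(N(0, I) || prior) = 0.5.
flow = AdditiveCoupling(lambda z1: 1.0)
print(monte_carlo_kl((0.0, 0.0), (1.0, 1.0), flow, random.Random(0), n_samples=20000))
```

In training a single sample per utterance is typically enough for the estimate to act as a loss; the large sample count here only makes the toy value easy to compare with the analytic 0.5.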

- Flow-based Post-Net

Normalizing flows are particularly effective for generating high-quality mel-spectrograms
- Unlike a VAE, which often produces blurry outputs, a flow-based model can overcome the over-smoothing problem and generate realistic outputs
- PortaSpeech therefore introduces a flow-based post-net that refines the output of the variational generator in order to model the rich details of the mel-spectrogram
- The Post-Net adopts the Glow architecture and is conditioned on the outputs of the variational generator and the linguistic encoder
  1. During training,
    - The Post-Net transforms a mel-spectrogram sample into a latent prior distribution such as an isotropic multivariate Gaussian and computes the exact log-likelihood of the data with the change of variables
  2. During inference,
    - A latent variable is sampled from the latent prior distribution and passed through the post-net in reverse to generate a high-quality mel-spectrogram
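The training and inference directions of a conditional affine coupling layer can be sketched as below. This is a minimal sketch, assuming a flat list as input and a tiny hand-written function `toy_nn` in place of the WaveNet-like network; the other Glow components (activation normalization, invertible 1x1 convolution) are omitted, and every name and coefficient is hypothetical.

```python
import math


def toy_nn(x_a, cond):
    """Hypothetical stand-in for the coupling network: returns (log_scale, shift)."""
    log_scale = [math.tanh(0.5 * a + 0.1 * c) for a, c in zip(x_a, cond)]
    shift = [0.3 * a - 0.2 * c for a, c in zip(x_a, cond)]
    return log_scale, shift


def coupling_forward(x, cond, nn=toy_nn):
    """Training direction: data -> latent, also returns log|det J|."""
    half = len(x) // 2
    x_a, x_b = x[:half], x[half:]
    log_s, t = nn(x_a, cond)
    z_b = [b * math.exp(s) + sh for b, s, sh in zip(x_b, log_s, t)]
    return x_a + z_b, sum(log_s)


def coupling_inverse(z, cond, nn=toy_nn):
    """Inference direction: latent -> data, reusing the untouched half z_a = x_a."""
    half = len(z) // 2
    z_a, z_b = z[:half], z[half:]
    log_s, t = nn(z_a, cond)
    x_b = [(b - sh) * math.exp(-s) for b, s, sh in zip(z_b, log_s, t)]
    return z_a + x_b


x = [0.2, -1.0, 0.5, 2.0]   # one toy "frame" with four bins
cond = [1.0, -0.5]          # toy conditioning, one value per element of the first half
z, logdet = coupling_forward(x, cond)
print(z, logdet, coupling_inverse(z, cond))
```

Because the first half passes through unchanged, the inverse can recompute the same scale and shift, which is why the network itself never needs to be inverted.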
Flow-based models can generate high-quality outputs but generally have a large model size
- Because the conditional input already carries the text and prosody information, the Post-Net only needs to focus on modeling the details of the mel-spectrogram, which allows its capacity to be reduced
- To reduce the model size, PortaSpeech therefore introduces a grouped parameter sharing mechanism into the affine coupling layers, which shares some parameters across different flow steps $(f_{i},f_{i+1}, ..., f_{j})$
  - To do so, all flow steps $(f_{1},f_{2},...,f_{K})$ are divided into several groups, and the parameters of the WaveNet-like network $NN$ in the coupling layers are shared among the flow steps within a group
- Grouped parameter sharing is similar to a shared neural density estimator, but there are some differences
  1. The unshared conditional projection layers of the different flow steps can already indicate the step position, so the flow indication embedding is removed for further simplification
  2. Parameters are shared among the flow steps of a group rather than across all flow steps, so the number of parameters can be adjusted without changing the architecture
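The grouping scheme and its effect on the parameter count can be sketched as follows. This is a minimal sketch, assuming equally sized groups of consecutive flow steps; the class, the factory arguments, and the parameter counts in the example are hypothetical numbers chosen for illustration, not values from the paper.

```python
def assign_groups(num_steps, steps_per_group):
    """Map each flow step index to the index of the group whose NN it uses."""
    return [step // steps_per_group for step in range(num_steps)]


class FlowStep:
    def __init__(self, shared_nn, cond_projection):
        self.shared_nn = shared_nn              # same object for every step in a group
        self.cond_projection = cond_projection  # unique to this step, marks its position


def build_steps(num_steps, steps_per_group, make_nn, make_projection):
    """Create flow steps so that steps in one group reuse a single coupling network."""
    group_nns = {}
    steps = []
    for group in assign_groups(num_steps, steps_per_group):
        if group not in group_nns:
            group_nns[group] = make_nn()
        steps.append(FlowStep(group_nns[group], make_projection()))
    return steps


def count_parameters(num_steps, steps_per_group, shared_nn_params, unshared_params_per_step):
    """Total parameters when the coupling NN is shared within each group."""
    num_groups = len(set(assign_groups(num_steps, steps_per_group)))
    return num_groups * shared_nn_params + num_steps * unshared_params_per_step


print(assign_groups(6, 3))                      # [0, 0, 0, 1, 1, 1]
print(count_parameters(12, 1, 100_000, 5_000))  # no sharing: 1260000
print(count_parameters(12, 3, 100_000, 5_000))  # groups of 3: 460000
```

Changing only `steps_per_group` moves the model between no sharing and full sharing, which is the sense in which the parameter count is adjustable without touching the architecture.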

Affine Coupling Layer with Grouped Parameter Sharing

- Training and Inference

During training, the final loss of PortaSpeech consists of the following terms:
1. Duration prediction loss $L_{dur}$
  - MSE between the predicted and ground-truth word-level durations in log scale
2. Reconstruction loss of variational generator $L_{VG}$
  - MAE between the mel-spectrogram generated by the variational generator and the ground truth
3. KL-divergence of variational generator $L_{KL}$
  - Following (Eq. 3), $L_{KL}= \log q_{\phi}(z|x,c)-\log p_{\bar{\theta}}(z|c)$
  - $z\sim q_{\phi}(z|x,c)$
4. Negative log-likelihood of the Post-Net $L_{PN}$
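The first two loss terms and their combination can be written out directly. This is a minimal sketch, assuming nested Python lists for mel-spectrograms; the `+1` offset inside the logarithm and the equal loss weights are assumptions made for this example, since the review does not state either.

```python
import math


def duration_loss(pred_log_durations, gt_durations):
    """MSE in log scale; the +1 offset (keeps log finite for 0 frames) is an assumption."""
    errors = [(p - math.log(d + 1.0)) ** 2 for p, d in zip(pred_log_durations, gt_durations)]
    return sum(errors) / len(errors)


def mel_mae(pred_mel, gt_mel):
    """Mean absolute error over all frames and mel bins."""
    diffs = [abs(p - g) for pf, gf in zip(pred_mel, gt_mel) for p, g in zip(pf, gf)]
    return sum(diffs) / len(diffs)


def total_loss(l_dur, l_vg, l_kl, l_pn, weights=(1.0, 1.0, 1.0, 1.0)):
    """Weighted sum of the four terms; equal weights are an assumption."""
    return sum(w * l for w, l in zip(weights, (l_dur, l_vg, l_kl, l_pn)))


l_dur = duration_loss([math.log(3.0), math.log(6.0)], [2, 4])
l_vg = mel_mae([[1.0, 2.0], [3.0, 4.0]], [[1.0, 1.0], [1.0, 1.0]])
print(l_dur, l_vg, total_loss(l_dur, l_vg, 0.5, 2.0))
```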
During inference,
1. The linguistic encoder first encodes the text sequence, predicts the word-level durations, and expands the hidden states through the mixture alignment to obtain the linguistic hidden states $\mathcal{H}_{L}$
2. After $z$ is sampled from the enhanced prior, the decoder of the variational generator generates the coarse-grained mel-spectrogram $\bar{M}_{c}$ conditioned on $\mathcal{H}_{L}$
  - $\bar{M}_{c}$ : the output mel-spectrogram before the Post-Net
3. The Post-Net transforms a randomly sampled latent into the fine-grained mel-spectrogram $\bar{M}_{f}$ conditioned on $\mathcal{H}_{L}, \bar{M}_{c}$
4. Finally, $\bar{M}_{f}$ is converted into a waveform by a pre-trained vocoder
  - Because PortaSpeech uses hard word-level alignment, the absolute duration of each word can be specified at inference time, as in FastSpeech
  - For silences, a word boundary symbol such as $SIL$ is inserted between two words during training, so the silence duration can be controlled by modifying the duration of the special word $SIL$
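The word-level duration control mentioned in step 4 amounts to editing the predicted frame counts before length regulation. This is a minimal sketch, assuming durations are integer frame counts per word and that silences appear as the literal token `"SIL"`; the function and its arguments are hypothetical and not an interface from the paper.

```python
def control_durations(words, durations, sil_frames=None, overrides=None):
    """Return per-word frame counts after applying user-specified duration controls."""
    overrides = overrides or {}
    controlled = []
    for index, (word, frames) in enumerate(zip(words, durations)):
        if index in overrides:
            frames = overrides[index]      # absolute duration for one chosen word
        elif word == "SIL" and sil_frames is not None:
            frames = sil_frames            # fixed length for every silence token
        controlled.append(frames)
    return controlled


words = ["hello", "SIL", "world"]
predicted = [20, 5, 24]
print(control_durations(words, predicted, sil_frames=12, overrides={2: 30}))  # [20, 12, 30]
```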

4. Experiments

- Settings

Dataset : LJSpeech
Comparisons : Tacotron2, TransformerTTS, FastSpeech, FastSpeech2, Glow-TTS, BVAE-TTS

- Results

Preliminary Analysis on VAE and Flow
- Comparing the characteristics of VAE and Flow models in TTS:
- MOS-P : When the model capacity is reduced, the prosody of the flow-based model drops sharply, whereas the VAE is less affected
- MOS-Q : In terms of overall synthesis quality, the flow-based model outperforms the VAE

Performance
- In terms of synthesis quality, PortaSpeech (normal) achieves the best quality among the compared TTS models
- In terms of model size, PortaSpeech (small) requires the fewest parameters and the least memory
- In terms of inference speed, PortaSpeech (small) is 5.5x and 45.9x faster than Tacotron2 and TransformerTTS, respectively

Visualization
- Comparing the synthesized mel-spectrograms, PortaSpeech generates detailed mel-spectrograms in regions such as adjacent harmonics, unvoiced frames, and high-frequency components

- Ablation Studies

Enhanced Prior & Post-Net
- Removing the Enhanced Prior (EP) lowers CMOS-P, so the enhanced prior contributes to better prosody
- Removing the Post-Net (PN) lowers CMOS-Q, so the post-net has a large effect on audio quality

Mixture Alignment
- Replacing the Mixture Alignment (MA) causes a large drop in both CMOS-P and CMOS-Q
- Mixture alignment therefore affects both prosody and synthesis quality
- In terms of Average Absolute Error, the encoder with mixture alignment predicts durations more accurately
