[Paper Review] VarianceFlow: High-Quality and Controllable Text-to-Speech using Variance Information via Normalizing Flow

feVeRin 2024. 1. 29. 12:20

VarianceFlow: High-Quality and Controllable Text-to-Speech using Variance Information via Normalizing Flow

There are two ways to learn the one-to-many relationship between text and speech:
- Using a Normalizing Flow
- Reflecting variance information such as pitch and energy during synthesis
VarianceFlow
- Models the variance with a Normalizing Flow so that variance information is predicted more accurately
- The Normalizing Flow objective disentangles variance from text, which enables variance control
Paper (ICASSP 2022) : Paper Link

1. Introduction

In Text-to-Speech (TTS), the one-to-many relationship between text and speech is a major problem that makes synthesis difficult
- Autoregressive (AR) models address it by factorizing the speech distribution into a product of homogeneous conditional factors
  - BUT, AR models suffer from slow inference and exposure bias
- The Non-AR models that followed take two approaches to the one-to-many problem
  1. Using generative frameworks such as Normalizing Flow, Diffusion, and Generative Adversarial Network
    - Unlike standard Mean Squared Error (MSE) training, which assumes a Gaussian distribution, these frameworks do not assume a pre-defined form for the target distribution
    - As a result, they can produce more diverse samples than TTS models trained with an MSE loss
  2. Introducing variance information such as pitch and energy
    - The TTS model is split into two stages: text-conditioned variance modeling, and speech synthesis conditioned on text and variance information
    - In this case, variance factors can be controlled by adjusting the variance values during synthesis

-> VarianceFlow is therefore proposed as a TTS model that uses both Normalizing Flow and variance information

VarianceFlow
- Adopts a Normalizing Flow (NF) for variance modeling in place of the usual MSE loss
- Because NF is robust to the one-to-many problem, it learns the variance distribution better than an MSE loss and can improve quality
- NF disentangles the latent variance representation from text, which improves variance controllability

< Overview of VarianceFlow >

Variance is modeled with a Normalizing Flow, and variance is disentangled from text to enable variance control
As a result, it achieves better synthesis quality than other TTS models, demonstrating the benefit of its variance modeling
By fully using the given pitch value, it shows better variance controllability

2. VarianceFlow

- FastSpeech2

VarianceFlow uses FastSpeech2 as its baseline
- FastSpeech2 is a non-AR model that uses pitch and energy as variance information
  1. The phoneme sequence is encoded by a Feed-Forward Transformer (FFT) encoder and expanded to the target mel-spectrogram length using the given phoneme durations
  2. The ground-truth pitch and energy values extracted at the frame level are then projected into the text representation space
  3. Finally, the projected variance vectors are added to the expanded text representation, and the FFT decoder generates the mel-spectrogram
- Since ground-truth variance information is not available at inference, FastSpeech2 introduces variance predictors for variance modeling
  1. Each variance predictor is trained with an MSE loss to predict the ground-truth variance values from the input text
  2. The variance values predicted by the variance predictors are then used at inference
- Predicting pitch or energy from text is itself a one-to-many problem, so FastSpeech2 trains its pitch predictor on a wavelet-transformed pitch spectrogram instead of raw pitch values
  - Similarly, FastPitch uses phoneme-averaged pitch values to ease the difficulty of learning the pitch distribution
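
The three-step FastSpeech2 data path above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the helper names `length_regulate` and `add_variance` are hypothetical, and a simple per-dimension linear projection of the scalar variance value stands in for whatever projection the actual model uses.

```python
def length_regulate(phoneme_hidden, durations):
    """Expand phoneme-level vectors to frame level by repeating
    each vector `durations[i]` times."""
    frames = []
    for vec, dur in zip(phoneme_hidden, durations):
        frames.extend(list(vec) for _ in range(dur))
    return frames


def add_variance(frames, values, weight, bias):
    """Project each scalar variance value (e.g. pitch) to the hidden size
    with an assumed linear map (weight * v + bias) and add it to the frame."""
    out = []
    for frame, v in zip(frames, values):
        out.append([f + w * v + b for f, w, b in zip(frame, weight, bias)])
    return out


# Two phonemes with hidden size 2, lasting 2 frames and 1 frame.
hidden = [[1.0, 2.0], [3.0, 4.0]]
frames = length_regulate(hidden, [2, 1])
decoder_input = add_variance(frames, [1.0, 0.0, 2.0], [0.5, 0.5], [0.0, 0.0])
```

The result `decoder_input` is what would be passed to the FFT decoder; at inference the frame-level values would come from the variance predictor instead of the ground truth.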

- VarianceFlow

VarianceFlow is a TTS model that uses variance information such as pitch and energy on top of a Normalizing Flow (NF)
- During training
  1. The pitch and energy variance information is provided to the FFT decoder through the NF module
  2. The model then learns to generate speech from the variance information and the input text
    - At the same time, the NF module learns to match the latent variance distribution to a simple prior distribution
- At inference, the latent representation is sampled directly from the prior and provides the variance information to the FFT block
An NF consists of consecutive bijective transforms that map the complex variance distribution to a simple prior distribution
- Because the transforms are bijective, setting the latent variance distribution to a simple prior lets the probability density of the variance information be computed through the change of variables:
  (Eq. 1) $\log p_{\theta}(x|h) = \log p(z)+\sum_{i=1}^{k} \log | \det(J(f_{i}(x;h)))|$
  (Eq. 2) $z = f_{k}\circ f_{k-1} \circ \cdots \circ f_{1}(x;h)$
  - $x$ : variance factor, $h$ : hidden representation, $z$ : latent representation, $f_{i}$ : bijective transform
- Consequently, with a unit Gaussian as the prior, VarianceFlow is trained to maximize the log-likelihood of the variance information based on (Eq. 1) and (Eq. 2)
  - For parallel computation, VarianceFlow adopts rational-quadratic coupling transforms
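
To make (Eq. 1) and (Eq. 2) concrete, here is a minimal sketch of the change-of-variables log-likelihood for a scalar variance value. It deliberately swaps in a conditional affine transform for the rational-quadratic coupling transform used by VarianceFlow, and the parameterization (`a_s`, `b_s`, `a_t`, `b_t`, linear in a scalar `h`) is an assumption made purely for illustration.

```python
import math


def affine_forward(x, h, params):
    """One conditional affine bijection y = x * exp(s) + t.
    s and t depend on h through an assumed linear map.
    Returns y and log|det J|, which equals s for a scalar input."""
    a_s, b_s, a_t, b_t = params
    s = a_s * h + b_s
    t = a_t * h + b_t
    return x * math.exp(s) + t, s


def log_likelihood(x, h, layers):
    """log p(x|h) = log p(z) + sum_i log|det J_i| with a unit Gaussian prior."""
    z, log_det = x, 0.0
    for params in layers:
        z, ld = affine_forward(z, h, params)  # each layer sees the previous output
        log_det += ld
    log_prior = -0.5 * (z * z + math.log(2.0 * math.pi))
    return log_prior + log_det, z


# A single layer that doubles its input regardless of h.
layers = [(0.0, math.log(2.0), 0.0, 0.0)]
ll, z = log_likelihood(0.5, 3.0, layers)
```

Training would minimize `-ll` averaged over the data, which is the negative log-likelihood NF loss.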

- Loss Function

When the MSE losses of FastSpeech2's pitch and energy predictors are replaced with VarianceFlow's NF loss,
- The final objective function of VarianceFlow is:
  (Eq. 3) $\mathcal{L}_{total} = \mathcal{L}_{melspec} + \mathcal{L}_{duration} + \alpha \cdot \mathcal{L}_{pitch} + \alpha \cdot \mathcal{L}_{energy}$
  - The first two terms are the standard FastSpeech2 losses, and the last two terms, weighted by $\alpha$, are the negative log-likelihood NF losses for pitch and energy, respectively
- Each NF loss $\mathcal{L}_{NF}$ both learns the variance modeling and disentangles the variance factor from text, and it decomposes as:
  (Eq. 4) $\mathcal{L}_{NF} = D_{KL} \left[ q_{\theta}(z|h) || p(z) \right] + H(x|h)$
- Since the second (entropy) term is constant, the NF module of VarianceFlow is trained to minimize the Kullback-Leibler divergence between the conditional latent variance distribution $q_{\theta}(z|h)$ and the prior distribution $p(z)$
  - Because the prior $p(z)$ is chosen independently of $h$, this means the model is trained to disentangle $z$ from $h$
  - As a result, the model becomes more responsive to $z$, which enables precise control
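
A short derivation sketch of the decomposition in (Eq. 4) may help here. It introduces two symbols that are not in the original post: $p_{data}(x|h)$ for the true conditional distribution of the variance factor, and $f$ for the full composed flow with $z = f(x;h)$. It uses the change-of-variables identity $p_{data}(x|h) = q_{\theta}(z|h)\,|\det J_f(x;h)|$, which holds because $q_{\theta}(z|h)$ is the distribution of $z$ obtained by pushing the data through $f$.

```latex
\begin{align}
\mathcal{L}_{NF}
  &= \mathbb{E}_{x \sim p_{data}(x|h)}\left[ -\log p_{\theta}(x|h) \right] \\
  &= \mathbb{E}_{x}\left[ -\log p(z) - \log\left|\det J_f(x;h)\right| \right] \\
  &= \mathbb{E}_{x}\left[ -\log p(z) + \log q_{\theta}(z|h) - \log p_{data}(x|h) \right] \\
  &= D_{KL}\left[ q_{\theta}(z|h) \,||\, p(z) \right] + H(x|h)
\end{align}
```

The entropy $H(x|h)$ depends only on the data, not on $\theta$, which is why minimizing the NF loss amounts to minimizing the KL term.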

- Controlling a Variance Factor

The invertibility of the NF makes it possible to recover the raw variance value $x$ from the latent representation $z$
- As a result, VarianceFlow can control the variance factors
- To control the variance information,
  1. The sampled latent representation is brought into the raw variance space through the inverse transform of the NF
  2. The variance factor is then manipulated in that space
    - e.g.) multiplying the raw variance value $x$ by a constant
  3. Finally, the manipulated value is passed through the NF and provided to the FFT decoder
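
The control procedure above can be sketched with a toy invertible flow. Everything about the flow here is an assumption for illustration: a single affine map with a hypothetical text-dependent mean (`mean_from_text`) and a fixed spread `SIGMA_X` stands in for the trained NF module, and `h` is reduced to a scalar.

```python
import random

SIGMA_X = 30.0  # hypothetical spread (Hz) of raw pitch around its mean


def mean_from_text(h):
    """Hypothetical text-dependent pitch mean in Hz."""
    return 150.0 + 20.0 * h


def flow_forward(x, h):
    """Raw variance value x -> latent z."""
    return (x - mean_from_text(h)) / SIGMA_X


def flow_inverse(z, h):
    """Latent z -> raw variance value x."""
    return z * SIGMA_X + mean_from_text(h)


def controlled_latent(h, factor, sigma=0.667, rng=random):
    z = rng.gauss(0.0, sigma)         # sample a latent from the prior
    x = flow_inverse(z, h)            # 1. move to the raw variance space
    x_ctrl = factor * x               # 2. manipulate, e.g. scale the pitch
    z_ctrl = flow_forward(x_ctrl, h)  # 3. map back before the FFT decoder
    return x, x_ctrl, z_ctrl


x, x_ctrl, z_ctrl = controlled_latent(h=0.5, factor=1.2, rng=random.Random(0))
```

With `factor=1.0` the latent is unchanged, so the decoder input matches the uncontrolled case.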

3. Experiments

- Settings

Dataset : LJSpeech
Comparisons : Glow-TTS, FastSpeech2, Tacotron2

- Results

Speech Quality
- FastSpeech2 obtains a good MOS when variance information is used at the phoneme-averaged level
  - This indicates that a variance predictor trained with an MSE loss cannot capture the complex variance distribution
- VarianceFlow obtains a good MOS when variance information is provided at the frame level
  - This indicates that VarianceFlow can learn the complex variance distribution well through the NF
- Overall, VarianceFlow trained with frame-level variance information achieves the best synthesis quality

Controllability
- To compare controllability, the predicted pitch values are adjusted with various pitch shift coefficients $\lambda$
  - The F0 Frame Error (FFE) is measured between the pitch values provided to the decoder and the pitch values extracted from the generated audio samples
  - All audio samples are generated by shifting the pitch values in semitone units as follows:
  $f_{\lambda} = 2^{\frac{\lambda}{12}} \times f_{0}$
  - $f_{0}$ : pitch value before the shift
- Pitch Responsiveness
  - VarianceFlow follows the provided pitch closely for most values of $\lambda$, so pitch can be controlled precisely with VarianceFlow
  - In contrast, VarianceFlow-reversed, in which the direction of the normalizing flow is reversed so that the model cannot benefit from disentangling the variance information, shows the highest FFE
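
The semitone shift formula and the FFE metric used in this comparison can be sketched as follows. The FFE helper follows the common definition (fraction of frames with either a voicing decision error or a gross pitch error) with a 20% relative tolerance. That threshold and the convention that `0.0` marks an unvoiced frame are assumptions, since the post does not state them.

```python
def shift_pitch(f0_hz, semitones):
    """Apply f_lambda = 2^(lambda / 12) * f_0 to every frame.
    Unvoiced frames (0.0) stay at 0.0."""
    ratio = 2.0 ** (semitones / 12.0)
    return [f * ratio for f in f0_hz]


def f0_frame_error(reference, estimate, tol=0.2):
    """Fraction of frames with a voicing error or a pitch deviation
    larger than `tol` relative to the reference (assumed 20%)."""
    if len(reference) != len(estimate):
        raise ValueError("pitch tracks must have the same number of frames")
    errors = 0
    for ref, est in zip(reference, estimate):
        if (ref > 0) != (est > 0):
            errors += 1  # voicing decision error
        elif ref > 0 and abs(est - ref) > tol * ref:
            errors += 1  # gross pitch error
    return errors / len(reference)


target = shift_pitch([100.0, 0.0, 200.0], 12)  # one octave up
ffe = f0_frame_error([100.0, 0.0, 200.0, 100.0], [100.0, 50.0, 300.0, 110.0])
```

Here `target` plays the role of the pitch given to the decoder, and the second argument of `f0_frame_error` would be the pitch re-extracted from the generated audio.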

Comparison of the extracted pitch contour and the input pitch contour at $\lambda = 0$

Speech Quality with Pitch Shift
- Comparing the MOS of samples generated with shifted pitch,
- VarianceFlow achieves better quality than FastSpeech2 for every $\lambda$
  - This is because the NF loss acts as a regularizer while the latent representation is passed to the FFT decoder

FFE and MOS for each pitch shift scale $\lambda$

Diversity
- Samples are compared using different standard deviations $\sigma \in \{ 0.0, 0.667 \}$ for latent sampling
- As a result, VarianceFlow can synthesize natural speech even without sampling the latent representation ($\sigma = 0.0$)
  - When the latent representation is sampled, pitch variation appears across samples
- Additionally, sampling with the dropout layers of the FFT encoder enabled yields varied durations, increasing prosodic diversity without a large loss in quality

f0 contours of 10 samples generated with the same $\sigma$
