[Paper Review] CyFi-TTS: Cyclic Normalizing Flow with Fine-Grained Representation for End-to-End Text-to-Speech

feVeRin 2024. 1. 18. 18:19

CyFi-TTS: Cyclic Normalizing Flow with Fine-Grained Representation for End-to-End Text-to-Speech

End-to-End Text-to-Speech is difficult to apply to unseen data
The one-to-many problem creates an information gap between text and speech, which makes mispronunciation likely
CyFi-TTS
- Introduces a cyclic normalizing flow to close the information gap and synthesize natural speech
- Introduces a temporal multi-resolution upsampler to gradually generate fine-grained representations
Paper (ICASSP 2023) : Paper Link

1. Introduction

End-to-End Text-to-Speech (TTS) models can generate high-fidelity speech for seen datasets
- BUT, for unseen transcripts, the inferred speech may contain mispronunciations
  - This is because the acoustic generator faces a one-to-many problem when expanding the sequence between text and speech
  - In particular, the information gap makes it hard to match the expanded linguistic representation with the acoustic representation
- This problem can be addressed with a variance adaptor that provides additional information such as pitch and energy
  - BUT, a representative model, VITS, still struggles to capture the relationship from a coarse-grained representation, even though it uses a posterior encoder to reduce the information gap

-> Therefore, CyFi-TTS is proposed to capture fine-grained representations and reduce the information gap

CyFi-TTS
- Adopts a Cyclic Normalizing Flow (CNF) and a Temporal Multi-Resolution Upsampler (TMRU)
  - TMRU generates fine-grained representations by capturing distributed expressions across multiple frames
  - CNF uses cyclic representation learning to make the transformation from linguistic to acoustic representation easier, which enables natural speech at inference
- Applies a periodic inductive bias to explore expression boundaries
  - This is because the model easily overfits to the training dataset
- As a result, CyFi-TTS achieves superior performance on unseen transcripts

< Overview of CyFi-TTS >

TMRU gradually generates fine-grained representations and reduces the information gap
It extracts signals across multiple frames and uses the snake function to explore expression boundaries
CNF transforms the linguistic representation through cyclic representation learning

2. Method

- Prior and Posterior Encoder

CyFi-TTS contains a prior encoder and a posterior encoder
- The prior encoder extracts a linguistic representation from the given text, and Monotonic Alignment Search estimates the alignment $\mathcal{A}$ between text and speech
- The coarse-grained linguistic representation is then expanded to the length of the acoustic representation using the alignment
- A posterior encoder is introduced to reduce the information gap at this stage
  - The posterior encoder extracts an acoustic representation from the given speech
  - BUT, even with the posterior encoder, an information gap can remain when the sequence is expanded
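
The alignment-based expansion described above can be sketched as a matrix product. This is a minimal sketch, assuming a hard (0/1) monotonic alignment with exactly one phoneme per frame; the function name `expand_with_alignment` and the toy values are hypothetical and not taken from the paper.

```python
def expand_with_alignment(z_c, alignment):
    """Expand phoneme-level features (L x H) to frame level (T x H).

    alignment is an L x T matrix; alignment[i][t] == 1 means frame t
    is assigned to phoneme i (assumption: hard alignment).
    """
    num_phonemes, num_frames = len(alignment), len(alignment[0])
    hidden = len(z_c[0])
    return [
        [sum(alignment[i][t] * z_c[i][h] for i in range(num_phonemes))
         for h in range(hidden)]
        for t in range(num_frames)
    ]

# Toy example: 2 phonemes, 3 frames, feature dimension 2.
z_c = [[1.0, 2.0], [3.0, 4.0]]
alignment = [[1, 1, 0],
             [0, 0, 1]]
z_expanded = expand_with_alignment(z_c, alignment)
```

Every frame assigned to the same phoneme receives an identical vector, which is why the expanded representation is called coarse-grained.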

- Temporal Multi-Resolution Upsampler

CyFi-TTS introduces TMRU to generate fine-grained representations that can close the information gap
- In particular, the upsampler expands the sequence gradually and can capture distributed expressions across multiple frames
- To do so,
  1. Apply Gaussian pooling along the $T$ dimension to the alignment $\mathcal{A} \in \mathbb{R}^{L \times T}$,
    - $L, T$ : length of the phoneme and frame sequences, respectively
  2. Use the pooled alignment $\mathcal{A}' \in \mathbb{R}^{L \times \frac{1}{2}T}$ to convert the coarse-grained representation $\mathbf{z}_{c} \in \mathbb{R}^{L \times H}$ into the pooled representation $\mathbf{z}_{c'} \in \mathbb{R}^{\frac{1}{2}T \times H}$
    - $H$ : feature dimension
  3. Repeat this block $N$ times to upsample the sequence gradually
- Because the linguistic representation is spread across multiple frames, convolutions with various kernel sizes $K$ and dilations $D$ are used to extract diverse signals
  - As a result, the sum of the convolutions extracts the periodic components of the signal
  - To take every signal into account, the output of each convolution is given the same weight
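
Steps 1 and 2 above can be sketched as follows. This is only an illustration under stated assumptions: the review does not define the exact form of Gaussian pooling, so the window (`pad`), `stride=2`, and `sigma` used here are hypothetical choices, as are the function names. The multi-kernel convolutions are not included.

```python
import math

def gaussian_pool_alignment(alignment, stride=2, pad=1, sigma=1.0):
    """Pool an L x T alignment along T into L x (T // stride).

    Assumption: each output column is a Gaussian-weighted average of
    neighboring frames, with weights normalized to sum to 1.
    """
    num_frames = len(alignment[0])
    pooled = [[] for _ in alignment]
    for j in range(num_frames // stride):
        lo = max(0, j * stride - pad)
        hi = min(num_frames, j * stride + stride + pad)
        center = j * stride + (stride - 1) / 2
        weights = [math.exp(-((t - center) ** 2) / (2 * sigma ** 2))
                   for t in range(lo, hi)]
        total = sum(weights)
        for i, row in enumerate(alignment):
            value = sum(w * row[t] for w, t in zip(weights, range(lo, hi)))
            pooled[i].append(value / total)
    return pooled

def apply_alignment(z_c, alignment):
    """Map L x H features to (alignment columns) x H."""
    cols, hidden = len(alignment[0]), len(z_c[0])
    return [[sum(alignment[i][t] * z_c[i][h] for i in range(len(z_c)))
             for h in range(hidden)] for t in range(cols)]

# Toy example: L = 2 phonemes, T = 4 frames, H = 2.
alignment = [[1, 1, 0, 0],
             [0, 0, 1, 1]]
z_c = [[1.0, 0.0], [0.0, 1.0]]
pooled = gaussian_pool_alignment(alignment)   # shape L x T/2
z_half = apply_alignment(z_c, pooled)         # shape T/2 x H
column_sums = [sum(row[j] for row in pooled) for j in range(len(pooled[0]))]
```

With a hard input alignment, each pooled column still sums to 1, so every half-resolution frame is a convex mix of neighboring phoneme vectors rather than a copy of a single one.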
TTS models are easily biased toward seen datasets and are not robust to unseen transcripts
- To address this, the snake function is used to apply an inductive bias that can explore diverse expressions
  - In particular, BigVGAN uses this approach to improve extrapolation capability beyond the bounding region
- The periodic activation called the snake function is applied along the $T$ dimension:
  $f(x) = x + \frac{1}{\alpha} \sin^{2} (\alpha x)$
  - $\alpha$ : trainable parameter that regulates the periodic component of the signal frequency
- After the snake function, TMRU extracts temporal signals through PostConv
  - As a result, TMRU can generate fine-grained representations by taking temporal signals into account and providing detailed information for each frame
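
The snake function above is simple enough to write out directly. The sketch below uses a fixed scalar `alpha` for illustration (in the model it is a trainable parameter); the helper names and sample values are hypothetical.

```python
import math

def snake(x, alpha=1.0):
    """Snake activation: f(x) = x + (1 / alpha) * sin^2(alpha * x)."""
    return x + (math.sin(alpha * x) ** 2) / alpha

def snake_along_time(frames, alpha=1.0):
    """Apply the activation to each value of a sequence along T."""
    return [snake(v, alpha) for v in frames]

activated = snake_along_time([0.0, math.pi / 2, math.pi], alpha=1.0)
```

The function keeps the identity term $x$, so it stays monotone-like for optimization, while the $\sin^{2}$ term adds a periodic component with period $\pi / \alpha$: shifting the input by one period shifts the output by exactly the same amount.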

- Cyclic Normalizing Flow

A Variational AutoEncoder (VAE) is a generative model that, for a latent representation $z$ and speech $y$, extracts the frame-level representation $q(z|y)$ and reconstructs the input data via $p(y|z)$
- Text-to-speech synthesis is then performed using $p(z|x)$
  - $x$ : observed text data, $p(z|x) \rightarrow q(z|y)$
- A Normalizing Flow (NF) can transform a simple distribution into a complex one, and the reverse direction is also possible
  - Because the posterior $q(z|y)$ is more complex than the prior $p(z'|x)$, NF is applied to reduce the posterior distribution as $q(z|y) \rightarrow q(z'|y)$
- To minimize the discrepancy between the reduced posterior $q(z'|y)$ and the prior $p(z'|x)$, the KL-divergence loss $\mathcal{L}_{KL}$ is applied:
  $\mathcal{L}_{KL} = KL[q(z'|y) || p(z'|x)]$
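
As a worked example of the KL term, the sketch below evaluates the closed-form KL divergence between two diagonal Gaussians. Treating both $q(z'|y)$ and $p(z'|x)$ as diagonal Gaussians is an assumption made here for illustration; the review does not state how the term is computed, and actual implementations may estimate it from samples instead. The function name is hypothetical.

```python
import math

def kl_diag_gaussians(mu_q, sigma_q, mu_p, sigma_p):
    """KL[N(mu_q, sigma_q^2) || N(mu_p, sigma_p^2)], summed over dimensions."""
    total = 0.0
    for mq, sq, mp, sp in zip(mu_q, sigma_q, mu_p, sigma_p):
        total += math.log(sp / sq) + (sq ** 2 + (mq - mp) ** 2) / (2 * sp ** 2) - 0.5
    return total

# Identical distributions give zero divergence.
kl_same = kl_diag_gaussians([0.0, 0.0], [1.0, 1.0], [0.0, 0.0], [1.0, 1.0])
# Shifting the posterior mean by 1 (unit variance) gives 0.5.
kl_shifted = kl_diag_gaussians([1.0], [1.0], [0.0], [1.0])
```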
During training, the decoder takes the acoustic representation from the posterior $q(z|y)$
- BUT, because a mismatch can arise, the decoder has to account for the enhanced prior $p(z|x)$ at inference
  - In particular, NF can transform the representation in the reverse direction, but it does not learn that transformation using the prior $p(z'|x)$
- Therefore, to mitigate the mismatch, CNF is introduced to perform the transformation $p(z'|x) \rightarrow p(z|x) \rightarrow p(z''|x)$ using the linguistic representation
  - This makes it possible to match the representations of $f(f^{-1} (p(z'|x)))$ and $p(z'|x)$
  - Supporting bidirectional transformation with the linguistic representation reduces the information gap during training
- With $p(z''|x) = f(f^{-1} (p(z'|x)))$, the cycle consistency loss $\mathcal{L}_{cc}$ is:
  $\mathcal{L}_{cc} = KL [p(z''|x) || p(z'|x)]$

- Joint Training of the Acoustic Generator and Neural Vocoder

A stochastic duration predictor is used, which estimates the duration distribution of the text by sampling from random noise
- The duration loss $\mathcal{L}_{dur}$ is used to train the stochastic duration predictor
- CyFi-TTS adopts HiFi-GAN as the decoder
  - Adversarial training is used so that the model $G(\cdot)$ generates high-fidelity speech that can fool the discriminator $D(\cdot)$
  - The discriminator is trained to distinguish the generated speech $G(x)$ from the ground-truth speech $y$
- HiFi-GAN consists of a Multi-Scale Discriminator (MSD) and a Multi-Period Discriminator (MPD)
  - The least-square loss $\mathcal{L}_{adv}$ and the feature matching loss $\mathcal{L}_{fm}$ are used for training:
  $\mathcal{L}_{adv}(D) = \mathbb{E} [ ( D(y) - 1)^{2} + ( D(G(x)))^{2} ]$
  $\mathcal{L}_{adv}(G) = \mathbb{E} [ ( D(G(x))-1)^{2}]$
  $\mathcal{L}_{fm} = \mathbb{E} [ \sum_{i=1}^{L} || D^{i} (y) - D^{i}(G(x))||_{1} ]$
  - $L$ : total number of layers in the discriminator
  - $D^{i}$ : feature map of the $i$th layer of the discriminator
- In addition, the reconstruction loss $\mathcal{L}_{rec}$ is used to match acoustic features:
  $\mathcal{L}_{rec} = || \phi (G(x))-\phi(y) ||_{1}$
  - $\phi$ : converts speech into a mel-spectrogram
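
The adversarial, feature matching, and reconstruction terms can be sketched on plain Python lists. This is a minimal sketch, not the training code: discriminator outputs, per-layer feature maps, and mel-spectrograms are assumed to be already computed and flattened into lists, the feature matching term is taken as a sum of per-layer L1 distances, and all function names are hypothetical.

```python
def lsgan_discriminator_loss(d_real, d_fake):
    """Mean of (D(y) - 1)^2 plus mean of D(G(x))^2."""
    real_term = sum((r - 1.0) ** 2 for r in d_real) / len(d_real)
    fake_term = sum(f ** 2 for f in d_fake) / len(d_fake)
    return real_term + fake_term

def lsgan_generator_loss(d_fake):
    """Mean of (D(G(x)) - 1)^2."""
    return sum((f - 1.0) ** 2 for f in d_fake) / len(d_fake)

def l1_distance(a, b):
    """L1 norm of the element-wise difference of two flat lists."""
    return sum(abs(x - y) for x, y in zip(a, b))

def feature_matching_loss(feats_real, feats_fake):
    """Sum over discriminator layers of the L1 distance between feature maps."""
    return sum(l1_distance(fr, ff) for fr, ff in zip(feats_real, feats_fake))

def reconstruction_loss(mel_fake, mel_real):
    """L1 distance between flattened mel-spectrograms."""
    return l1_distance(mel_fake, mel_real)

# Toy values: a discriminator that is perfectly right, and one fooled generator sample.
d_loss = lsgan_discriminator_loss(d_real=[1.0], d_fake=[0.0])
g_loss = lsgan_generator_loss(d_fake=[1.0])
fm = feature_matching_loss([[1.0, 2.0], [3.0]], [[1.0, 1.0], [1.0]])
```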

- Training Objective Function

The final loss function of CyFi-TTS is:
$\mathcal{L} = \lambda_{kl}\mathcal{L}_{KL} + \lambda_{cc} \mathcal{L}_{cc} + \lambda_{rec} \mathcal{L}_{rec} + \lambda_{dur}\mathcal{L}_{dur} + \lambda_{adv} \mathcal{L}_{adv}(G) + \lambda_{fm} \mathcal{L}_{fm}$
- $\lambda_{kl} = 1.0, \lambda_{cc} = 1.0, \lambda_{rec} = 45.0, \lambda_{dur}= 1.0, \lambda_{adv}= 1.0, \lambda_{fm}= 2.0$
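
The weighted sum is straightforward to write down. In the sketch below the weights are the values listed above, while the dictionary keys, the function name, and the placeholder loss values are hypothetical.

```python
# Loss weights as listed above.
LOSS_WEIGHTS = {
    "kl": 1.0,
    "cc": 1.0,
    "rec": 45.0,
    "dur": 1.0,
    "adv": 1.0,
    "fm": 2.0,
}

def total_loss(losses, weights=LOSS_WEIGHTS):
    """Weighted sum of the individual loss terms."""
    return sum(weights[name] * losses[name] for name in weights)

# Placeholder values: every term set to 1.0, so the result equals the sum of the weights.
example = total_loss({name: 1.0 for name in LOSS_WEIGHTS})
```

The reconstruction term has by far the largest weight, so mel-spectrogram fidelity dominates the objective unless the other terms are much larger in raw scale.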

3. Experiments

- Settings

Dataset : LJSpeech, LibriSpeech
Comparisons : VITS, BVAE-TTS, PortaSpeech

- Results

Subjective/Objective Evaluation
- The subjective synthesis quality of CyFi-TTS is measured at 4.02 MOS on the seen dataset and 3.98 MOS on unseen data
- In terms of Character Error Rate (CER) and Word Error Rate (WER), CyFi-TTS shows clearer pronunciation
  - This is because CyFi-TTS can capture detailed expressions during upsampling
- The DDUR result indicates that the expanded linguistic representation of CyFi-TTS is close to the speaker's speaking rate
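
For reference, CER and WER are both edit distances normalized by the reference length, computed over characters and words respectively. The sketch below shows the standard definition; how the hypothesis transcript is obtained for synthesized speech is not described in this review, and the function names and example strings are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (insert, delete, substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,             # deletion
                           cur[j - 1] + 1,          # insertion
                           prev[j - 1] + (r != h))) # substitution or match
        prev = cur
    return prev[-1]

def cer(reference, hypothesis):
    """Character Error Rate: character edits divided by reference length."""
    return edit_distance(list(reference), list(hypothesis)) / len(reference)

def wer(reference, hypothesis):
    """Word Error Rate: word edits divided by number of reference words."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)

example_wer = wer("the cat sat", "the cat sit")
example_cer = cer("abc", "abd")
```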

In terms of CMOS against VITS, CyFi-TTS also performs better on both seen and unseen data

- Ablation Study

Comparing the frequency components of the synthesized speech through spectrograms:
- Low frequencies show the formants that carry pronunciation information, while high frequencies reflect voice characteristics
- As a result, CyFi-TTS has energy and formants similar to the ground-truth at low frequencies
- In particular, VITS shows coarser voice characteristics at high frequencies than CyFi-TTS

Comparing CMOS per module, removing TMRU or CNF, both of which affect the fine-grained representation, degrades CMOS
