[Paper Review] TriniTTS: Pitch-Controllable End-to-End TTS without External Aligner

feVeRin 2024. 3. 14. 10:27

TriniTTS: Pitch-Controllable End-to-End TTS without External Aligner

A text-to-speech model is needed that provides an end-to-end architecture, prosody control, and on-the-fly duration alignment all at once
- Most existing models depend on a two-stage pipeline and lack controllability
TriniTTS
- An end-to-end text-to-speech model that supports pitch control without an external aligner
- It performs alignment search, pitch estimation, and waveform generation simultaneously, learning a latent vector that represents the data distribution of speech
Paper (INTERSPEECH 2022): Paper Link

1. Introduction

Neural Text-to-Speech (TTS) generally works in a two-stage manner
- An acoustic model generates a mel-spectrogram, and a vocoder synthesizes waveform samples from that mel-spectrogram
  - This two-stage pipeline is inefficient because the two models have to be trained separately using ground-truth mel-spectrograms
  - In addition, the vocoder is trained on ground-truth mel-spectrograms but receives predicted ones at inference, so the input data distributions of training and inference are mismatched and synthesis quality degrades
- An end-to-end model is therefore needed to address the drawbacks of the two-stage pipeline
  - End-to-end training yields a single model and is not restricted to learning a particular intermediate acoustic feature, so the model can learn more powerful latent features
  - Notably, VITS greatly improved TTS performance with an end-to-end architecture based on a conditional variational autoencoder
- BUT, end-to-end models generally cannot control rhythm and pitch, because their latent vectors are sampled stochastically

-> The paper therefore proposes TriniTTS, a pitch-controllable end-to-end TTS model

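TriniTTS
- A TTS model that supports deterministic, controllable prosody modeling, works end-to-end, and does not use an external aligner
  - This provides pitch controllability and removes the inefficiency of the two-stage pipeline
- As a result, it achieves stable synthesis quality together with fast inference speed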

< Overview of TriniTTS >

An end-to-end TTS model that supports pitch control without an external aligner
It performs alignment search, pitch estimation, and waveform generation simultaneously, learning a latent vector that represents the data distribution of speech

2. Model Description

TriniTTS consists of a text encoder, post encoder, decoder, pitch control, and alignment search
- Overall, TriniTTS works end-to-end much like VITS, but there are several differences
  - It uses a different alignment search algorithm
  - It has pitch-related modules
  - For sampling, VITS follows a conditional variational autoencoder, whereas TriniTTS uses a straightforward deterministic process
- The post encoder, the discriminator of the decoder, and the alignment search module are used only during training and not at inference
- In the multi-speaker setting, a speaker embedding is additionally provided as input to the pitch predictor, the duration predictor, the query/key encoders of the alignment search module, the post encoder, and the decoder

- Text Encoder

Each token of the pre-processed text token sequence is mapped individually to an embedding in an embedding lookup table
- The embedding sequence is fed into a transformer-based text encoder to learn the text hidden state $h_{text}$

- Post Encoder

The post encoder follows the posterior encoder of VITS
- It uses non-causal WaveNet residual blocks to capture a latent representation of the given spectrogram
  1. TriniTTS uses the spectrogram $x_{spec}$ as the input to the post encoder, and
  2. It uses a bridge loss to guide only the intermediate representation of the prior encoder part
- The bridge loss is defined as the $L1$ loss between the latent vector of the posterior data distribution and the intermediate representation produced by the prior encoder part given $x_{text}$ as input:
  (Eq. 1) $\mathcal{L}_{bridge}=||PostEnc(x_{spec})-PriorEnc(x_{text})||_{1}$
  - $PostEnc(\cdot)$ : post encoder, $PriorEnc(\cdot)$ : the entire set of modules that produces the intermediate representation $z$
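As a rough illustration of Eq. 1, the sketch below computes an L1 distance between two representations of the same shape. The function name `l1_loss`, the toy values, and the choice to average over elements (rather than sum) are assumptions made for this example and are not taken from the paper. It also assumes the prior-side representation has already been brought to the same frame length as the post encoder output.

```python
def l1_loss(a, b):
    """Mean absolute difference between two equally shaped [frame][channel] lists."""
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        raise ValueError("both representations must have the same shape")
    total = sum(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))
    count = sum(len(ra) for ra in a)
    return total / count


# Hypothetical stand-ins for PostEnc(x_spec) and PriorEnc(x_text):
# 3 frames with 2 channels each.
z_post = [[0.5, -1.0], [0.0, 2.0], [1.5, 0.5]]
z_prior = [[0.0, -1.0], [1.0, 2.0], [1.5, -0.5]]

bridge = l1_loss(z_post, z_prior)
print(f"bridge loss: {bridge:.4f}")
```

Because the post encoder is only needed to produce the target side of this loss, it can be dropped at inference, which matches the training-only role described in the model overview.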

- Decoder

HiFi-GAN is adopted as the decoder to convert the intermediate representation $z$ into a waveform $\hat{y}_{wav}$
- Structurally, in the decoder
  1. The generator $G$ is trained adversarially to produce waveforms that resemble the ground-truth waveforms
  2. The discriminator $D$ adopts a multi-scale discriminator and a multi-period discriminator
- The adversarial losses $\mathcal{L}_{adv}(G), \mathcal{L}_{adv}(D)$ for the generator and discriminator are:
  (Eq. 2) $\mathcal{L}_{adv}(G)=\mathbb{E}[(1-D(G(z)))^{2}]$
  (Eq. 3) $\mathcal{L}_{adv}(D)=\mathbb{E}_{(y_{wav},z)}[(1-D(y_{wav}))^{2}+D(G(z))^{2}]$
- Besides the adversarial loss $\mathcal{L}_{adv}$, the decoder is also trained with a feature matching loss $\mathcal{L}_{fm}$ and a reconstruction loss $\mathcal{L}_{recon}$
  1. The feature matching loss $\mathcal{L}_{fm}$ is:
    (Eq. 4) $\mathcal{L}_{fm}(G)=\mathbb{E}_{(y_{wav},z)}\left[ \sum_{l=1}^{T}\frac{1}{N_{l}} || D^{l}(y_{wav})-D^{l}(G(z)) ||_{1} \right]$
    - $T$ : number of layers in the discriminator $D$, $D^{l}$ : feature map of the $l$-th layer of $D$, $N_{l}$ : number of features in the $l$-th layer
    - The feature maps of $y_{wav}$ and $G(z)$ are compared layer by layer, and each difference is normalized by the number of features $N_{l}$ to compute $\mathcal{L}_{fm}$
  2. The generated waveform $\hat{y}_{wav}$ and the target waveform $y_{wav}$ are each converted to mel-spectrograms $\hat{y}_{mel}, y_{mel}$ to compute the reconstruction loss $\mathcal{L}_{recon}$:
    (Eq. 5) $\mathcal{L}_{recon}= || \hat{y}_{mel}-y_{mel}||_{1}$
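To make Eq. 2 to Eq. 4 concrete, here is a minimal sketch of the least-squares adversarial losses and the feature matching loss on plain Python lists. The function names and the toy discriminator scores and feature maps are hypothetical; a real implementation would operate on tensors produced by the multi-scale and multi-period discriminators, and the expectation is approximated here by a simple mean over the given scores.

```python
def mean(values):
    values = list(values)
    return sum(values) / len(values)


def adv_loss_generator(d_fake):
    """Eq. 2: the generator wants D(G(z)) to be close to 1."""
    return mean((1.0 - s) ** 2 for s in d_fake)


def adv_loss_discriminator(d_real, d_fake):
    """Eq. 3: the discriminator wants D(y_wav) near 1 and D(G(z)) near 0."""
    return mean((1.0 - s) ** 2 for s in d_real) + mean(s ** 2 for s in d_fake)


def feature_matching_loss(feats_real, feats_fake):
    """Eq. 4: per-layer L1 distance, each divided by that layer's feature count N_l."""
    return sum(
        mean(abs(r - f) for r, f in zip(layer_real, layer_fake))
        for layer_real, layer_fake in zip(feats_real, feats_fake)
    )


# Hypothetical discriminator outputs for real and generated waveforms.
d_real, d_fake = [0.9, 0.8], [0.2, 0.4]
# Hypothetical flattened feature maps from two discriminator layers.
feats_real = [[1.0, 2.0], [0.0, 1.0, 2.0, 3.0]]
feats_fake = [[0.5, 2.5], [0.0, 1.0, 1.0, 3.0]]

loss_g = adv_loss_generator(d_fake)
loss_d = adv_loss_discriminator(d_real, d_fake)
loss_fm = feature_matching_loss(feats_real, feats_fake)
print(loss_g, loss_d, loss_fm)
```

Taking the mean of the absolute differences within a layer is the same as computing the L1 norm and dividing by $N_{l}$, so layers with many features do not dominate the sum.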

- Pitch Control

Pitch control is defined as adjusting the pitch of the generated speech at the level of individual tokens
- The amount by which a pitch value changes should be proportional to the deviation of the tuned pitch control parameter
- To achieve this, the pitch predictor and pitch encoder modules are connected in series
  1. First, the pitch predictor predicts the normalized pitch value $\hat{x}_{pitch}$ assigned to each token
  2. The pitch encoder module produces a pitch hidden state $h_{pitch}$ for each token from the sequence of float pitch values
  3. The text hidden state $h_{text}$ and the pitch hidden state $h_{pitch}$ are then summed to form the intermediate representation
  4. The joint distribution of text and pitch data is learned in this intermediate representation
- At inference, the predicted pitch value of each token is adjusted via the pitch control parameter, which manipulates the pitch hidden state $h_{pitch}$ added to the text hidden state $h_{text}$
- Structurally, the pitch predictor and pitch encoder are built the same way as in FastPitch
  - Pitch estimation requires ground-truth pitch values for the spectrogram frames
  - The pitch value of each waveform is therefore extracted with the pYIN algorithm, which selects a probabilistic result via Viterbi decoding from candidates pre-computed by the YIN algorithm
- The optimal alignment obtained by alignment search is used to average the ground-truth pitch $x_{pitch}$ over the duration of each token, and this average is given to the pitch predictor as the target
  - The pitch loss $\mathcal{L}_{pitch}$ is:
  (Eq. 6) $\mathcal{L}_{pitch}= || \hat{x}_{pitch}-x_{pitch}||_{2}^{2}$
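The token-level pitch target and Eq. 6 can be sketched as follows. This is only an illustration under stated assumptions: frame-level pitch is assumed to be already normalized with unvoiced frames filled in, `durations` stands for the per-token frame counts from the optimal alignment, and `shift_pitch` is a hypothetical additive form of the pitch control parameter (the review only states that the change should be proportional to the parameter's deviation).

```python
def average_pitch_per_token(frame_pitch, durations):
    """Average frame-level pitch over the frames aligned to each token."""
    if sum(durations) != len(frame_pitch):
        raise ValueError("durations must cover every frame exactly once")
    targets, start = [], 0
    for d in durations:
        segment = frame_pitch[start:start + d]
        targets.append(sum(segment) / len(segment) if segment else 0.0)
        start += d
    return targets


def pitch_loss(predicted, target):
    """Eq. 6: squared L2 distance between predicted and target token pitch."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target))


def shift_pitch(predicted, delta):
    """Hypothetical pitch control: move every token's pitch by the same offset."""
    return [p + delta for p in predicted]


# Toy example: 6 frames aligned to 3 tokens with durations 2, 3, 1.
frame_pitch = [1.0, 3.0, 2.0, 2.0, 2.0, -1.0]
durations = [2, 3, 1]
x_pitch = average_pitch_per_token(frame_pitch, durations)

x_pitch_hat = [1.5, 2.0, 0.0]           # hypothetical pitch predictor output
loss = pitch_loss(x_pitch_hat, x_pitch)
controlled = shift_pitch(x_pitch_hat, 0.5)
print(x_pitch, loss, controlled)
```

In this sketch the controlled values would then go through the pitch encoder to produce $h_{pitch}$, so changing `delta` changes the representation added to $h_{text}$ without retraining.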

- Alignment Search

Alignment search means finding the alignment mapping between text tokens and spectrogram frames that maximizes the likelihood of the text given the spectrogram
- Specifically, under the monotonic alignment assumption, the goal is to find the optimal alignment that maximizes the joint likelihood between the text tokens and the spectrogram frames
  - For this, TriniTTS adopts a duration search algorithm that includes neural networks for query and key encoding
- The alignment search module
  1. It consists of a query encoder $\Phi_{query}$ that takes the text hidden state $h_{text}$ as input and a key encoder $\Phi_{key}$ that takes the spectrogram $x_{spec}$ as input
  2. The query/key encoders produce the hidden vectors $\phi_{query}, \phi_{key}$ used to search the alignment map
    - The soft alignment map $A_{soft}$ is based on the learned pairwise affinity between the query vectors $\phi_{query}$ and the key vectors $\phi_{key}$
  3. Then, under the monotonic alignment constraint, all possible alignment candidates are extracted from the soft alignment map and the most likely path is searched
    - The forward-sum algorithm, computed with a CTC loss, is used to obtain the loss over the possible candidates
- As a result, the duration predictor learns to estimate the duration of each token, with the text hidden state $h_{text}$ as input and the extracted optimal alignment as the target
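The alignment search steps above can be sketched in plain Python. Several details here are assumptions made for illustration and are not confirmed by the review: the affinity is taken to be the negative squared distance between key and query vectors with a softmax over tokens, the forward-sum is written directly in the probability domain without the blank symbol that a CTC implementation would use, and the hard alignment is found with a simple Viterbi search in which every token receives at least one frame.

```python
import math


def soft_alignment(queries, keys):
    """A_soft[t][n]: softmax over tokens n of -||key_t - query_n||^2 for each frame t."""
    a_soft = []
    for key in keys:
        logits = [-sum((k - q) ** 2 for k, q in zip(key, query)) for query in queries]
        peak = max(logits)
        exps = [math.exp(v - peak) for v in logits]
        total = sum(exps)
        a_soft.append([e / total for e in exps])
    return a_soft


def forward_sum_nll(a_soft):
    """Negative log of the summed probability of all monotonic alignments."""
    n_tokens = len(a_soft[0])
    alpha = [a_soft[0][0]] + [0.0] * (n_tokens - 1)
    for row in a_soft[1:]:
        alpha = [(alpha[n] + (alpha[n - 1] if n > 0 else 0.0)) * row[n]
                 for n in range(n_tokens)]
    return -math.log(alpha[-1])


def monotonic_durations(a_soft):
    """Viterbi search for the most likely monotonic path; returns frames per token."""
    n_frames, n_tokens = len(a_soft), len(a_soft[0])
    if n_frames < n_tokens:
        raise ValueError("need at least one frame per token")
    neg_inf = float("-inf")
    logp = [[math.log(max(p, 1e-12)) for p in row] for row in a_soft]
    score = [[neg_inf] * n_tokens for _ in range(n_frames)]
    score[0][0] = logp[0][0]
    for t in range(1, n_frames):
        for n in range(min(t + 1, n_tokens)):
            move = score[t - 1][n - 1] if n > 0 else neg_inf
            score[t][n] = max(score[t - 1][n], move) + logp[t][n]
    durations = [0] * n_tokens
    n = n_tokens - 1
    for t in range(n_frames - 1, -1, -1):  # backtrack from the last frame
        durations[n] += 1
        if t > 0 and n > 0 and score[t - 1][n - 1] >= score[t - 1][n]:
            n -= 1
    return durations


# Toy 1-dimensional query (token) and key (frame) vectors.
queries = [[0.0], [1.0], [2.0]]
keys = [[0.0], [0.1], [1.0], [2.0], [2.1]]
a_soft = soft_alignment(queries, keys)
nll = forward_sum_nll(a_soft)
durations = monotonic_durations(a_soft)
print(durations, nll)
```

The forward-sum value is what would be minimized to train the query and key encoders, while the Viterbi durations are the kind of hard alignment that could serve as duration predictor targets. For long utterances the forward-sum should be computed in the log domain to avoid underflow.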

3. Experiments

- Settings

Dataset : LJSpeech, VCTK
Comparisons : FastPitch, Glow-TTS, VITS

- Results

Single Speaker
- Comparing single-speaker synthesis in terms of the fundamental frequency $F0$, TriniTTS shows a higher $F0$ standard deviation than FastPitch
- That is, TriniTTS can synthesize speech over a more diverse pitch range
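For readers unfamiliar with this metric, the sketch below shows one way an $F0$ standard deviation could be computed from a frame-level pitch track. The convention that unvoiced frames are stored as 0 and the toy contours are assumptions for illustration only; the review does not specify how the paper computed the statistic, and the numbers are not results from the paper.

```python
import statistics


def f0_std(f0_hz):
    """Population standard deviation of F0 over voiced frames (f0 > 0)."""
    voiced = [f for f in f0_hz if f > 0]
    if not voiced:
        raise ValueError("no voiced frames")
    return statistics.pstdev(voiced)


# Hypothetical pitch tracks in Hz; 0 marks an unvoiced frame.
varied = [0.0, 100.0, 120.0, 0.0, 140.0]
flat = [0.0, 118.0, 120.0, 122.0]

std_varied = f0_std(varied)
std_flat = f0_std(flat)
print(std_varied, std_flat)
```

A larger value means the pitch contour moves over a wider range, which is the sense in which a higher $F0$ standard deviation indicates more varied intonation.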

In the MOS comparison of synthesis quality, TriniTTS also performs comparably to the existing models
- In particular, TriniTTS achieves the highest MOS when pitch control is applied

Multi Speaker
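- Synthesis quality in the multi-speaker setting is compared on the VCTK dataset
- As in the single-speaker case, TriniTTS performs comparably to the existing models
  - For pitch shift in particular, TriniTTS achieves better performance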

Inference Speed
- In terms of inference speed, TriniTTS is faster than VITS in both CPU and GPU environments
- TriniTTS is also more efficient than VITS in terms of the number of parameters
