[Paper Review] DelightfulTTS2: End-to-End Speech Synthesis with Adversarial Vector-Quantized Auto-Encoders

feVeRin 2024. 7. 1. 09:59

DelightfulTTS2: End-to-End Speech Synthesis with Adversarial Vector-Quantized Auto-Encoders

Text-to-speech generally relies on a cascaded pipeline that uses the mel-spectrogram as its intermediate representation
BUT, the acoustic model and the vocoder are trained separately, and the pre-designed mel-spectrogram is a sub-optimal representation
DelightfulTTS2
- An end-to-end text-to-speech model built on an automatically learned speech representation and joint optimization
- Replaces the conventional mel-spectrogram with a codec network based on a vector-quantized auto-encoder as the intermediate representation
- Jointly optimizes the acoustic model and the vocoder by adding an auxiliary loss on the acoustic model
Paper (INTERSPEECH 2022) : Paper Link

1. Introduction

Most text-to-speech (TTS) models are built on a two-stage pipeline consisting of an acoustic model and a vocoder
- The phoneme/linguistic features of the input text are converted into an intermediate representation such as a mel-spectrogram, and the predicted representation is then converted into a waveform by the vocoder
- This two-stage TTS has the following problems
  1. The mel-spectrogram is extracted with a Fourier transform that discards phase information, so it is not an optimal representation for a cascaded model
  2. The vocoder is trained on ground-truth mel-spectrograms but receives mel-spectrograms predicted by the acoustic model at inference, and this training-inference mismatch degrades quality
- Solving these problems requires a fully end-to-end TTS model
- BUT, existing end-to-end TTS models have their own limitations
  1. Their speech quality is not significantly better than that of two-stage models such as FastSpeech2
  2. VITS, a representative end-to-end model, has a complicated training pipeline
  3. Like two-stage methods, they still depend on Fourier-transform representations such as mel-spectrograms and linear spectrograms

-> The paper therefore proposes DelightfulTTS2, an end-to-end model that uses an automatically learned frame-level speech representation instead of the mel-spectrogram

DelightfulTTS2
- Instead of a pre-designed feature such as the mel-spectrogram, builds a codec network based on a Vector-Quantized Generative Adversarial Network (VQ-GAN) that extracts an intermediate frame-level speech representation
  - The VQ-GAN encoder extracts the speech representation, a multi-stage vector quantizer quantizes it, and the decoder reconstructs the waveform
- Introduces an auxiliary loss on the acoustic model so that it predicts the intermediate speech representation extracted by the VQ-GAN encoder, and jointly optimizes the acoustic model and the vocoder

< Overview of DelightfulTTS2 >

An end-to-end TTS model that combines a VQ-GAN-based codec representation with joint optimization
As a result, it achieves better speech quality (MOS/CMOS) than existing two-stage TTS

2. Method

DelightfulTTS2 consists of two main components
1. Codec Network
  - Based on VQ-GAN, it encodes the raw waveform into frame-level feature embeddings with an encoder and a quantizer, and reconstructs the waveform from the encoded features with a decoder
2. Acoustic Model
  - Based on DelightfulTTS, it predicts the encoded features from the phoneme sequence, and the acoustic model and the codec are trained jointly for better speech quality

- Speech Representation Learning with VQ-GAN

To learn a better speech representation than the mel-spectrogram, the paper introduces a codec network that learns a frame-level speech representation through VQ-GAN
- Structurally, as shown in the accompanying figure, the codec network consists of a symmetric encoder-decoder network with skip-connections between the bottom/top layers, and a multi-stage vector quantizer that acts as the feature bottleneck
  1. The decoder is a HiFi-GAN generator that uses bidirectional Long Expressive Memory (LEM) layers in its upsampling stages to learn long-term sequential dependencies
  2. During training, skip-connections are added between the first 3 encoder blocks and their mirrored decoder blocks to stabilize convergence and joint training
  3. Multi-stage vector quantization is applied on top of the codec encoder and quantizes each frame of the encoded features over multiple stages
- For adversarial training, the paper adopts the multi-scale and multi-period discriminators of HiFi-GAN and MelGAN
  - Here, 3 discriminators with the same structure are applied to the input audio at different resolutions (original, $2\times$ downsampled, $4\times$ downsampled)
- In addition, the discrete wavelet transform replaces the average pooling used for downsampling in the discriminator, so that high-frequency components are reproduced accurately
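
The multi-stage quantization step described above can be sketched in a few lines. This is only an illustration under one assumption that the review does not state: each stage quantizes the residual left by the previous stage (as in residual vector quantization). The codebooks, the 2-D frame, and the function names are hypothetical.

```python
def nearest_code(vec, codebook):
    """Return (index, codeword) of the codebook entry closest to vec (squared L2)."""
    best = min(
        range(len(codebook)),
        key=lambda i: sum((v - c) ** 2 for v, c in zip(vec, codebook[i])),
    )
    return best, codebook[best]


def multi_stage_quantize(frame, codebooks):
    """Quantize one frame over several stages.

    Assumption: every stage encodes the residual of the previous stage,
    and the final quantized frame is the sum of the selected codewords.
    """
    residual = list(frame)
    quantized = [0.0] * len(frame)
    indices = []
    for codebook in codebooks:
        idx, code = nearest_code(residual, codebook)
        indices.append(idx)
        quantized = [q + c for q, c in zip(quantized, code)]
        residual = [r - c for r, c in zip(residual, code)]
    return indices, quantized


# Hypothetical codebooks: a coarse first stage and a finer second stage.
codebooks = [
    [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]],
    [[0.0, 0.0], [0.25, 0.0], [0.0, -0.25]],
]
frame = [1.2, 0.9]
indices, quantized = multi_stage_quantize(frame, codebooks)
print(indices, quantized)
```

Each additional stage refines the approximation of the frame, so the number of stages trades bitrate against reconstruction accuracy.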

- Acoustic Model based on DelightfulTTS

The acoustic model takes a phoneme sequence as input and predicts the quantized speech representation
- The network follows DelightfulTTS and consists of an encoder/decoder built from Conformer blocks and a variance adaptor that handles the one-to-many mapping
- The encoder converts the phoneme sequence into hidden representations, and the variance adaptor predicts the utterance-level acoustic condition, the phoneme-level acoustic condition, and the phoneme-level pitch/duration
- Finally, the decoder takes the variance information and the phoneme hidden representations as input and predicts the frame-level speech representation
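
To make the step from phoneme-level hidden states to a frame-level sequence concrete, here is a minimal sketch. It assumes a FastSpeech-style length regulator that repeats each phoneme's hidden vector according to its duration, which the review does not spell out; the pitch embeddings, dimensions, and function names are hypothetical.

```python
def add_variance(phoneme_hidden, variance_embeddings):
    """Add a phoneme-level variance embedding (e.g. pitch) to each hidden vector."""
    return [
        [h + v for h, v in zip(hidden, var)]
        for hidden, var in zip(phoneme_hidden, variance_embeddings)
    ]


def length_regulate(phoneme_hidden, durations):
    """Repeat each phoneme's hidden vector `duration` times to get frame-level input."""
    if len(phoneme_hidden) != len(durations):
        raise ValueError("one duration is required per phoneme")
    frames = []
    for hidden, dur in zip(phoneme_hidden, durations):
        frames.extend(list(hidden) for _ in range(dur))
    return frames


# Three phonemes with 2-D hidden states (toy values).
phoneme_hidden = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
pitch_embedding = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
durations = [2, 1, 3]  # frames per phoneme

conditioned = add_variance(phoneme_hidden, pitch_embedding)
frames = length_regulate(conditioned, durations)
print(len(frames))
```

The decoder would then map these frame-level vectors to the speech representation, one output per frame.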

- Joint Training of Acoustic Model and Vocoder

Conventional two-stage cascaded TTS models consist of an acoustic model and a separately trained vocoder
- This two-stage approach limits waveform reconstruction quality because of the training-inference feature mismatch and the pre-designed mel-spectrogram
  - DelightfulTTS2 therefore trains the acoustic model and the vocoder jointly, end-to-end, using a scheduled sampling mechanism on the acoustic model
- The acoustic model contains four variance information modules: a duration predictor, a pitch predictor, an utterance-level acoustic predictor, and a phoneme-level acoustic predictor
  1. The ground-truth pitch, utterance-level acoustic embedding, and phoneme-level acoustic embedding are extracted from the ground-truth mel-spectrogram and added to the phoneme hidden representations as decoder input
    - Always feeding ground-truth values during training can cause a mismatch between training and inference
  2. Therefore, instead of providing every ground-truth feature during training, a scheduled sampling mechanism is applied to the utterance-level and phoneme-level acoustic conditions
    - This narrows the training-inference gap and improves end-to-end performance
- Meanwhile, at the training stage, the acoustic model output is trained with an auxiliary $L1$ loss against the quantized speech representation, and a random segment of the predicted representation is used as the vocoder input
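
The two training-time mechanisms above, scheduled sampling and random segmentation, can be sketched as follows. The linear decay schedule, the hop size, and all names are assumptions made for illustration; the review does not say which schedule or segment length is used.

```python
import random


def teacher_prob_at(step, total_steps):
    """Probability of feeding the ground-truth condition.

    Hypothetical schedule: linear decay from 1.0 to 0.0 over training.
    """
    return max(0.0, 1.0 - step / total_steps)


def scheduled_sample(ground_truth, predicted, teacher_prob, rng):
    """Use the ground-truth condition with probability teacher_prob, else the prediction."""
    return ground_truth if rng.random() < teacher_prob else predicted


def random_segment(frames, waveform, segment_frames, hop_size, rng):
    """Cut a random window of frames and the waveform samples aligned with it."""
    start = rng.randint(0, len(frames) - segment_frames)
    frame_seg = frames[start:start + segment_frames]
    wave_seg = waveform[start * hop_size:(start + segment_frames) * hop_size]
    return frame_seg, wave_seg


rng = random.Random(0)

# Early in training the ground truth dominates, later the prediction does.
early = scheduled_sample("gt_condition", "predicted_condition", teacher_prob_at(0, 100), rng)
late = scheduled_sample("gt_condition", "predicted_condition", teacher_prob_at(100, 100), rng)

# Toy utterance: 10 frames, 4 waveform samples per frame (hypothetical hop size).
frames = list(range(10))
waveform = list(range(10 * 4))
frame_seg, wave_seg = random_segment(frames, waveform, segment_frames=3, hop_size=4, rng=rng)
print(early, late, frame_seg, len(wave_seg))
```

Only the sliced frames are passed to the vocoder, and the adversarial and spectrogram losses compare its output with the matching waveform slice.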

- Training Objectives

Discriminator Loss
- The adversarial objective for both the VQ-GAN and the end-to-end training follows HiFi-GAN
- To mitigate the loss of high-frequency content, downsampling uses the discrete wavelet transform instead of average pooling
  - This downsamples a non-stationary signal effectively into multiple frequency sub-bands
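
A toy comparison shows why average pooling loses high-frequency content while a wavelet decomposition keeps it. The sketch below uses a one-level Haar wavelet, which is an assumption: the review does not say which wavelet the discriminator uses.

```python
import math


def average_pool(signal):
    """2x downsampling by averaging neighbouring samples (a low-pass view only)."""
    return [(signal[i] + signal[i + 1]) / 2 for i in range(0, len(signal) - 1, 2)]


def haar_dwt(signal):
    """One-level Haar DWT: returns (low, high) sub-bands at half the input length."""
    s = math.sqrt(2.0)
    low = [(signal[i] + signal[i + 1]) / s for i in range(0, len(signal) - 1, 2)]
    high = [(signal[i] - signal[i + 1]) / s for i in range(0, len(signal) - 1, 2)]
    return low, high


def haar_idwt(low, high):
    """Inverse of haar_dwt: interleave the reconstructed sample pairs."""
    s = math.sqrt(2.0)
    out = []
    for lo, hi in zip(low, high):
        out.extend([(lo + hi) / s, (lo - hi) / s])
    return out


# A rapidly alternating signal, i.e. pure high-frequency content.
signal = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0]

pooled = average_pool(signal)      # the oscillation cancels out completely
low, high = haar_dwt(signal)       # the oscillation survives in the high band
restored = haar_idwt(low, high)    # both bands together recover the input
print(pooled, high)
```

Because the two sub-bands together are invertible, a discriminator that sees both bands at the lower resolution still has access to the high-frequency detail that average pooling would discard.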
Codec Decoder Loss
- The codec decoder, which serves as the vocoder in end-to-end training, is trained with the multi-resolution spectrogram loss $\mathcal{L}_{mrs}$, the adversarial loss $\mathcal{L}_{Adv}$, and the feature matching loss $\mathcal{L}_{fm}$, together with the vector-quantization loss $\mathcal{L}_{vq}$:
  (Eq. 1) $\mathcal{L}_{G} = \mathcal{L}_{Adv} + \mathcal{L}_{vq} + \mathcal{L}_{fm} + \mathcal{L}_{mrs}$
  - $\mathcal{L}_{vq}$ : vector-quantization loss over all vector quantizers
- Optimizing these terms jointly with the adversarial loss yields realistic results
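
As a rough illustration of a multi-resolution spectrogram loss, the sketch below averages an L1 distance between magnitude spectrograms computed at several STFT resolutions. The exact distance, window, and resolutions used in the paper are not given in this review, so the Hann window, the plain L1 on magnitudes, and the `(fft_size, hop_size)` pairs are all assumptions. A naive DFT keeps the example dependency-free.

```python
import cmath
import math


def magnitude_spectrogram(signal, fft_size, hop_size):
    """Magnitude STFT with a Hann window, computed with a naive DFT."""
    frames = []
    for start in range(0, len(signal) - fft_size + 1, hop_size):
        windowed = [
            signal[start + n] * (0.5 - 0.5 * math.cos(2 * math.pi * n / fft_size))
            for n in range(fft_size)
        ]
        spectrum = []
        for k in range(fft_size // 2 + 1):
            acc = sum(
                windowed[n] * cmath.exp(-2j * math.pi * k * n / fft_size)
                for n in range(fft_size)
            )
            spectrum.append(abs(acc))
        frames.append(spectrum)
    return frames


def spectrogram_l1(real, fake, fft_size, hop_size):
    """Mean absolute difference between two magnitude spectrograms."""
    spec_real = magnitude_spectrogram(real, fft_size, hop_size)
    spec_fake = magnitude_spectrogram(fake, fft_size, hop_size)
    total = sum(
        abs(x - y)
        for frame_real, frame_fake in zip(spec_real, spec_fake)
        for x, y in zip(frame_real, frame_fake)
    )
    count = sum(len(frame) for frame in spec_real)
    return total / count


def multi_resolution_spectrogram_loss(real, fake, resolutions):
    """Average the spectrogram distance over several (fft_size, hop_size) settings."""
    return sum(spectrogram_l1(real, fake, n, h) for n, h in resolutions) / len(resolutions)


resolutions = [(16, 4), (32, 8), (64, 16)]  # hypothetical settings
real = [math.sin(2 * math.pi * 0.05 * t) for t in range(128)]
fake = [0.8 * x for x in real]  # same shape, wrong amplitude

loss_same = multi_resolution_spectrogram_loss(real, real, resolutions)
loss_diff = multi_resolution_spectrogram_loss(real, fake, resolutions)
print(loss_same, loss_diff)
```

Using several window sizes lets the loss penalize errors at both fine time resolution and fine frequency resolution, which a single STFT setting cannot do at once.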
Acoustic Model Loss
- The acoustic model loss combines the phoneme-level pitch and duration losses and the utterance-level/phoneme-level acoustic condition losses with two losses on the predicted speech representation:
  (Eq. 2) $\mathcal{L}_{AM} = \mathcal{L}_{pitch} + \mathcal{L}_{dur} + \mathcal{L}_{utt} + \mathcal{L}_{phone} + \mathcal{L}_{ssim} + \mathcal{L}_{feat}$
  - $\mathcal{L}_{utt}, \mathcal{L}_{phone}$ : $L1$ loss between the predicted utterance-level/phoneme-level acoustic condition vectors and the vectors extracted by the reference encoder
  - $\mathcal{L}_{pitch}, \mathcal{L}_{dur}$ : $L1$ loss between the predicted pitch/duration and the ground truth
  - $\mathcal{L}_{ssim}$ : structural similarity (SSIM) loss between the ground-truth quantized speech representation from the codec encoder and the representation predicted by the acoustic model
  - $\mathcal{L}_{feat}$ : $L1$ loss between the predicted speech representation and the quantized representation
- Finally, the joint training loss of DelightfulTTS2 combines the acoustic model loss and the audio codec decoder loss:
  (Eq. 3) $\mathcal{L}_{joint} = W_{G} * \mathcal{L}_{G} + W_{AM} * \mathcal{L}_{AM}$
  - $W_{G}, W_{AM}$ : loss weights
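
How Eq. 1 to Eq. 3 fit together can be shown with a small numeric sketch. Every loss value and both weights below are made-up placeholders, and only $\mathcal{L}_{feat}$ is actually computed (as a mean $L1$ distance); the point is just the bookkeeping of the sums.

```python
def l1_loss(pred, target):
    """Mean absolute error between two equally long sequences."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


def joint_loss(generator_terms, acoustic_terms, w_g, w_am):
    """Eq. 3: weighted sum of the codec decoder loss (Eq. 1) and the acoustic model loss (Eq. 2)."""
    loss_g = sum(generator_terms.values())
    loss_am = sum(acoustic_terms.values())
    return w_g * loss_g + w_am * loss_am, loss_g, loss_am


# Toy representations for L_feat (hypothetical values).
predicted_repr = [0.5, 1.0, -0.5, 0.0]
quantized_repr = [0.0, 1.0, -1.0, 0.5]

# Placeholder values for the remaining terms.
generator_terms = {"adv": 0.5, "vq": 0.25, "fm": 1.0, "mrs": 2.0}
acoustic_terms = {
    "pitch": 0.5,
    "dur": 0.25,
    "utt": 0.125,
    "phone": 0.125,
    "ssim": 0.5,
    "feat": l1_loss(predicted_repr, quantized_repr),
}

total, loss_g, loss_am = joint_loss(generator_terms, acoustic_terms, w_g=1.0, w_am=2.0)
print(loss_g, loss_am, total)  # 3.75 1.875 7.5
```

Because a single scalar is backpropagated, gradients from the codec decoder terms also reach the acoustic model through the predicted representation, which is what makes the training joint.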

3. Experiments

- Settings

Dataset : English Speech Dataset (internal)
Comparisons : FastSpeech2, DelightfulTTS

- Results

Speech Quality
- In terms of MOS, DelightfulTTS2 achieves the best result among the compared models

In terms of CMOS, DelightfulTTS2 is likewise preferred

Analysis on Codec Network
- The codec's reconstruction is close to the ground truth, with a CMOS gap of only -0.03

Comparing different bitrates, speech quality degrades as the bitrate decreases

On the other hand, the number of speech frames decreases along with the bitrate, which can improve runtime inference speed
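
The link between bitrate and frame count can be worked through with simple arithmetic. The sketch assumes the bitrate is lowered by increasing the hop size (fewer frames per second) and that each frame costs `num_stages * log2(codebook_size)` bits; the sample rate, hop sizes, stage count, and codebook size are hypothetical and not taken from the paper.

```python
import math


def codec_bitrate(sample_rate, hop_size, num_stages, codebook_size):
    """Bits per second = frames per second * bits per frame."""
    frame_rate = sample_rate / hop_size
    bits_per_frame = num_stages * math.log2(codebook_size)
    return frame_rate * bits_per_frame


def num_frames(num_samples, hop_size):
    """Number of frames the acoustic model must predict for an utterance."""
    return num_samples // hop_size


sample_rate = 24000            # hypothetical
utterance = 5 * sample_rate    # a 5-second utterance

high = codec_bitrate(sample_rate, hop_size=300, num_stages=8, codebook_size=1024)
low = codec_bitrate(sample_rate, hop_size=600, num_stages=8, codebook_size=1024)

print(high, num_frames(utterance, 300))  # 6400.0 400
print(low, num_frames(utterance, 600))   # 3200.0 200
```

Under these assumptions, halving the bitrate also halves the number of frames the acoustic model and decoder have to process, which is where the inference speed-up would come from.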
