[Paper Review] AILTTS: Adversarial Learning of Intermediate Acoustic Feature for End-to-End Lightweight Text-to-Speech

feVeRin 2024. 7. 29. 09:13

AILTTS: Adversarial Learning of Intermediate Acoustic Feature for End-to-End Lightweight Text-to-Speech

Conventional text-to-speech models learn their intermediate latent representation from a pre-defined feature such as the mel-spectrogram, which limits synthesis quality
AILTTS
- Adds a prosody embedding to the latent representation to improve synthesis quality
- During training, a reference prosody embedding is extracted from the mel-spectrogram; at inference, that embedding is estimated from text using a Generative Adversarial Network
Paper (INTERSPEECH 2023) : Paper Link

1. Introduction

Text-to-Speech (TTS) models can be broadly divided into autoregressive (AR) and non-autoregressive (non-AR) approaches
- AR models can synthesize high-quality speech, but they are hard to run in parallel, so inference is slow
- Non-AR models synthesize in parallel and therefore offer fast inference
  1. A representative example, FastSpeech2, learns acoustic information through an intermediate feature such as the mel-spectrogram
  2. However, a mel-spectrogram cannot carry enough speech variance information, which limits synthesis quality

-> The paper therefore proposes AILTTS, a lightweight TTS model that supplies the variance information needed for the text-to-waveform mapping

AILTTS
- Introduces a prosody encoder to extract prosody-related acoustic features that represent speech variance
  - The prosody encoder output serves as the reference prosody embedding that conditions the text-to-waveform conversion process
- At inference, a prosody predictor estimates the reference prosody embedding from the text input
  - A Generative Adversarial Network (GAN) is applied to the prosody predictor to improve its estimation power

< Overview of AILTTS >

Conditioning on a prosody-related acoustic embedding lets a single-stage TTS model reflect speech variance effectively
Adversarial training is introduced so that the reference prosody embedding can be estimated effectively from text
As a result, the prosody embedding yields fast convergence and high synthesis quality

2. Method

- Overview

AILTTS builds on LiteTTS as its baseline and consists of a phoneme encoder, a prosody encoder (posterior), a prosody predictor (prior), an internal aligner, an auxiliary predictor, and a vocoder
- During training, attention is applied with the prosody encoder output as key and value and the phoneme encoder output as query, yielding the phoneme-scale prosody embedding $h_{pr}$
- The internal aligner then time-aligns the joint embedding $h_{ph} + h_{pr}$ to the mel-spectrogram
- The resulting aligned embedding $I$ is used as the condition, and the vocoder generates the waveform from it
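
As a rough illustration of the training-time flow above, the sketch below computes a phoneme-scale prosody embedding with single-head scaled dot-product attention (phoneme vectors as queries, frame-level prosody vectors as keys and values) and then forms the joint embedding $h_{ph} + h_{pr}$. The function names, the single-head form without learned projections, and the toy inputs are all assumptions made for illustration; the attention module in the paper may be configured differently.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def phoneme_scale_prosody(h_ph, prosody_frames):
    """Attend from each phoneme (query) over frame-level prosody vectors.

    h_ph:            N phoneme vectors of dimension d (queries)
    prosody_frames:  T frame vectors of dimension d (used as keys and values)
    Returns N vectors of dimension d, one prosody vector per phoneme.
    """
    d = len(h_ph[0])
    h_pr = []
    for q in h_ph:
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
            for k in prosody_frames
        ]
        weights = softmax(scores)
        h_pr.append(
            [sum(w * v[j] for w, v in zip(weights, prosody_frames)) for j in range(d)]
        )
    return h_pr


def joint_embedding(h_ph, h_pr):
    """Element-wise sum h_ph + h_pr, one vector per phoneme."""
    return [[a + b for a, b in zip(p, r)] for p, r in zip(h_ph, h_pr)]


# Toy inputs (hypothetical): 2 phonemes, 3 prosody frames, dimension 2.
h_ph = [[1.0, 0.0], [0.0, 1.0]]
frames = [[2.0, 0.0], [2.0, 0.0], [0.0, 2.0]]
h_pr = phoneme_scale_prosody(h_ph, frames)
joint = joint_embedding(h_ph, h_pr)
```

The output has one row per phoneme regardless of the number of frames, which is what makes the result "phoneme-scale" and lets it be added directly to $h_{ph}$.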

- Prosody Predictor with Conditional Discriminator

The prosody predictor aims to predict the target prosody embedding $h_{pr}$ from the input phonetic embedding $h_{ph}$
- The paper uses a discriminator to capture the dynamic nature of the prosody embedding
  1. With the prosody predictor as the generator, the discriminator distinguishes the target prosody embedding $h_{pr}$ from the predicted embedding $\tilde{h}_{pr}$, conditioned on phonetic information
    - A projection-based conditional discriminator conditioned on the phonetic embedding $h_{ph}$ is used
  2. A feature matching loss between the feature maps of the generated and target embeddings stabilizes GAN-based training
    - Here, the feature maps are defined as the outputs of every 1D convolution layer before the PostConv1D layer
    - A total of 7 feature maps are extracted: one from the PreConv1D layer and six from the remaining 1D convolution blocks
- The paper applies 2 additional tricks when designing the discriminator
  1. First, the discriminator distinguishes the two prosody embeddings in the aligned domain rather than the phoneme domain
    - This is because $I$, which is aligned to the timing information of the mel-spectrogram, is what the vocoder takes as input
    - The durations estimated by the internal aligner are used to align the time scale of the phoneme-wise embeddings to that of the mel-spectrogram
  2. Second, because of GPU memory constraints, the discriminator is built to have the same receptive field as the vocoder
- Finally, based on the Least-Squares GAN loss, the prosody predictor loss $\mathcal{L}_{G}$, which includes the reconstruction loss $\mathcal{L}_{recon}$ and the feature matching loss $\mathcal{L}_{fm}$, and the discriminator loss $\mathcal{L}_{D}$ are defined as:
  (Eq. 1) $\mathcal{L}_{G} = \mathbb{E}_{(\tilde{H}_{pr}, H_{ph})}\left[(D(\tilde{H}_{pr}, H_{ph}) - 1)^{2}\right] + \mathcal{L}_{recon} + \mathcal{L}_{fm}$
  (Eq. 2) $\mathcal{L}_{D} = \mathbb{E}_{(H_{pr}, \tilde{H}_{pr}, H_{ph})}\left[(D(H_{pr}, H_{ph}) - 1)^{2} + (D(\tilde{H}_{pr}, H_{ph}))^{2}\right]$
  - $H_{(\cdot)}$ : embedding mapped to the time scale of the mel-spectrogram
  - $F^{i}$ : the $i$-th feature map of the discriminator
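
To make the least-squares objectives concrete, here is a minimal sketch that evaluates the adversarial terms of Eq. 1 and Eq. 2 on toy discriminator scores, together with a feature matching term. The exact form of the feature matching loss is not given in this review, so the mean absolute difference per feature map used below is an assumption (a common choice in GAN-based speech models), and all function names and numbers are hypothetical.

```python
def lsgan_discriminator_loss(real_scores, fake_scores):
    """Eq. 2: push D(real, cond) toward 1 and D(fake, cond) toward 0."""
    real_term = sum((s - 1.0) ** 2 for s in real_scores) / len(real_scores)
    fake_term = sum(s ** 2 for s in fake_scores) / len(fake_scores)
    return real_term + fake_term


def lsgan_generator_adv_loss(fake_scores):
    """Adversarial term of Eq. 1: push D(fake, cond) toward 1."""
    return sum((s - 1.0) ** 2 for s in fake_scores) / len(fake_scores)


def feature_matching_loss(real_maps, fake_maps):
    """ASSUMED form: mean absolute difference per map, averaged over maps."""
    per_map = []
    for real, fake in zip(real_maps, fake_maps):
        per_map.append(sum(abs(r - f) for r, f in zip(real, fake)) / len(real))
    return sum(per_map) / len(per_map)


# Toy discriminator outputs and flattened feature maps (hypothetical numbers).
real_scores = [0.9, 1.1]
fake_scores = [0.2, -0.2]
d_loss = lsgan_discriminator_loss(real_scores, fake_scores)
g_adv = lsgan_generator_adv_loss(fake_scores)
fm = feature_matching_loss(
    [[1.0, 2.0], [0.0, 0.0]],  # feature maps for the target embedding
    [[1.5, 2.0], [0.0, 1.0]],  # feature maps for the predicted embedding
)
```

In the setup described above there would be 7 feature maps rather than 2, and the scores would come from the conditional discriminator evaluated on time-aligned embeddings.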

Conditional Discriminator, Auxiliary Predictor

- Prosody-Conditioned Internal Aligner

To learn the time alignment between phonemes and the mel-spectrogram without an external aligner, a likelihood-based internal aligner is adopted
- It maximizes the likelihood of a monotonic alignment over a probability matrix computed from the $L_2$ distance between the encoded features
  1. The most probable path is then selected from the probability matrix to obtain the phoneme durations (a binary matrix)
  2. The gap between the two matrices is reduced by minimizing their KL-divergence
- The AILTTS aligner uses the joint embedding $h_{ph} + h_{pr}$ as the phonetic feature and the mel-spectrogram as the acoustic feature
  - Because $h_{pr}$ contains local acoustic information mapped to the phoneme level by the attention module, the alignment is easier to learn than with $h_{ph}$ alone
- Following FastSpeech2, a duration predictor that takes the phonetic embedding $h_{ph}$ with stop-gradient applied is trained jointly
  - As a result, the aligner improves alignment accuracy, which lets the duration predictor estimate durations accurately
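
The "most probable path" step can be sketched as a Viterbi-style dynamic program over a phoneme-by-frame score matrix. The sketch below uses the negative squared $L_2$ distance as an unnormalised log-probability, which is an assumption, and it covers only the hard path selection and the resulting durations, not the likelihood maximization or the KL term. The names `monotonic_alignment` and `length_regulate` are hypothetical.

```python
def neg_sq_l2(a, b):
    """Negative squared L2 distance, used here as an unnormalised log-probability."""
    return -sum((x - y) ** 2 for x, y in zip(a, b))


def monotonic_alignment(phonemes, frames):
    """Return per-phoneme durations of the best monotonic path.

    The path starts at (phoneme 0, frame 0), ends at (last phoneme, last frame),
    and at every frame either stays on the current phoneme or moves to the next.
    """
    n, t_len = len(phonemes), len(frames)
    if t_len < n:
        raise ValueError("need at least one frame per phoneme")
    neg_inf = float("-inf")
    score = [[neg_sq_l2(p, f) for f in frames] for p in phonemes]
    best = [[neg_inf] * t_len for _ in range(n)]
    best[0][0] = score[0][0]
    for t in range(1, t_len):
        for i in range(min(n, t + 1)):
            stay = best[i][t - 1]
            move = best[i - 1][t - 1] if i > 0 else neg_inf
            best[i][t] = max(stay, move) + score[i][t]
    # Backtrack from the last phoneme at the last frame, counting frames.
    durations = [0] * n
    i = n - 1
    for t in range(t_len - 1, -1, -1):
        durations[i] += 1
        if t > 0 and i > 0 and best[i - 1][t - 1] >= best[i][t - 1]:
            i -= 1
    return durations


def length_regulate(embeddings, durations):
    """Repeat each phoneme-level vector for its number of frames."""
    return [vec for vec, d in zip(embeddings, durations) for _ in range(d)]


# Toy 1-D features (hypothetical): 3 phonemes, 5 frames.
phonemes = [[0.0], [1.0], [2.0]]
frames = [[0.0], [0.1], [1.0], [2.0], [2.1]]
durations = monotonic_alignment(phonemes, frames)
aligned = length_regulate(phonemes, durations)
```

The durations always sum to the number of frames, so expanding phoneme-level vectors with them produces a sequence on the mel-spectrogram time scale, which is the role the aligned embedding $I$ plays for the vocoder.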

- Final Training Loss

The overall training loss is:
(Eq. 3) $\mathcal{L}_{total} = \mathcal{L}_{var} + \mathcal{L}_{align} + \mathcal{L}_{pred} + \mathcal{L}_{voc} + \mathcal{L}_{aux}$
- $\mathcal{L}_{var}$ : pitch/energy prediction loss applied to the prosody encoder output
- $\mathcal{L}_{align}$ : loss for the internal aligner, including the duration predictor
- $\mathcal{L}_{pred}, \mathcal{L}_{voc}$ : losses for the prosody predictor and the vocoder, respectively
- $\mathcal{L}_{aux}$ : $L_1$ loss between the target mel-spectrogram and the predicted mel-spectrogram output by the auxiliary predictor
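
As a small worked example of Eq. 3, the sketch below computes $\mathcal{L}_{aux}$ as a mean absolute error between two toy mel-spectrograms and sums it with placeholder values for the other four terms. The unweighted sum follows Eq. 3 as written here; the averaging over bins, the function names, and every number are assumptions for illustration.

```python
def l1_loss(target_mel, predicted_mel):
    """Mean absolute error over all time-frequency bins."""
    total, count = 0.0, 0
    for t_frame, p_frame in zip(target_mel, predicted_mel):
        for t_bin, p_bin in zip(t_frame, p_frame):
            total += abs(t_bin - p_bin)
            count += 1
    return total / count


def total_loss(l_var, l_align, l_pred, l_voc, l_aux):
    """Eq. 3: unweighted sum of the five loss terms."""
    return l_var + l_align + l_pred + l_voc + l_aux


# Toy mel-spectrograms (hypothetical): 2 frames x 2 mel bins.
target = [[0.0, 1.0], [2.0, 3.0]]
predicted = [[0.5, 1.0], [2.0, 2.0]]
l_aux = l1_loss(target, predicted)

# Placeholder values for the other terms, not taken from the paper.
loss = total_loss(l_var=0.1, l_align=0.2, l_pred=0.3, l_voc=0.4, l_aux=l_aux)
```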
Auxiliary Predictor
- To supply additional acoustic information to the aligned embedding $I$, an auxiliary predictor that takes $I$ as input is introduced
- Its architecture:
  1. The number of output channels of the PostConv1D layer is set to the mel-spectrogram dimension
  2. Layer normalization is applied at the last stage of every residual 1D convolutional block
- Its receptive field equals that of the vocoder, so the auxiliary predictor can efficiently supply acoustic information to the vocoder input $I$
  - Because the auxiliary predictor is used only in the training stage, the parameter count and complexity do not increase
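
The receptive-field matching mentioned above can be checked with the standard formula for stacked stride-1 1D convolutions, where each layer adds (kernel size - 1) x dilation frames. The layer configurations in this sketch are hypothetical and not taken from the paper; only the formula itself is standard.

```python
def receptive_field(layers):
    """Receptive field (in frames) of stacked stride-1 1D convolutions.

    layers: iterable of (kernel_size, dilation) pairs, applied in sequence.
    """
    rf = 1
    for kernel_size, dilation in layers:
        rf += (kernel_size - 1) * dilation
    return rf


# Hypothetical layer stacks, chosen only to illustrate the comparison.
aux_predictor_layers = [(3, 1), (3, 3), (3, 9)]
vocoder_layers = [(3, 1), (3, 3), (3, 9)]

aux_rf = receptive_field(aux_predictor_layers)
voc_rf = receptive_field(vocoder_layers)
same_context = aux_rf == voc_rf
```

When the two values match, each predicted mel frame depends on the same span of $I$ that the vocoder sees for the corresponding output, which is one way to read the design choice described above.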

3. Experiments

- Settings

Dataset : LJSpeech
Comparisons : LiteTTS, Glow-TTS, Tacotron2

- Results

Overall, AILTTS achieves the highest MOS while keeping a small parameter count and fast inference speed on par with LiteTTS

In the ablation study, removing the discriminator or the aligner degrades performance

In particular, using $h_{ph} + h_{pr}$ as the aligner input leads to faster convergence than the baseline during the early stage of training
