[Paper Review] MultiVerse: Efficient and Expressive Zero-Shot Multi-Task Text-to-Speech

feVeRin 2024. 10. 19. 11:04

MultiVerse: Efficient and Expressive Zero-Shot Multi-Task Text-to-Speech

Zero-shot Text-to-Speech requires a large amount of training data, which raises cost compared to conventional TTS
MultiVerse
- A multi-task model that performs Text-to-Speech and style transfer in zero-shot settings while using less training data than existing data-driven methods
- Uses source-filter theory-based disentanglement and introduces prompts for modeling the filter-related and source-related representations
- Combines prompt-based autoregressive and non-autoregressive methods to improve prosody similarity
Paper (EMNLP 2024) : Paper Link

1. Introduction

Text-to-Speech (TTS) is being extended to a variety of tasks
- Representative examples:
  1. Zero-shot TTS synthesizes speech for unseen speakers
  2. Cross-lingual TTS, as in SANE-TTS, synthesizes an unseen language for a monolingual target speaker
  3. Style transfer transfers the style of a reference to a target speaker
- In particular, extending TTS tasks under zero-shot conditions requires generalization over speech components such as content, style, and speaker identity
- Disentangled representations can capture interpretable, controllable features, and learning each individual component can improve generalization
  - BUT, the entangled nature of these components makes it hard to obtain general characteristics
- Meanwhile, data-driven methods can learn generalized acoustic components from large-scale speech datasets
  1. VALL-E and VoiceBox support robust synthesis for unseen data based on massive datasets, while Mega-TTS2 and NaturalSpeech use disentangled modeling that learns acoustic components separately
  2. Here, disentangled modeling encapsulates interpretable elements independently, which can ensure generalization of each component
    - BUT, the decoder has to learn the relationships among the disassembled elements, which requires a considerable amount of training data, and even with large-scale datasets prosody similarity remains limited in zero-shot scenarios

-> Therefore, the paper proposes MultiVerse, which performs expressive TTS under zero-shot and cross-lingual conditions with only a small amount of training data

MultiVerse
- Improves training efficiency through source-filter theory-based decomposed modeling
  1. Specifically, speech generation is decomposed into filter-related and source-related representation generation, and prompt speech is used to model each representation
  2. The two representations produce features whose distribution resembles a mel-spectrogram, so the decoder can learn the interdependent relationship between the representations even from a small amount of data
- Additionally combines Autoregressive (AR) and Non-Autoregressive (Non-AR) modeling for expressive modeling

< Overview of MultiVerse >

A zero-shot multi-task TTS model that combines source-filter theory-based decomposition with AR/Non-AR modeling
As a result, it outperforms existing models on a range of tasks including zero-shot TTS, cross-lingual TTS, and style transfer

2. Method

MultiVerse models speech by disassembling it into two components, filter and source, and consists of the following three modules:
1. Source-Filter Theory-based Acoustic Model
  - Generates a mel-spectrogram given text and a speech prompt
2. AR Prosody Predictor
  - Predicts prosody-related acoustic features (duration, pitch, energy) from the input conditions
3. Discriminator
  - Used for adversarial training

- Source-Filter Theory Based Decomposed Modeling

The acoustic model of MultiVerse follows source-filter theory and generates a mel-spectrogram through a filter generator and a source generator
- The filter generator produces a vocal tract filter-related representation, and the source generator outputs a source-related representation
- Structurally, both representations are obtained with feed-forward transformer-based generators as in FastSpeech
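
For reference, the classical source-filter model behind this split can be written as below. This is the textbook formulation rather than an equation taken from the paper, and the symbols $S$, $E$, $H$ are introduced here only for illustration.

$$
S(\omega) = E(\omega)\,H(\omega)
\quad\Longrightarrow\quad
\log|S(\omega)| = \log|E(\omega)| + \log|H(\omega)|
$$

Here $E(\omega)$ is the spectrum of the excitation (sound source), $H(\omega)$ is the frequency response of the vocal tract filter, and $S(\omega)$ is the resulting speech spectrum. In the log-magnitude domain the interaction is additive, which gives some intuition for why a source-related and a filter-related representation can be fused into something close to a (log-)mel-spectrogram.
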
Filter Representation
- The paper treats the filter representation as a hidden state that contains speech content, pronunciation, and speaker identity-related information but is less dependent on prosody
- This representation is produced by the filter generator, which takes the phoneme representation together with the energy embedding as input
  - All input features are upsampled with Gaussian upsampling
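
Gaussian upsampling can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name `gaussian_upsample`, the fixed `sigma`, and the choice of frame midpoints `t + 0.5` are assumptions of this sketch (in practice the width is often predicted per phoneme). Each output frame is a weighted average of all phoneme vectors, with weights given by a Gaussian centered on each phoneme's position on the frame axis.

```python
import math

def gaussian_upsample(phoneme_vecs, durations, sigma=1.0):
    """Upsample phoneme-level vectors to frame level with Gaussian weights."""
    # Center of each phoneme on the frame axis: cumulative end minus half its duration.
    centers, end = [], 0.0
    for d in durations:
        end += d
        centers.append(end - d / 2.0)
    num_frames = int(round(end))
    dim = len(phoneme_vecs[0])
    frames = []
    for t in range(num_frames):
        pos = t + 0.5  # frame midpoint (assumption of this sketch)
        logits = [-((pos - c) ** 2) / (2.0 * sigma ** 2) for c in centers]
        top = max(logits)
        weights = [math.exp(l - top) for l in logits]
        total = sum(weights)
        weights = [w / total for w in weights]  # normalize over phonemes
        frames.append([
            sum(w * vec[k] for w, vec in zip(weights, phoneme_vecs))
            for k in range(dim)
        ])
    return frames

# Two phonemes lasting 2 and 3 frames give 5 output frames.
frames = gaussian_upsample([[1.0, 0.0], [0.0, 1.0]], durations=[2, 3], sigma=0.5)
```

Unlike plain repetition of each phoneme vector, the weights change smoothly near phoneme boundaries, so the upsampled sequence has no hard jumps.
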
Source Representation
- The source representation is a hidden state that contains prosodic information such as intonation, rhythm, and stress, and has low dependency on content
- The source generator produces the source representation from phoneme-level pitch/energy embeddings upsampled frame-wise
  1. During training, the source generator produces the source representation from ground-truth acoustic feature embeddings,
  2. while at inference it uses predicted acoustic feature embeddings
- MultiVerse then adopts prompt-based modulation to model the filter/source representations
  - Because both the vocal tract filter and the sound source are affected by speaker characteristics, the prompt speech is reflected in both generators
- Specifically:
  1. The mel-spectrogram of the prompt speech is fed to the speech prompt encoder to obtain hidden states,
  2. The parameters of the FiLM layers are predicted from the cross-attention output between the generator input and the prompt encoder output, as in NaturalSpeech
  3. The FiLM layers then modulate the representations inside the generator
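
The prompt-based modulation steps above can be sketched in plain Python. Everything named here (`cross_attention`, `film_modulate`, the projection matrices `w_gamma` and `w_beta`, single-head attention, and the residual form `(1 + gamma) * h + beta`) is a hypothetical simplification for illustration, not the paper's architecture.

```python
import math
import random

def softmax(xs):
    top = max(xs)
    exps = [math.exp(x - top) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def linear(vec, weight):
    """Project a vector with a weight matrix stored as d_in rows of d_out values."""
    return [sum(v * row[j] for v, row in zip(vec, weight))
            for j in range(len(weight[0]))]

def cross_attention(queries, prompt_states):
    """Single-head dot-product attention; prompt states act as keys and values."""
    scale = 1.0 / math.sqrt(len(queries[0]))
    out = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in prompt_states]
        weights = softmax(scores)
        out.append([sum(w * s[j] for w, s in zip(weights, prompt_states))
                    for j in range(len(prompt_states[0]))])
    return out

def film_modulate(hidden, prompt_states, w_gamma, w_beta):
    """Predict FiLM parameters per frame from attention context, then modulate."""
    context = cross_attention(hidden, prompt_states)
    out = []
    for h, c in zip(hidden, context):
        gamma = linear(c, w_gamma)  # per-channel scale offset
        beta = linear(c, w_beta)    # per-channel shift
        # Residual form (assumption): zero projections leave h unchanged.
        out.append([(1.0 + g) * x + b for g, x, b in zip(gamma, h, beta)])
    return out

random.seed(0)
dim = 4
hidden = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(6)]   # generator states
prompt = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(9)]   # prompt encoder output
zeros = [[0.0] * dim for _ in range(dim)]
w_g = [[random.gauss(0, 0.1) for _ in range(dim)] for _ in range(dim)]
w_b = [[random.gauss(0, 0.1) for _ in range(dim)] for _ in range(dim)]
identity_out = film_modulate(hidden, prompt, zeros, zeros)
modulated = film_modulate(hidden, prompt, w_g, w_b)
```

Because the scale and shift are computed per frame from the attended prompt states, the same generator weights can produce different outputs for different prompts.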

- Increasing Filter Capacity of the Acoustic Decoder

The intermediate representation obtained by combining the two representations above is a coarse mel-representation, resembling the interaction between the sound source and the vocal tract filter
- Because this fusion follows the frequency response of the source-filter model, the coarse mel-representation is closely related to the high-dimensional features of speech
  - Therefore the acoustic decoder of MultiVerse can effectively learn the interdependent relation between the filter and source representations
- To generate a mel-spectrogram while preserving the various information in the coarse mel-representation, such as speech content and style, the filter capacity of the acoustic decoder has to be increased
  1. To increase filter capacity, the paper replaces the transformer blocks of the acoustic decoder with sample-adaptive kernel selection-based convolution layers
    - The goal is to find convolution filters suited to the speech prompt
  2. Specifically, the filter of each convolution layer is obtained as a weighted sum of learnable filters, using weights predicted from the global style embedding, and the aggregated filter is then modulated/de-modulated
    - The global style embedding is obtained from a pre-trained speaker encoder
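
A rough sketch of how such a sample-adaptive filter could be assembled is given below. The post does not spell out the exact modulation, so this assumes StyleGAN2-style weight modulation and demodulation; `adaptive_filter`, `w_select`, and `w_scale` are hypothetical names, and the actual layer in the paper may differ.

```python
import math

def softmax(xs):
    top = max(xs)
    exps = [math.exp(x - top) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def adaptive_filter(kernels, style, w_select, w_scale, eps=1e-8):
    """Build one sample-specific conv filter from K candidate kernels.

    kernels:  K candidates, each indexed [out_ch][in_ch][tap]
    style:    global style embedding
    w_select: K rows, one selection logit per candidate (hypothetical projection)
    w_scale:  in_ch rows, one modulation scale per input channel (hypothetical)
    """
    # 1) Kernel selection: softmax weights predicted from the style embedding.
    alpha = softmax([dot(style, row) for row in w_select])
    n_out = len(kernels[0])
    n_in = len(kernels[0][0])
    n_tap = len(kernels[0][0][0])
    # 2) Modulation scales, one per input channel.
    scale = [1.0 + dot(style, row) for row in w_scale]
    result = []
    for o in range(n_out):
        # Weighted sum of the candidates, followed by modulation.
        taps = [[scale[i] * sum(a * k[o][i][t] for a, k in zip(alpha, kernels))
                 for t in range(n_tap)] for i in range(n_in)]
        # 3) Demodulation: rescale each output channel's filter to unit L2 norm.
        norm = math.sqrt(sum(w * w for row in taps for w in row) + eps)
        result.append([[w / norm for w in row] for row in taps])
    return alpha, result

kernels = [
    [[[1.0, 0.0, -1.0]], [[0.5, 0.5, 0.5]]],  # candidate 0: 2 out channels, 1 in channel, 3 taps
    [[[0.0, 2.0, 0.0]], [[1.0, -1.0, 1.0]]],  # candidate 1
]
style = [0.3, -0.2]
w_select = [[1.0, 0.0], [0.0, 1.0]]
w_scale = [[0.5, 0.5]]
alpha, filt = adaptive_filter(kernels, style, w_select, w_scale)
```

The resulting `filt` would then be used as the weight of an ordinary 1D convolution for that sample, so each utterance is processed with a filter chosen for its speaker style.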

- Two-Stage Prosody Modeling

Prosody modeling in MultiVerse consists of two stages:
1. First, the prosody predictor models the acoustic features autoregressively
2. Then the source generator uses these acoustic features to model prosody in the latent space non-autoregressively
Autoregressive Prosody Modeling
- The AR prosody predictor models the time-varying distribution of the acoustic features (duration, pitch, energy) as a conditional language modeling task
  - The AR prosody predictor aims to predict acoustic units that fit the given text and prompt conditions
  - An AR approach is adopted because the time-dependent nature and large variation of prosody have to be modeled
- With $\mathbf{d},\mathbf{p},\mathbf{e}$ denoting the duration, pitch, and energy unit sequences, the prosody predictor is trained to predict the acoustic units $\mathbf{c}_{t}=\{\mathbf{d}_{t},\mathbf{p}_{t},\mathbf{e}_{t}\}$ corresponding to the phoneme sequence $\mathbf{x}=\{x_{1},x_{2},...,x_{T}\}$
  1. Each value in a unit sequence is an index, obtained by quantizing the normalized acoustic sequence
  2. Prosody modeling conditioned on the speech prompt $\mathbf{r}$ and the phoneme sequence $\mathbf{x}$ is then:
    (Eq. 1) $p(\mathbf{c}|\mathbf{x},\mathbf{r};\theta_{ARP})=\prod_{t=1}^{T} p(\mathbf{c}_{t}|\mathbf{c}_{<t},\mathbf{x},\mathbf{r};\theta_{ARP})$
    - $\theta_{ARP}$ : parameters of the AR prosody predictor
  3. Additionally, to model prosody with prompt-based in-context learning, the phoneme sequence and the prompt are used as a prefix, as in VALL-E
- This AR approach is also used in the Prosody-Latent Language Model (P-LLM) of Mega-TTS, which autoregressively models a vector-quantized codebook of prosody hidden states
  - BUT, as shown in the figure below, vector-quantization performance depends on the amount of training data, whereas the AR prosody predictor of MultiVerse models acoustic feature units and is therefore data-efficient
  - In addition, P-LLM is restricted to the specific languages present in the training data and requires alignment information, whereas the AR predictor places no restriction on the prompt speech
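
The two ingredients of the AR prosody predictor, turning normalized acoustic values into unit indices and scoring a unit sequence with the chain rule of Eq. 1, can be illustrated with a toy example. The uniform binning in `quantize`, the bin range, and the stand-in model `toy_step_probs` are assumptions of this sketch; the real predictor is a neural network conditioned on $\mathbf{x}$ and $\mathbf{r}$, and it handles duration, pitch, and energy rather than a single stream.

```python
import math

def quantize(values, num_bins, low=-3.0, high=3.0):
    """Map normalized values to integer unit indices with uniform bins."""
    width = (high - low) / num_bins
    return [min(num_bins - 1, max(0, math.floor((v - low) / width)))
            for v in values]

def toy_step_probs(prefix, num_bins=4, stay=0.7):
    """Hypothetical stand-in for p(c_t | c_<t, x, r): prefers repeating the last unit."""
    if not prefix:
        return [1.0 / num_bins] * num_bins
    probs = [(1.0 - stay) / (num_bins - 1)] * num_bins
    probs[prefix[-1]] = stay
    return probs

def sequence_log_prob(units, step_probs):
    """Chain rule: log p(c | x, r) = sum over t of log p(c_t | c_<t, x, r)."""
    total = 0.0
    for t, unit in enumerate(units):
        total += math.log(step_probs(units[:t])[unit])
    return total

normalized_pitch = [-1.2, -1.0, 0.1, 2.5]           # e.g. z-scored phoneme-level pitch
pitch_units = quantize(normalized_pitch, num_bins=4)
log_prob = sequence_log_prob(pitch_units, toy_step_probs)
```

Training such a predictor amounts to maximizing this log-probability for the ground-truth unit sequences, and inference samples or picks one unit at a time while feeding earlier units back in as context.
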
Non-Autoregressive Prosody Modeling
- Non-AR prosody modeling refines frame-level prosody from the time-dependent prosody features
- Here the source generator converts the acoustic feature embeddings into the source representation, reflecting the prosody characteristics of the prompt through the attention mechanism and modulation

Comparison of Vector Quantization-based Prosody Modeling and the AR Predictor

- Learning Objectives

The learning objective consists of a reconstruction loss, an adversarial loss, and an acoustic feature loss
- The reconstruction loss is the $L1$ loss between the generated mel-spectrogram and the ground truth
- The adversarial loss is based on LSGAN, combined with a Multi-Window Discriminator with 2D patch unit lengths $\{32, 64, 128\}$
- The acoustic feature loss compares the prosody predictor output with the ground-truth acoustic units and sums the cross-entropy losses over the acoustic features
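
The three loss terms can be written out in a few lines. This is a sketch under stated assumptions: the LSGAN terms are given without the optional 1/2 factor, the discriminator scores are plain lists standing in for the Multi-Window Discriminator outputs, and no loss weights are applied because the post does not give them. All function names are illustrative.

```python
import math

def mean(xs):
    xs = list(xs)
    return sum(xs) / len(xs)

def l1_loss(pred_mel, target_mel):
    """Mean absolute error over all time-frequency bins."""
    return mean(abs(p - t)
                for p_row, t_row in zip(pred_mel, target_mel)
                for p, t in zip(p_row, t_row))

def lsgan_d_loss(real_scores, fake_scores):
    """Discriminator: push real scores toward 1 and fake scores toward 0."""
    return mean((r - 1.0) ** 2 for r in real_scores) + mean(f ** 2 for f in fake_scores)

def lsgan_g_loss(fake_scores):
    """Generator: push the discriminator's scores on generated mels toward 1."""
    return mean((f - 1.0) ** 2 for f in fake_scores)

def cross_entropy(logits, target):
    """Negative log-softmax of the target class, computed stably."""
    top = max(logits)
    log_z = top + math.log(sum(math.exp(l - top) for l in logits))
    return log_z - logits[target]

def acoustic_feature_loss(logits_by_feature, targets_by_feature):
    """Sum over features of the mean per-step cross-entropy."""
    return sum(
        mean(cross_entropy(l, t)
             for l, t in zip(logits_by_feature[name], targets_by_feature[name]))
        for name in targets_by_feature
    )

rec = l1_loss([[1.0, 2.0], [3.0, 4.0]], [[1.0, 2.0], [3.0, 5.0]])
d_loss = lsgan_d_loss([1.0, 1.0], [0.0, 0.0])
g_loss = lsgan_g_loss([1.0])
logits = {name: [[0.0] * 4] for name in ("duration", "pitch", "energy")}
targets = {"duration": [2], "pitch": [0], "energy": [3]}
af_loss = acoustic_feature_loss(logits, targets)
```

With uniform logits over four units, each feature contributes $\log 4$ to the acoustic feature loss, which is a quick sanity check for an untrained predictor.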

- Multi-Task TTS

MultiVerse can perform multiple tasks depending on the input conditions
- Zero-shot TTS uses an unseen speaker prompt as input, and cross-lingual TTS is performed by passing a speech prompt and input text in different languages
- Speech style transfer is performed by feeding two prompts into different modules, as shown in the figure below
- In addition, the tasks can be combined to perform zero-shot cross-lingual TTS, zero-shot style transfer, zero-shot cross-lingual style transfer, and so on

3. Experiments

- Settings

Dataset : LibriTTS, VCTK, AI-Hub
Comparisons : GANSpeech, YourTTS

- Results

Zero-Shot Scenario
- Overall, MultiVerse shows the best zero-shot performance

Comparison with Data-Driven Models
- MultiVerse is compared against the data-driven large-scale models VALL-E, Mega-TTS, NaturalSpeech, and VoiceBox
- In terms of MOS, MultiVerse outperforms VALL-E

MultiVerse also performs best in cross-/intra-lingual synthesis

Speech Style Transfer
- MultiVerse is also superior in style transfer

Acoustic Feature Modeling
- Comparing the pitch/duration distributions, MultiVerse has a distribution similar to the ground truth

In terms of the $F_{0}$ contour as well, MultiVerse effectively reflects the manipulated pitch index
