[Paper Review] Lightweight Zero-Shot Text-to-Speech with Mixture of Adapters

feVeRin 2024. 7. 9. 09:17

Lightweight Zero-Shot Text-to-Speech with Mixture of Adapters

Zero-shot text-to-speech built on large-scale models reproduces speaker characteristics well, but the models are too large for practical deployment
Zero-Shot TTS with MoA
- Mixture of Adapters (MoA) modules are integrated into the decoder and variance adaptor of a non-autoregressive TTS model
- Adapters suited to the speaker's characteristics are selected based on the speaker embedding, which improves adaptation ability
Paper (INTERSPEECH 2024) : Paper Link

1. Introduction

Zero-shot Text-to-Speech (TTS) aims to adapt to a target speaker using only a minimal amount of target-speaker utterances, without re-training the acoustic model
- In particular, VALL-E maximizes zero-shot TTS performance by leveraging a large-scale language model
  - BUT, its considerable parameter size makes it hard to use on edge devices
- Consequently, lightweight zero-shot TTS must capture and model the characteristics of diverse speakers while keeping the parameter count limited
  1. Methods such as PortaSpeech and LightGrad have been proposed for this, but they are limited to single-speaker synthesis
  2. LightTTS, in contrast, supports multi-speaker synthesis, but its naturalness is lower
- Meanwhile, Mixture of Experts (MoE), which builds multiple parallel expert modules and selectively activates one or more of them, is effective for designing expressive, parameter-efficient models
  1. In particular, by determining a weight for each expert, the model can handle diverse tasks effectively
  2. Additionally, MoE can increase model capacity with minimal additional parameters while maintaining training efficiency

-> The paper therefore proposes combining Mixture of Adapters (MoA), a variant of MoE, with zero-shot TTS

Zero-Shot TTS with MoA
- MoA gated by the speaker embedding changes the network configuration according to speaker characteristics
  - This makes it possible to build a speaker-adapted arrangement efficiently
- Additionally, because MoA is trained on a large training dataset, it can reflect diverse speaker characteristics at inference time

< Overview of This Paper >

A lightweight zero-shot TTS model based on the MoA concept
As a result, it achieves better synthesis quality than existing methods

2. Method

The paper aims to extend a TTS model with MoA modules
- First, zero-shot TTS generally consists of 3 components: a TTS model with an encoder and decoder, a speaker embedding extractor, and a vocoder
  1. The speaker embedding extractor can be kept lightweight by using speaker-extraction methods such as d-vector, x-vector, or ones based on a Self-Supervised Learning (SSL) speech model
  2. For the vocoder, a lightweight inverse-STFT-based method such as Vocos can be used
- Consequently, the paper leaves the speaker extractor and vocoder aside and focuses on making the backbone TTS model lightweight

- Backbone SSL-based TTS Model

The paper first processes the input speech sequence with an SSL-based embedding extractor
- The extractor consists of an SSL model and an embedding module, and converts the SSL model's speech representations into a fixed-length vector, the speaker embedding
- Structurally, the embedding module consists of a weighted sum, a bidirectional GRU, and attention
  1. In the weighted-sum component, the speech representations from each layer of the SSL model are weighted by learnable weights and then summed
  2. The bidirectional GRU then processes the summed representation, and its hidden states are aggregated by the attention layer
- The resulting speaker embedding is passed to the TTS model, and the TTS model and embedding module are trained jointly
- At inference time, the embedding extractor can be run separately from the TTS model to pre-compute the speaker embedding, as with d-vector or x-vector
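The weighted-sum and attention steps of the embedding module can be sketched roughly as follows. This is a minimal illustration in plain Python, not the paper's implementation: the softmax normalization of the layer weights, the dot-product attention with a learnable `query` vector, and all function names are assumptions, and the bidirectional GRU between the two steps is omitted for brevity.

```python
import math

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def weighted_layer_sum(layer_feats, layer_logits):
    """Combine SSL layers: layer_feats[l][t][d] is layer l, frame t, dimension d.

    Assumption: the learnable per-layer weights are softmax-normalized.
    """
    weights = softmax(layer_logits)
    n_frames, dim = len(layer_feats[0]), len(layer_feats[0][0])
    return [
        [sum(w * layer_feats[l][t][d] for l, w in enumerate(weights))
         for d in range(dim)]
        for t in range(n_frames)
    ]

def attention_pool(frames, query):
    """Aggregate a variable number of frames into one fixed-length vector.

    Assumption: attention scores are dot products with a learnable query.
    """
    scores = [sum(q * h for q, h in zip(query, frame)) for frame in frames]
    alphas = softmax(scores)
    dim = len(frames[0])
    return [sum(a * frame[d] for a, frame in zip(alphas, frames))
            for d in range(dim)]

# Toy input: 2 SSL layers, 3 frames, feature dimension 2.
layer_feats = [
    [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],  # layer 0
    [[0.0, 2.0], [2.0, 0.0], [2.0, 2.0]],  # layer 1
]
summed = weighted_layer_sum(layer_feats, layer_logits=[0.0, 0.0])
# In the described module a bidirectional GRU would process `summed` here.
embedding = attention_pool(summed, query=[0.0, 0.0])
```

With equal layer logits and a zero query, both softmaxes are uniform, so `embedding` is simply the mean over frames of the layer-averaged features. The output length is fixed regardless of the number of frames, which is what allows the speaker embedding to be pre-computed.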

- Speaker Embedding based MoA

The MoA module consists of $N$ lightweight bottleneck adapters
- Each adapter has two feed-forward layers with layer normalization, and a trainable gating network determines the adapter weights from the speaker embedding
  - All components of this module are trained jointly with the backbone TTS model
- The MoA module can be formulated as follows:
  (Eq. 1) $\text{MoA}(\mathbf{x},\mathbf{x}_{e})=\mathbf{x}+\sum_{i=1}^{N}g_{i}(\mathbf{x}_{e})\cdot \text{Adapter}_{i}(\mathbf{x})$
  - $\mathbf{x}\in \mathbb{R}^{D}$ : input, $\mathbf{x}_{e}\in\mathbb{R}^{D_{emb}}$ : speaker embedding
  - $\text{Adapter}_{i}:\mathbb{R}^{D}\rightarrow \mathbb{R}^{D}$ : the $i$-th adapter in the set of $N$ adapters $\{\text{Adapter}_{i}\}_{i=1}^{N}$
  - $g:\mathbb{R}^{D_{emb}}\rightarrow \mathbb{R}^{N}$ : trainable gating network, with $g_{i}(\mathbf{x}_{e})$ its $i$-th output
- Two approaches to MoA can be considered
  1. Dense MoA : the summation runs over all $N$ adapters
  2. Sparse MoA : only the top-$k$ weights $g_{i}$ are kept and the remaining weights are set to 0
- Sparse MoA reduces inference time while still allowing many adapters to be used during training to improve expressiveness
  - To ensure a balanced load across the adapters, the paper trains the model with a multi-task objective
- Consequently, the loss consists of the standard Mean Squared Error (MSE) and an auxiliary loss, the importance loss $\mathcal{L}_{importance}$:
  (Eq. 2) $\mathcal{L}_{importance}(\mathbf{X})=\left( \frac{\sigma(\text{Importance}(\mathbf{X}))}{\mu(\text{Importance}(\mathbf{X}))} \right)^{2}$
  (Eq. 3) $\text{Importance}(\mathbf{X})_{i}=\sum_{\mathbf{x}_{e}\in\mathbf{X}}g_{i}(\mathbf{x}_{e})$
  - $\mathbf{X}\in \mathbb{R}^{n\times D_{emb}}$ : batch of $n$ speaker embeddings
  - $\mu, \sigma$ : mean and standard deviation of the $N$ per-adapter importance values
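Eq. 1 and the dense/sparse variants can be illustrated with a small plain-Python sketch. Several details here are assumptions not given in this summary: the gate is a linear layer followed by softmax, each adapter applies LayerNorm, a down-projection, ReLU, and an up-projection, the top-$k$ weights are kept without renormalization, and the sizes, random initialization, and all names are hypothetical.

```python
import math
import random

def softmax(values):
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def linear(x, weight, bias):
    """Dense layer: weight has one row per output unit."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weight, bias)]

def layer_norm(x, eps=1e-5):
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def make_linear(rng, d_in, d_out):
    """Randomly initialized (weight, bias) pair, for illustration only."""
    weight = [[rng.uniform(-0.1, 0.1) for _ in range(d_in)]
              for _ in range(d_out)]
    return weight, [0.0] * d_out

class Adapter:
    """Bottleneck adapter (assumed layout): LayerNorm -> down -> ReLU -> up."""
    def __init__(self, rng, dim, bottleneck):
        self.down = make_linear(rng, dim, bottleneck)
        self.up = make_linear(rng, bottleneck, dim)

    def __call__(self, x):
        h = linear(layer_norm(x), *self.down)
        h = [max(0.0, v) for v in h]
        return linear(h, *self.up)

def gate_weights(x_e, gate, top_k=None):
    """g(x_e): N weights from the speaker embedding; optionally keep top-k."""
    weights = softmax(linear(x_e, *gate))
    if top_k is not None:
        order = sorted(range(len(weights)), key=lambda i: weights[i],
                       reverse=True)
        keep = set(order[:top_k])
        weights = [w if i in keep else 0.0 for i, w in enumerate(weights)]
    return weights

def moa(x, x_e, adapters, gate, top_k=None):
    """Eq. 1: x + sum_i g_i(x_e) * Adapter_i(x)."""
    out = list(x)
    for g_i, adapter in zip(gate_weights(x_e, gate, top_k), adapters):
        if g_i == 0.0:
            continue  # sparse MoA: zero-weight adapters are never evaluated
        out = [o + g_i * a for o, a in zip(out, adapter(x))]
    return out

rng = random.Random(0)
D, D_EMB, N = 4, 3, 4
adapters = [Adapter(rng, D, bottleneck=2) for _ in range(N)]
gate = make_linear(rng, D_EMB, N)
x, x_e = [0.5, -1.0, 2.0, 0.1], [1.0, 0.0, -1.0]
dense_out = moa(x, x_e, adapters, gate)            # all N adapters
sparse_out = moa(x, x_e, adapters, gate, top_k=2)  # only 2 adapters run
```

Because the gate depends only on the speaker embedding `x_e`, the selected adapters are the same for every frame of an utterance, so in the sparse case the unused adapters can be skipped for the whole utterance.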

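The importance loss of Eq. 2 and Eq. 3 is the squared coefficient of variation of the per-adapter importance values. A minimal sketch, assuming the gate weights for a batch are already available as a list of rows and that $\sigma$ is the population standard deviation (the summary does not say which estimator is used); the function names and toy inputs are hypothetical:

```python
def importance(gate_batch):
    """Eq. 3: sum each adapter's gate weight over the batch.

    gate_batch[b][i] is g_i(x_e) for the b-th speaker embedding.
    """
    n_adapters = len(gate_batch[0])
    return [sum(row[i] for row in gate_batch) for i in range(n_adapters)]

def importance_loss(gate_batch):
    """Eq. 2: (sigma / mu)^2 over the N importance values."""
    imp = importance(gate_batch)
    mean = sum(imp) / len(imp)
    var = sum((v - mean) ** 2 for v in imp) / len(imp)  # population variance
    return var / (mean ** 2)

# Toy gate weights for a batch of 2 speakers and N = 2 adapters.
balanced = [[0.5, 0.5], [0.5, 0.5]]  # both adapters used equally
skewed = [[0.9, 0.1], [0.8, 0.2]]    # adapter 0 dominates

loss_balanced = importance_loss(balanced)  # 0.0: perfectly even load
loss_skewed = importance_loss(skewed)      # about 0.49: uneven load
```

The loss is zero when every adapter receives the same total weight and grows as the gate concentrates on a few adapters, which is why adding it to the MSE objective pushes toward a balanced load.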
3. Experiments

- Settings

Dataset : Japanese Speech Dataset
Comparisons : FastSpeech2
- MoA insertion : Small (S), Medium Small (M/S), Medium (M), Large (L)

- Results

In terms of overall performance, the proposed method performs best

In the AB test, the proposed method also performs best

In the XAB test, the proposed method is also the most preferred

Comparing the correlations for the first and third decoder layers, the proposed method shows high correlation for speakers with similar characteristics
- That is, the proposed method can reflect characteristic-specific experts
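One way to picture this analysis, assuming the correlation is taken between the gating weight vectors of different speakers at a given layer (this summary does not state which quantity is correlated), is a Pearson correlation over the $N$ gate weights. The speaker names and weight values below are invented purely for illustration and are not results from the paper.

```python
import math

def pearson(a, b):
    """Pearson correlation coefficient between two equal-length vectors."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    norm_a = math.sqrt(sum((x - mean_a) ** 2 for x in a))
    norm_b = math.sqrt(sum((y - mean_b) ** 2 for y in b))
    return cov / (norm_a * norm_b)

# Hypothetical gate weights g(x_e) for three speakers at one decoder layer.
gates = {
    "speaker_a": [0.70, 0.20, 0.05, 0.05],
    "speaker_b": [0.65, 0.25, 0.05, 0.05],  # similar voice to speaker_a
    "speaker_c": [0.05, 0.05, 0.20, 0.70],  # very different voice
}

sim_ab = pearson(gates["speaker_a"], gates["speaker_b"])
sim_ac = pearson(gates["speaker_a"], gates["speaker_c"])
```

In this toy case `sim_ab` is close to 1 while `sim_ac` is negative, which is the pattern one would expect if adapters specialize by speaker characteristic.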
