[Paper Review] SC-GlowTTS: An Efficient Zero-Shot Multi-Speaker Text-to-Speech Model

feVeRin 2024. 3. 6. 09:22

SC-GlowTTS: An Efficient Zero-Shot Multi-Speaker Text-to-Speech Model

Zero-shot text-to-speech needs models that improve similarity to unseen speakers
SC-GlowTTS
- Introduces a speaker-conditional architecture built on a flow-based decoder
- Compares three text encoders: a dilated residual convolutional encoder, a gated convolutional encoder, and a transformer-based encoder
- Additionally shows that adjusting a GAN-based vocoder on the spectrograms predicted by the text-to-speech model substantially improves speech quality and similarity
Paper (INTERSPEECH 2021) : Paper Link

1. Introduction

Text-to-Speech (TTS) systems are generally optimized for the single-speaker setting
- Zero-Shot multi-speaker TTS (ZS-TTS) aims to synthesize the voice of an unseen speaker, one not observed during training, from a sample only a few seconds long
  - To this end, prior work has extended Tacotron2 or used external embeddings trained with the Generalized End-to-End (GE2E) loss
- A key problem in ZS-TTS is the similarity gap between observed and unobserved speakers
  - Attentron uses an attention mechanism to extract detailed style from multiple reference samples
- However, ZS-TTS still shows a similarity gap, and most approaches depend on Tacotron2
  - In contrast, adopting a flow-based approach such as GlowTTS can substantially improve synthesis quality and inference speed

-> The paper therefore proposes Speaker Conditional GlowTTS (SC-GlowTTS), which uses a flow-based model for zero-shot TTS on unseen speakers

SC-GlowTTS
- Uses GlowTTS to convert input characters into a spectrogram
- Learns speaker embedding vectors with an external speaker encoder trained with the Angular Prototypical loss, and converts the output spectrogram to a waveform with a HiFi-GAN vocoder
- Additionally, adjusts the GAN-based vocoder on the spectrograms predicted by the TTS model to improve similarity and synthesis quality for new speakers
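
The Angular Prototypical loss named above can be sketched as follows. This is a minimal illustration of the commonly cited formulation (per-speaker centroid from a support set, one query utterance per speaker, scaled cosine scores followed by a softmax cross-entropy), not the speaker encoder's actual training code. The function names, the nested-list data layout, and the fixed values `w=10.0` and `b=-5.0` are assumptions made for the example; in the usual formulation `w` and `b` are learnable.

```python
import math


def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def angular_prototypical_loss(batch, w=10.0, b=-5.0):
    """batch[j] holds the utterance embeddings of speaker j.

    The last utterance of each speaker is the query; the remaining ones
    form the support set whose mean is the speaker centroid.
    w and b are illustrative constants (assumption), normally learned.
    """
    queries = [utts[-1] for utts in batch]
    centroids = []
    for utts in batch:
        support = utts[:-1]
        dim = len(support[0])
        centroids.append(
            [sum(u[d] for u in support) / len(support) for d in range(dim)]
        )

    total = 0.0
    for j, query in enumerate(queries):
        # Scaled cosine score of this query against every speaker centroid.
        scores = [w * cosine(query, c) + b for c in centroids]
        # Numerically stable log-sum-exp for the softmax denominator.
        m = max(scores)
        log_sum_exp = m + math.log(sum(math.exp(s - m) for s in scores))
        total += log_sum_exp - scores[j]  # cross-entropy for the true speaker
    return total / len(batch)


# Two toy speakers with two support utterances and one query each.
toy_batch = [
    [[1.0, 0.0], [1.0, 0.1], [1.0, 0.05]],
    [[0.0, 1.0], [0.1, 1.0], [0.05, 1.0]],
]
loss = angular_prototypical_loss(toy_batch)
```

The loss is small when each query lies close to its own speaker's centroid and grows when queries point toward another speaker's centroid.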

< Overview of SC-GlowTTS >

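A zero-shot multi-speaker TTS model that can synthesize speech even when only 11 speakers are available for training
As a result, it offers high-quality speech synthesis and faster-than-real-time inference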

2. Speaker Conditional GlowTTS Model

Speaker Conditional GlowTTS (SC-GlowTTS) builds on GlowTTS with several modifications
- For the text encoder, a residual dilated convolutional network and a gated convolutional network are compared alongside the transformer-based encoder of GlowTTS
  1. The convolutional residual encoder adopts Mish in place of the ReLU activation
  2. The gated convolutional network consists of 9 convolutional blocks
    - Each block contains dropout, a 1D convolution layer, and layer normalization
- The flow-based decoder uses the same architecture as GlowTTS
  - To turn it into a zero-shot TTS model, the speaker embedding is applied to the affine coupling layers of all 12 decoder blocks
- In addition, the FastSpeech duration predictor is adopted to predict character durations
  - The speaker embedding is also added to the duration predictor so that it can capture the speech characteristics of different speakers
- Finally, SC-GlowTTS uses a HiFi-GAN vocoder
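
To make the speaker conditioning of the affine coupling layers more concrete, here is a minimal sketch of a single speaker-conditioned coupling step. It is a toy under stated assumptions, not the model's implementation: the conditioner is a random linear map instead of the network used in GlowTTS, the speaker embedding `g` is simply concatenated to the conditioner input, and all names (`make_conditioner`, `coupling_forward`, `coupling_inverse`) are hypothetical.

```python
import math
import random


def make_conditioner(half_dim, spk_dim, seed=0):
    """Toy conditioner mapping (x_a, g) -> (log_s, t).

    Stand-in for the network inside a coupling layer; the random
    weights are purely illustrative (assumption).
    """
    rng = random.Random(seed)
    in_dim = half_dim + spk_dim
    w_s = [[rng.uniform(-0.5, 0.5) for _ in range(in_dim)] for _ in range(half_dim)]
    w_t = [[rng.uniform(-0.5, 0.5) for _ in range(in_dim)] for _ in range(half_dim)]

    def conditioner(x_a, g):
        h = list(x_a) + list(g)  # the speaker embedding joins the input
        log_s = [math.tanh(sum(w * v for w, v in zip(row, h))) for row in w_s]
        t = [sum(w * v for w, v in zip(row, h)) for row in w_t]
        return log_s, t

    return conditioner


def coupling_forward(x, g, conditioner):
    """x -> z. The first half passes through; the second half is scaled
    and shifted using parameters computed from (first half, speaker)."""
    half = len(x) // 2
    x_a, x_b = x[:half], x[half:]
    log_s, t = conditioner(x_a, g)
    z_b = [xb * math.exp(ls) + tt for xb, ls, tt in zip(x_b, log_s, t)]
    log_det = sum(log_s)  # log-determinant of the Jacobian
    return x_a + z_b, log_det


def coupling_inverse(z, g, conditioner):
    """z -> x. Exact inverse of coupling_forward for the same speaker g."""
    half = len(z) // 2
    z_a, z_b = z[:half], z[half:]
    log_s, t = conditioner(z_a, g)
    x_b = [(zb - tt) * math.exp(-ls) for zb, ls, tt in zip(z_b, log_s, t)]
    return z_a + x_b


cond = make_conditioner(half_dim=2, spk_dim=3)
x = [0.3, -1.2, 0.8, 2.0]   # one toy 4-dimensional frame
g = [0.5, -0.1, 0.9]        # toy speaker embedding
z, log_det = coupling_forward(x, g, cond)
x_rec = coupling_inverse(z, g, cond)  # recovers x up to rounding error
```

Because the scale and shift depend on `g`, inverting the same `z` with a different speaker embedding yields a different frame, which is the property the decoder relies on for speaker conditioning.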
During training, SC-GlowTTS uses Monotonic Alignment Search (MAS)
- The decoder's objective is to map the mel-spectrogram, conditioned on the input speaker embedding, into the prior distribution $P_{Z}$, while MAS aims to align the prior distribution $P_{Z}$ with the encoder output
- At inference time MAS is not used; the prior distribution $P_{Z}$ and the alignment are instead predicted by the text encoder and the duration predictor
- The latent variable $Z$ is sampled from the prior distribution $P_{Z}$
  - The mel-spectrogram is then synthesized by transforming $Z$ through the inverted flow-based decoder, conditioned on the speaker embedding
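
The alignment step can be illustrated with a small dynamic-programming sketch of Monotonic Alignment Search as described for GlowTTS: each latent frame is assigned to one text token, the assignment never moves backward, and no token is skipped. The function names and the toy one-dimensional data are assumptions for illustration; real implementations operate on batched tensors.

```python
import math


def gaussian_log_likelihood(z, mu, sigma):
    """Log density of vector z under a diagonal Gaussian N(mu, sigma^2)."""
    return sum(
        -0.5 * math.log(2 * math.pi) - math.log(s) - 0.5 * ((v - m) / s) ** 2
        for v, m, s in zip(z, mu, sigma)
    )


def monotonic_alignment_search(log_p):
    """log_p[i][j] is the log-likelihood of frame j under token i's prior.

    Returns a list giving the token index assigned to each frame.
    """
    n_tokens, n_frames = len(log_p), len(log_p[0])
    if n_frames < n_tokens:
        raise ValueError("need at least one frame per token")
    neg_inf = float("-inf")
    q = [[neg_inf] * n_frames for _ in range(n_tokens)]
    q[0][0] = log_p[0][0]
    for j in range(1, n_frames):
        for i in range(min(j + 1, n_tokens)):
            stay = q[i][j - 1]                               # same token as before
            advance = q[i - 1][j - 1] if i > 0 else neg_inf  # move to next token
            q[i][j] = max(stay, advance) + log_p[i][j]

    # Backtrack from the last token at the last frame.
    alignment = [0] * n_frames
    i = n_tokens - 1
    for j in range(n_frames - 1, -1, -1):
        alignment[j] = i
        if j > 0 and i > 0 and q[i - 1][j - 1] >= q[i][j - 1]:
            i -= 1
    return alignment


# Two tokens with prior means 0 and 5, and four 1-D latent frames.
mus = [[0.0], [5.0]]
sigmas = [[1.0], [1.0]]
frames = [[0.1], [-0.2], [4.9], [5.3]]
log_p = [
    [gaussian_log_likelihood(z, mu, s) for z in frames]
    for mu, s in zip(mus, sigmas)
]
alignment = monotonic_alignment_search(log_p)                     # [0, 0, 1, 1]
durations = [alignment.count(i) for i in range(len(mus))]         # [2, 2]
```

The per-token frame counts in `durations` are the kind of target a duration predictor can be trained on, so that the alignment can be predicted at inference time without MAS.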

3. Experiments

- Settings

Dataset : LibriSpeech, VCTK, VoxCeleb
Comparisons : Attention Zero-Shot, Tacotron2

- Results

Overall Performance
- In terms of synthesis quality (MOS), SC-GlowTTS performs best overall
  - It is likewise best on Sim-MOS and SECS, which measure speaker similarity
  - In terms of RTF, SC-GlowTTS also synthesizes faster than Tacotron2
- Comparing the encoders of SC-GlowTTS:
  - SC-GlowTTS-Trans, which uses the transformer-based encoder, performs best
  - Fine-tuning the HiFi-GAN vocoder improves the performance of SC-GlowTTS-Trans further
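
For readers unfamiliar with the two objective metrics above, the sketch below shows how they are typically computed. SECS (Speaker Encoder Cosine Similarity) is taken here as the mean cosine similarity between speaker embeddings of reference and synthesized audio, and RTF (real-time factor) as synthesis time divided by the duration of the generated audio. The function names and the toy embeddings are assumptions; the speaker encoder that would produce real embeddings is not modeled.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity in [-1, 1] between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def secs(reference_embeddings, synthesized_embeddings):
    """Mean cosine similarity over paired (reference, synthesized) embeddings."""
    pairs = list(zip(reference_embeddings, synthesized_embeddings))
    return sum(cosine_similarity(r, s) for r, s in pairs) / len(pairs)


def real_time_factor(synthesis_seconds, audio_seconds):
    """Values below 1.0 mean synthesis runs faster than real time."""
    return synthesis_seconds / audio_seconds


# Toy embeddings standing in for speaker-encoder outputs (assumption).
reference = [[1.0, 0.0], [1.0, 0.0]]
synthesized = [[2.0, 0.0], [0.0, 1.0]]
score = secs(reference, synthesized)     # 0.5: one perfect match, one orthogonal
rtf = real_time_factor(0.5, 2.0)         # 0.25: four times faster than real time
```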

SC-GlowTTS Performance with Few-Shot Speakers
- Evaluated with 11 few-shot speakers:
  - SC-GlowTTS obtains a SECS of 0.7707, a MOS of 3.71, and a Sim-MOS of 3.93

Zero-Shot Voice Conversion
- No speaker identity information is given to the encoder, so the distribution predicted by the encoder should be independent of speaker identity
  - SC-GlowTTS can therefore perform voice conversion using only the decoder
- As a result, when performing zero-shot conversion, SC-GlowTTS can synthesize speech that resembles the unseen speaker
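
The decoder-only conversion described above can be sketched with a toy invertible decoder: the source mel-spectrogram is mapped to the latent with the source speaker embedding, and the latent is then inverted with the target speaker embedding. The one-step elementwise flow, the `speaker_params` mapping, and all names below are hypothetical stand-ins for the 12-block flow-based decoder, chosen only to show the data flow.

```python
import math


def speaker_params(g):
    """Toy mapping from a speaker embedding to (log_scale, shift).

    Purely illustrative (assumption); the real decoder computes its
    conditioning inside each affine coupling layer.
    """
    log_scale = math.tanh(sum(g) / len(g))
    shift = g[0]
    return log_scale, shift


def decoder_forward(mel, g):
    """mel -> z: remove the speaker-dependent scale and shift."""
    log_scale, shift = speaker_params(g)
    return [[(v - shift) * math.exp(-log_scale) for v in frame] for frame in mel]


def decoder_inverse(z, g):
    """z -> mel: apply the speaker-dependent scale and shift."""
    log_scale, shift = speaker_params(g)
    return [[v * math.exp(log_scale) + shift for v in frame] for frame in z]


def convert_voice(mel_src, g_src, g_tgt):
    """Encode with the source speaker, decode with the target speaker."""
    z = decoder_forward(mel_src, g_src)
    return decoder_inverse(z, g_tgt)


mel_src = [[0.2, 0.4, 0.1], [0.3, 0.5, 0.0]]  # two toy frames, three mel bins
g_src = [0.5, -0.2]                           # toy source speaker embedding
g_tgt = [-0.3, 0.8]                           # toy unseen target embedding
mel_converted = convert_voice(mel_src, g_src, g_tgt)
```

No text encoder is involved at any point, which matches the observation that the decoder alone is enough for conversion.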
