[Paper Review] SimpleSpeech: Towards Simple and Efficient Text-to-Speech with Scalar Latent Transformer Diffusion Models

feVeRin 2024. 7. 12. 09:35

SimpleSpeech: Towards Simple and Efficient Text-to-Speech with Scalar Latent Transformer Diffusion Models

Diffusion-based non-autoregressive text-to-speech models need to be highly efficient
SimpleSpeech
- Uses SQ-Codec, a speech codec that performs scalar quantization
  - It maps complex speech signals into a finite, compact scalar latent space
- A transformer diffusion model is then applied on the scalar latent space of SQ-Codec
Paper (INTERSPEECH 2024) : Paper Link

1. Introduction

Most Text-to-Speech (TTS) systems depend on small-scale, high-quality labeled speech datasets, so training requires a complex pipeline with components such as prosody prediction and G2P conversion
- On the other hand, using large-scale speech data, as AudioLM does, can greatly simplify the TTS pipeline
  1. In particular, VALL-E, a Language Model (LM)-based approach, uses a pre-trained EnCodec to map speech signals to discrete tokens and obtains speech tokens with an autoregressive model
    - BUT, despite its expressive synthesis quality, it suffers from unstable and slow inference
  2. To address this, Non-AutoRegressive (NAR) approaches such as NaturalSpeech, SoundStorm, and VoiceBox can be considered
    - BUT, their training pipelines are complex and depend on phoneme-acoustic alignment
- Removing the dependence on alignment information would greatly simplify NAR TTS models, but the following problems must be solved first:
  1. The model must be trained on a large-scale speech-only dataset
  2. It must synthesize high-quality speech in a NAR manner
  3. It must solve the duration alignment problem without a dedicated duration model

-> SimpleSpeech is proposed to solve these problems and simplify NAR TTS models

SimpleSpeech
- Trains a NAR TTS system on large-scale unlabeled speech data and models the speech data in a finite/compact latent space
  1. Specifically, it introduces the Scalar-Quantization speech Codec (SQ-Codec), which maps complex speech signals into a compact scalar latent space,
  2. and then applies a Scalar Latent Transformer Diffusion model on the scalar latent space of SQ-Codec
- Uses sentence duration instead of the conventional phone-level duration
  - An in-context conditioning strategy is adopted so that the fine-grained alignment between the condition and the target sequence is learned implicitly

< Overview of SimpleSpeech >

Efficiently models large-scale speech datasets on top of SQ-Codec and the Scalar Latent Transformer
As a result, it synthesizes high-quality speech faster than existing methods

2. Method

SimpleSpeech consists mainly of SQ-Codec and the Scalar Latent Transformer Diffusion model

- Text Encoder and Speaker Encoder

The paper first trains SimpleSpeech on a large-scale speech-only dataset
- This requires text labels for the speech samples, so transcripts are obtained with an ASR model, Whisper-base
- A pre-trained language model then extracts textual representations, which serve as the conditional information for TTS
  - For zero-shot voice cloning, the first layer of XLSR-53 is used to extract a global embedding that represents speaker timbre

- Sentence Duration

Existing TTS models such as FastSpeech2 and NaturalSpeech typically rely on a duration predictor to model phone-level duration, which complicates the training pipeline
- SimpleSpeech therefore models sentence-level duration using the in-context learning ability of an LLM such as GPT-3.5-Turbo
  - An LLM can easily estimate the reading time of a sentence from its word count and prior knowledge
- Once the sentence-level duration is available, the alignment between words and latent features is learned implicitly, which improves the diversity of the synthesized speech
  1. In the training stage, the duration is obtained directly from the waveform length
    - Following Stable Audio, the paper uses a timing module to encode the duration into a global embedding
  2. In the inference stage, the length of the noisy sequence is set from the duration predicted by the LLM, and the predicted duration is also passed to the timing module
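The link between a sentence duration and the length of the noisy latent sequence can be sketched as follows. This is a minimal illustration, not the paper's code: the stride list comes from the SQ-Codec encoder description in this review, while the 16 kHz sampling rate and the function names `hop_size`, `num_latent_frames`, and `duration_from_waveform` are assumptions made for the example.

```python
import math

# Strides taken from the SQ-Codec encoder description in this review.
STRIDES = [2, 2, 4, 4, 5]
# Assumption: 16 kHz audio; the review does not state the sampling rate.
SAMPLE_RATE = 16_000


def hop_size(strides):
    """Total downsampling factor: the product of the per-block strides."""
    return math.prod(strides)


def duration_from_waveform(num_samples, sample_rate=SAMPLE_RATE):
    """Training stage: the duration is read directly off the waveform length."""
    return num_samples / sample_rate


def num_latent_frames(duration_sec, sample_rate=SAMPLE_RATE, strides=STRIDES):
    """Inference stage: length of the noisy latent sequence for a given duration."""
    num_samples = round(duration_sec * sample_rate)
    return math.ceil(num_samples / hop_size(strides))


# A 3-second sentence (for example, a duration estimated by an LLM).
print(hop_size(STRIDES))
print(num_latent_frames(3.0))
```

Under these assumptions one second of audio corresponds to 50 latent frames, so a single scalar per sentence is enough to fix the shape of the sequence the diffusion model has to denoise.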

- SQ-Codec

Audio codecs based on Residual Vector Quantization (RVQ) require complex loss design
- The paper therefore replaces RVQ with scalar quantization, so that the codec can be trained with only a reconstruction loss and an adversarial loss, without complex training tricks
- Scalar quantization maps complex speech signals into a finite, compact latent space that suits diffusion models
  1. Let $\mathbf{h} \in \mathcal{R}^{T * d}$ denote the output feature of the codec encoder
    - $T, d$ : the number of frames and the vector dimension, respectively
  2. Then every vector $\mathbf{h}_i$ can be quantized into a fixed scalar space with a parameter-free scalar quantization module:
    (Eq. 1) $\mathbf{h}_i = \text{torch.tanh}(\mathbf{h}_i), \,\,\, \mathbf{s}_i = \text{torch.round}(\mathbf{h}_i * S) / S$
    - $S$ : hyper-parameter that determines the scope of the scalar space
    - To obtain gradients through the rounding operation, a straight-through estimator is used, as in VQ-VAE
  3. As a result, scalar quantization uses the $\tanh$ function to map feature values into $[-1, 1]$, and the rounding operation then reduces the range to $2 * S + 1$ distinct values
    - The resulting value domain is called the scalar latent space
- Encoder and Decoder
  1. The SQ-Codec encoder consists of 5 convolution blocks, each with 2 causal 1D-convolutional layers and 1 downsampling layer
    - The downsample strides are set to $[2, 2, 4, 4, 5]$, which gives 320x downsampling along the time dimension ($2 \cdot 2 \cdot 4 \cdot 4 \cdot 5 = 320$)
  2. The decoder mirrors the encoder, using transposed convolutions in place of the strided convolutions
- Discriminator and Training Loss
  1. SQ-Codec is trained with a Multi-Scale Discriminator
  2. The loss function then consists of the discriminator's adversarial loss and a reconstruction loss $\mathcal{L}_{rec}$ over both the time and frequency domains
    - $\mathcal{L}_{rec}$ includes the $L1$ loss between the reconstructed waveform and the ground-truth, and the MSE loss between the STFT spectrogram of the reconstruction and that of the ground-truth
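The scalar quantization of Eq. 1 can be sketched in plain Python for a single frame. This is only an illustration of the forward pass: the paper works on tensors with `torch.tanh` and `torch.round`, the value `S = 4` is a hypothetical choice made for readability, and the helper names `scalar_quantize` and `scalar_levels` are not from the paper. Python's built-in `round` rounds halves to the nearest even integer, which matches the behaviour of `torch.round`.

```python
import math


def scalar_quantize(frame, S):
    """Eq. 1 for one frame: squash with tanh, then round onto a grid of step 1/S."""
    squashed = [math.tanh(v) for v in frame]
    return [round(v * S) / S for v in squashed]


def scalar_levels(S):
    """Every value a quantized element can take: -1, ..., -1/S, 0, 1/S, ..., 1."""
    return [k / S for k in range(-S, S + 1)]


S = 4  # hypothetical value, chosen only for this illustration
frame = [-3.0, -0.2, 0.0, 0.4, 3.0]

quantized = scalar_quantize(frame, S)
print(quantized)
print(len(scalar_levels(S)))  # 2 * S + 1 distinct levels
```

The sketch leaves out the straight-through estimator, which only matters for the backward pass: the rounding step has zero gradient almost everywhere, so during training the gradient is copied through it unchanged.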

- Scalar Latent Transformer Diffusion Models

With SQ-Codec mapping speech data into the scalar latent space, a latent diffusion model is then used to model the speech data
- Because scalar quantization effectively bounds the value range of each element, the resulting sampling space is simple
Structurally, the scalar latent transformer diffusion model consists of:
1. Transformer-based Diffusion Backbone
  - A GPT2-like transformer backbone with 12 attention layers, 8 attention heads, and a model dimension of 768
2. In-Context Conditioning
  - The features of the time step $t$ and the condition $c$ are prepended to the input sequence as a prefix sequence, and after the final block the conditioning sequence is removed from the output sequence
  - This allows a standard GPT-like structure to be used without architectural changes
3. Scalar Latent Diffusion
  - Latent Diffusion Models (LDM) are effective for complex data distributions, but SimpleSpeech must account for the finite/compact scalar latent space produced by SQ-Codec
  - The Scalar Latent Diffusion network is therefore trained to transfer a Gaussian distribution to the scalar latent space
    - An MSE loss is applied, following the DDPM training strategy
  - In addition, to ensure that the final output belongs to the scalar latent space, the final prediction is constrained by the scalar quantization operation:
    (Eq. 2) $\hat{\mathbf{x}}_0 = SQ(\theta(\mathbf{x}_T, T, c))$
    - $\theta$ : neural network, $T$ : timestep
    - $\mathbf{x}_T$ : feature sampled from the Gaussian distribution, $c$ : condition information
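In-context conditioning and the final constraint of Eq. 2 can be sketched together on plain Python lists. Everything here is a hypothetical stand-in: `identity_backbone` replaces the transformer, the 2-dimensional embeddings and `S = 4` are toy values, and `snap_to_scalar_space` assumes that $SQ$ in Eq. 2 amounts to clamping to $[-1, 1]$ and rounding to step $1/S$. The review does not say whether the `tanh` of Eq. 1 is reapplied at this step, so that part is an assumption.

```python
def add_prefix(time_emb, cond_embs, noisy_latents):
    """Build the transformer input [t, c_1..c_n, x_1..x_T]; also return the prefix length."""
    prefix = [time_emb] + list(cond_embs)
    return prefix + list(noisy_latents), len(prefix)


def strip_prefix(outputs, prefix_len):
    """Drop the conditioning positions so only the latent positions remain."""
    return outputs[prefix_len:]


def snap_to_scalar_space(frame, S):
    """Assumed form of SQ in Eq. 2: clamp to [-1, 1], then round to step 1/S."""
    return [round(max(-1.0, min(1.0, v)) * S) / S for v in frame]


def identity_backbone(tokens):
    """Hypothetical stand-in for the transformer: returns a copy of its input."""
    return [list(tok) for tok in tokens]


time_emb = [0.0, 0.0]                    # toy embedding of the time step t
cond_embs = [[0.1, 0.2], [0.3, 0.4]]     # toy embeddings of the condition c
latents = [[0.26, -1.7], [0.9, 0.12]]    # toy noisy latent frames

sequence, prefix_len = add_prefix(time_emb, cond_embs, latents)
outputs = strip_prefix(identity_backbone(sequence), prefix_len)
x0_hat = [snap_to_scalar_space(frame, S=4) for frame in outputs]
print(x0_hat)
```

Because the conditioning lives entirely in the input sequence, the backbone itself needs no cross-attention or adaptive normalization layers, which is what lets a standard GPT-like stack be reused as-is.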

3. Experiments

- Settings

Dataset : LibriSpeech
Comparisons
- Codec : EnCodec, DAC, HiFi-Codec
- TTS : VALL-E, Pheme TTS, XTTS, E3TTS, NaturalSpeech

- Results

Performance of SQ-Codec Model
- SQ-Codec achieves the best performance among the compared codec models

Ablation Study of SQ-Codec Model
- SQ-Codec performance improves as $S$ and $d$ are increased

Performance of SimpleSpeech
- SimpleSpeech achieves the best performance among the compared TTS models

In terms of MOS, it also performs on par with NaturalSpeech and E3TTS

Ablation Study of SimpleSpeech
- In the ablation study, replacing any of the existing components degrades performance
