[Paper Review] MaskGCT: Zero-Shot Text-to-Speech with Masked Generative Codec Transformer

feVeRin 2025. 2. 23. 12:27

MaskGCT: Zero-Shot Text-to-Speech with Masked Generative Codec Transformer

Large-scale text-to-speech systems can be divided into autoregressive and non-autoregressive approaches
- Autoregressive approaches are limited in robustness and duration controllability
- Non-autoregressive approaches require explicit alignment information between text and speech during training
MaskGCT
- A fully non-autoregressive text-to-speech model that needs neither explicit alignment information between text and speech supervision nor phone-level duration prediction
- Uses a two-stage framework: it first predicts semantic tokens from text, then predicts acoustic tokens conditioned on those semantic tokens
- Generates masked tokens from the given condition and prompt through mask-and-predict learning
Paper (ICLR 2025) : Paper Link

1. Introduction

Large-scale zero-shot text-to-speech (TTS) systems such as SpearTTS, VALL-E, CLaM-TTS, and VoiceCraft are mainly built on autoregressive (AR) or non-autoregressive (NAR) models
- AR-based systems quantize speech into discrete tokens and then generate the tokens autoregressively with a decoder-only model
  - BUT, the AR approach suffers from poor robustness and slow inference speed
- NAR-based systems such as VoiceBox and Mega-TTS2, which rely on diffusion, GAN, or flow matching, require explicit alignment information and phoneme-level durations, which complicates the pipeline and yields less diverse speech
- Recently, masked generative transformers have shown strong generation performance in place of conventional AR and NAR models
  1. A masked generative transformer is trained with the mask-and-predict paradigm and uses iterative parallel decoding at inference
  2. In particular, SoundStorm adopts a masked generative transformer to predict the multi-layer acoustic tokens extracted by SoundStream, conditioned on speech semantic tokens
    - BUT, it receives the semantic tokens of an AR model as input, so the semantic stage does not use masked generative modeling
    - In addition, speech-text alignment supervision and phone-level duration prediction are still required

-> Therefore, MaskGCT, a fully non-autoregressive model for TTS based on a masked generative transformer, is proposed

MaskGCT
- Adopts a two-stage framework built on the mask-and-predict learning paradigm
  1. In the first stage, the text-to-semantic (T2S) model uses the text token sequence and the prompt speech semantic token sequence as a prefix, without explicit duration prediction
    - It then predicts the masked semantic tokens through in-context learning
  2. In the second stage, the semantic-to-acoustic (S2A) model uses the semantic tokens to predict masked acoustic tokens extracted from an RVQ-based speech codec, given prompt acoustic tokens
  3. At inference, given a text sequence, semantic tokens of various specified lengths can be generated in a few iteration steps
- In addition, VQ-VAE is adopted instead of the conventional $k$-means to quantize speech self-supervised semantic token embeddings
  - This minimizes the information loss of semantic features even with a single codebook

< Overview of MaskGCT >

A fully non-autoregressive TTS model based on a masked generative transformer
As a result, it achieves better synthesis performance than existing models

2. Method

- Background: Non-Autoregressive Masked Generative Transformer

Given a discrete representation sequence $\mathbf{X}$ of some data, let $\mathbf{X}_t = \mathbf{X} \odot \mathbf{M}_t$ denote the process of masking a subset of the tokens in $\mathbf{X}$ with the corresponding binary mask $\mathbf{M}_t = [m_{t,i}]_{i=1}^{N}$
- If $m_{t,i} = 1$, $x_i$ is replaced with the special $\text{[MASK]}$ token; if $m_{t,i} = 0$, $x_i$ is left unmasked
- Each $m_{t,i}$ is independently and identically distributed according to a Bernoulli distribution with parameter $\gamma(t)$
  - $\gamma(t) \in (0, 1]$ : mask schedule function, $\mathbf{X}_0 = \mathbf{X}$
  - e.g.) $\gamma(t) = \sin\left(\frac{\pi t}{2T}\right),\ t \in (0, T]$
- The non-autoregressive masked generative transformer predicts the masked tokens from the unmasked tokens and the condition $\mathbf{C}$
  1. That is, it models $p_\theta(\mathbf{X}_0 | \mathbf{X}_t, \mathbf{C})$
  2. The parameters $\theta$ are then optimized to minimize the negative log-likelihood of the masked tokens:
    (Eq. 1) $\mathcal{L}_{mask} = \mathbb{E}_{\mathbf{X} \in \mathcal{D}, t \in [0, T]} - \sum_{i=1}^{N} m_{t,i} \cdot \log\left(p_\theta(x_i | \mathbf{X}_t, \mathbf{C})\right)$
- At inference, tokens are decoded in parallel through iterative decoding
  1. Start from the fully masked sequence $\mathbf{X}_T$
  2. Let $S$ be the total number of decoding steps; at each step $i$ from $1$ to $S$, $\hat{\mathbf{X}}_0$ is sampled from $p_\theta(\mathbf{X}_0 | \mathbf{X}_{T - (i-1) \cdot \frac{T}{S}}, \mathbf{C})$
  3. Then $\lfloor N \cdot \gamma(T - i \cdot \frac{T}{S}) \rfloor$ tokens are sampled according to the confidence scores and remasked, which gives $\mathbf{X}_{T - i \cdot \frac{T}{S}}$
    - $N$ : total number of tokens in $\mathbf{X}$
  4. The confidence score of $\hat{x}_i$ in $\hat{\mathbf{X}}_0$ is set to $p_\theta(x_i | \mathbf{X}_{T - (i-1) \cdot \frac{T}{S}}, \mathbf{C})$ when $x_{T - (i-1) \cdot \frac{T}{S}, i}$ is a $\text{[MASK]}$ token
    - Otherwise, the confidence score of $\hat{x}_i$ is set to $1$, so that tokens already unmasked in $\mathbf{X}_{T - (i-1) \cdot \frac{T}{S}}$ are not remasked
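
The decoding loop above can be sketched in plain Python. This is a minimal sketch under stated assumptions, not the paper's implementation: `toy_model` is a hypothetical stand-in that returns random probabilities instead of a trained $p_\theta$, the `MASK` id is arbitrary, and the least-confident positions are remasked deterministically instead of being sampled by confidence.

```python
import math
import random

MASK = -1  # hypothetical id for the [MASK] token


def gamma(t, T=1.0):
    """Sine mask schedule: gamma(t) = sin(pi * t / (2 * T))."""
    return math.sin(math.pi * t / (2 * T))


def toy_model(tokens, vocab_size, rng):
    """Stand-in for p_theta(X_0 | X_t, C): one (token, probability) guess per position."""
    preds = []
    for _ in tokens:
        probs = [rng.random() for _ in range(vocab_size)]
        total = sum(probs)
        probs = [p / total for p in probs]
        tok = rng.choices(range(vocab_size), weights=probs)[0]
        preds.append((tok, probs[tok]))
    return preds


def iterative_decode(n_tokens, steps, vocab_size=8, T=1.0, seed=0):
    rng = random.Random(seed)
    x = [MASK] * n_tokens  # X_T: fully masked sequence
    history = []           # number of masked positions left after each step
    for i in range(1, steps + 1):
        preds = toy_model(x, vocab_size, rng)
        # Confidence: model probability at masked positions, 1 at already-unmasked ones.
        conf = [p if x[k] == MASK else 1.0 for k, (_, p) in enumerate(preds)]
        x_hat = [tok if x[k] == MASK else x[k] for k, (tok, _) in enumerate(preds)]
        # Number of tokens to remask: floor(N * gamma(T - i * T / S)).
        n_remask = math.floor(n_tokens * gamma(T - i * T / steps, T))
        # Simplification: remask the n_remask least-confident positions.
        order = sorted(range(n_tokens), key=lambda k: conf[k])
        remask = set(order[:n_remask])
        x = [MASK if k in remask else x_hat[k] for k in range(n_tokens)]
        history.append(sum(1 for tok in x if tok == MASK))
    return x, history
```

With $N = 10$ and $S = 4$, `iterative_decode(10, 4)` leaves 9, 7, 3, and 0 masked positions after the four steps, so only one token is committed in the first step and the sequence is fully unmasked at the last step because $\gamma(0) = 0$.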

- Model Overview

MaskGCT consists of a two-stage framework
- In general, the first stage uses text to predict speech semantic representation tokens, which contain content information and partial prosody information
  - The second stage is trained to learn richer acoustic information
- Existing models such as SpearTTS and VALL-E use an autoregressive model in the first stage
- BUT, MaskGCT applies non-autoregressive masked generative modeling to both stages, without text-speech alignment supervision or phone-level duration prediction
  1. The first-stage model is trained to learn $p_{\theta_{s1}}(\mathbf{S} | \mathbf{S}_t, (\mathbf{S}^p, \mathbf{P}))$, conditioned on $\mathbf{S}^p, \mathbf{P}$
    - $\mathbf{S}$ : speech semantic representation token sequence obtained from the semantic codec
    - $\mathbf{S}^p$ : prompt semantic token sequence
    - $\mathbf{P}$ : text token sequence
  2. The second-stage model is trained to learn $p_{\theta_{s2}}(\mathbf{A} | \mathbf{A}_t, (\mathbf{A}^p, \mathbf{S}))$
    - $\mathbf{A}$ : multi-layer acoustic token sequence from a speech acoustic codec such as DAC or SoundStream
    - Structurally, it is similar to SoundStorm

- Speech Semantic Representation Codec

Discrete speech representations can be divided into semantic tokens and acoustic tokens
- Semantic tokens are generally obtained by discretizing speech Self-Supervised Learning (SSL) features
- Existing large TTS systems typically predict semantic tokens from text and then use another model to predict acoustic tokens/features
  1. Because semantic tokens are highly correlated with text/phonemes, predicting them is more straightforward than predicting acoustic tokens directly
  2. BUT, the $k$-means-based semantic discretization used so far can cause information loss
    - This in turn hinders high-quality speech reconstruction and precise acoustic token prediction
- MaskGCT therefore aims to discretize the semantic representation while minimizing information loss
  - To this end, following RepCodec, it uses a VQ-VAE to learn a vector quantization codebook that reconstructs the speech semantic representations of a speech SSL model
- For a speech semantic representation sequence $\mathbf{S} \in \mathbb{R}^{T \times d}$, the vector quantizer quantizes the encoder output $\mathcal{E}(\mathbf{S})$ into $\mathbf{E}$, and the decoder reconstructs $\mathbf{E}$ back into $\hat{\mathbf{S}}$
  - The reconstruction loss between $\mathbf{S}$ and $\hat{\mathbf{S}}$ optimizes the encoder and decoder, the codebook loss optimizes the codebook, and the commitment loss optimizes the encoder via the straight-through method
- The total loss for training the semantic representation codec is then:
  (Eq. 2) $\mathcal{L}_{total} = \frac{1}{Td}\left(\lambda_{rec} \cdot ||\mathbf{S} - \hat{\mathbf{S}}||_1 + \lambda_{codebook} \cdot ||\text{sg}(\mathcal{E}(\mathbf{S})) - \mathbf{E}||_2 + \lambda_{commit} \cdot ||\text{sg}(\mathbf{E}) - \mathcal{E}(\mathbf{S})||_2\right)$
  - $\text{sg}$ : stop-gradient
- Structurally, the hidden states of the 17th layer of w2v-BERT 2.0 are used as the semantic features for the speech encoder
  1. The encoder and decoder each consist of multiple ConvNeXt blocks
  2. In addition, following DAC, factorized codes are used to project the encoder output into a low-dimensional latent variable space
    - The codebook contains 8192 entries, each of dimension 8
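
To make Eq. 2 concrete, the sketch below evaluates its forward value for a toy codebook in plain Python. It is only an illustration under stated assumptions: the $\lambda$ defaults are hypothetical (the post does not give them), the norms are taken literally as written (L1 and Euclidean), and no gradients are computed, so the stop-gradient only appears in the comments. Numerically the codebook and commitment terms are identical; $\text{sg}$ only decides which side would receive the gradient.

```python
def nearest_code(vec, codebook):
    """Index of the codebook entry closest to vec (squared Euclidean distance)."""
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda k: sqdist(vec, codebook[k]))


def l1(a, b):
    """L1 norm of the difference between two frame-by-dimension matrices."""
    return sum(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))


def l2(a, b):
    """Euclidean norm of the difference between two frame-by-dimension matrices."""
    return sum((x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb)) ** 0.5


def codec_loss(S, S_hat, enc_out, codebook,
               lam_rec=1.0, lam_codebook=1.0, lam_commit=1.0):  # hypothetical weights
    T, d = len(S), len(S[0])
    ids = [nearest_code(v, codebook) for v in enc_out]  # discrete semantic tokens
    E = [codebook[k] for k in ids]                      # quantized vectors
    rec = l1(S, S_hat)
    codebook_term = l2(enc_out, E)  # ||sg(E(S)) - E||: gradient would reach only the codebook
    commit_term = l2(E, enc_out)    # ||sg(E) - E(S)||: gradient would reach only the encoder
    total = (lam_rec * rec + lam_codebook * codebook_term + lam_commit * commit_term) / (T * d)
    return ids, total
```

In this sketch `ids` plays the role of the semantic token sequence: each frame is replaced by the index of its nearest codebook entry, which is what the T2S model later has to predict.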

- Text-to-Semantic Model

MaskGCT trains the Text-to-Semantic (T2S) model with a non-autoregressive masked generative transformer, without an autoregressive model or text-to-speech alignment information
- During training, a prefix of the semantic token sequence is randomly extracted and denoted $\mathbf{S}^p$
  1. The text token sequence $\mathbf{P}$ is then concatenated with $\mathbf{S}^p$ to form the condition
  2. $(\mathbf{P}, \mathbf{S}^p)$ is simply prepended as a prefix sequence to the input masked semantic token sequence $\mathbf{S}_t$, which exploits the in-context learning ability of language models
- Structurally, a Llama-style transformer with GELU activation, rotary position encoding, etc. is used as the model backbone, with causal attention replaced by bidirectional attention
  - In addition, adaptive RMSNorm conditioned on the timestep $t$ is used
- At inference, a target semantic token sequence of a specified length is generated conditioned on the text and the prompt semantic token sequence
  - To predict the total duration from the text and the prompt speech duration, a flow matching-based duration prediction model that leverages in-context learning is introduced
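
The training-time input construction described above can be sketched as follows. This is a hedged illustration, not the authors' code: `build_t2s_example`, the `MASK` id, and the uniform choice of prompt length are assumptions, and text and semantic ids share one list here only for brevity (a real model would embed them separately).

```python
import math
import random

MASK = -1  # hypothetical id for the [MASK] token


def build_t2s_example(text_tokens, semantic_tokens, rng, T=1.0):
    """Build one T2S training input: prefix (P, S^p) followed by the masked target S_t."""
    # A random prefix of the semantic sequence acts as the prompt S^p.
    prompt_len = rng.randint(0, len(semantic_tokens) - 1)
    prompt = list(semantic_tokens[:prompt_len])
    target = list(semantic_tokens[prompt_len:])
    # Draw t in (0, T] and mask each target token with probability gamma(t).
    t = T * (1.0 - rng.random())
    ratio = math.sin(math.pi * t / (2 * T))
    mask = [1 if rng.random() < ratio else 0 for _ in target]
    masked_target = [MASK if m else tok for tok, m in zip(target, mask)]
    model_input = list(text_tokens) + prompt + masked_target
    # The loss of Eq. 1 is taken only over masked target positions.
    loss_mask = [0] * (len(text_tokens) + len(prompt)) + mask
    return model_input, loss_mask, target
```

Because attention is bidirectional, every masked position can attend to the whole prefix, which is how the prompt is used for in-context learning without any alignment between $\mathbf{P}$ and $\mathbf{S}$.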

- Semantic-to-Acoustic Model

The Semantic-to-Acoustic (S2A) model is trained as a masked generative codec transformer conditioned on semantic tokens
- The S2A model is based on SoundStorm, which generates multi-layer acoustic token sequences
- Given the $N$ layers of the acoustic token sequence $\mathbf{A}^{1:N}$, one layer $j$ between $1$ and $N$ is selected during training
  1. Let $\mathbf{A}^j$ denote the $j$-th layer of the acoustic token sequence; masking $\mathbf{A}^j$ at timestep $t$ gives $\mathbf{A}^j_t$
  2. The model is then trained to predict $\mathbf{A}^j$ conditioned on the prompt $\mathbf{A}^p$, the semantic token sequence $\mathbf{S}$, and all acoustic token layers below $j$
    - That is, $p_{\theta_{s2a}}(\mathbf{A}^j | \mathbf{A}^j_t, (\mathbf{A}^p, \mathbf{S}, \mathbf{A}^{1:j-1}))$
  3. $j$ is sampled according to the linear schedule $p(j) = 1 - \frac{2j}{N(N+1)}$
- For the S2A model input, the number of frames in the semantic token sequence equals the sum of the frames in the prompt acoustic sequence and the target acoustic sequence
  - Therefore, the semantic token embeddings and the acoustic token embeddings of layers $1$ to $j$ are summed and used as the input
- At inference, the tokens of each layer are generated coarse-to-fine, using iterative parallel decoding within each layer
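
The layer schedule and the summed input can be illustrated with a short sketch. One caveat to read as an assumption: as written, the values $p(j) = 1 - \frac{2j}{N(N+1)}$ sum to $N - 1$ over $j = 1, \dots, N$, so this sketch treats them as unnormalized weights and divides by their sum. The function names are hypothetical, and embeddings are plain nested lists.

```python
import random


def layer_probs(n_layers):
    """Normalized weights from p(j) = 1 - 2j / (N(N+1)); requires n_layers >= 2."""
    raw = [1 - 2 * j / (n_layers * (n_layers + 1)) for j in range(1, n_layers + 1)]
    total = sum(raw)  # equals n_layers - 1
    return [w / total for w in raw]


def sample_layer(n_layers, rng):
    """Pick the acoustic layer j in 1..N to mask for this training step."""
    return rng.choices(range(1, n_layers + 1), weights=layer_probs(n_layers))[0]


def sum_embeddings(semantic, acoustic_layers):
    """Frame-wise sum of the semantic embedding and the given acoustic-layer embeddings."""
    out = [list(frame) for frame in semantic]
    for layer in acoustic_layers:
        for f, frame in enumerate(layer):
            for k, v in enumerate(frame):
                out[f][k] += v
    return out
```

For $N = 12$ the weights decrease linearly from about 0.090 for layer 1 to about 0.077 for layer 12, so the coarse layers are chosen slightly more often than the fine ones. The frame-wise sum is possible only because the semantic and acoustic sequences have the same number of frames.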

(Left) Text-to-Semantic (Right) Semantic-to-Acoustic

- Speech Acoustic Codec

The speech acoustic codec is trained to quantize a speech waveform into multi-layer discrete tokens while preserving the speech information
- MaskGCT uses Residual Vector Quantization (RVQ) to compress speech waveforms at a 24 kHz sampling rate into 12 layers of discrete tokens
  - Each layer has a codebook size of 1024 and a codebook dimension of 8
- The model architecture and training losses follow DAC, except that the Vocos architecture is used as the decoder for efficient inference
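
As a rough illustration of how RVQ produces one token per layer, the sketch below quantizes a single latent vector with a stack of toy codebooks. It is a generic RVQ sketch, not the MaskGCT codec: the function names are hypothetical and the codebooks are tiny, whereas the configuration above uses 12 codebooks of 1024 entries with dimension 8.

```python
def nearest(vec, codebook):
    """Index of the codebook entry closest to vec (squared Euclidean distance)."""
    return min(range(len(codebook)),
               key=lambda k: sum((a - b) ** 2 for a, b in zip(vec, codebook[k])))


def rvq_encode(vec, codebooks):
    """Quantize vec layer by layer; each layer encodes the residual left by the previous ones."""
    residual = list(vec)
    codes = []
    for cb in codebooks:
        k = nearest(residual, cb)
        codes.append(k)
        residual = [r - c for r, c in zip(residual, cb[k])]
    return codes, residual


def rvq_decode(codes, codebooks):
    """Reconstruct the vector as the sum of the selected entries from every layer."""
    out = [0.0] * len(codebooks[0][0])
    for k, cb in zip(codes, codebooks):
        out = [o + c for o, c in zip(out, cb[k])]
    return out
```

Each layer refines what the previous layers missed, which is why the S2A model generates layers coarse-to-fine. With 12 layers and 1024 entries per codebook, one frame costs $12 \times \log_2 1024 = 120$ bits.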

- Other Applications

MaskGCT can be applied to duration-controllable speech translation (cross-lingual dubbing), emotion control, speech content editing, zero-shot TTS, and more

3. Experiments

- Settings

Dataset : Emilia, LibriSpeech, SeedTTS
Comparisons : NaturalSpeech, VALL-E, VoiceBox, VoiceCraft, XTTS, CosyVoice

- Results

Zero-Shot TTS
- Overall, MaskGCT achieves the best performance

Autoregressive vs. Masked Generative Models
- MaskGCT is more robust than the model composed of AR + SoundStorm

Duration Length Analysis
- The lowest WER is obtained when the duration multiplier is $1.0$

Speech Style Imitation
- MaskGCT can effectively reflect speech styles such as accent and emotion

Choice of Semantic Representation Codec
- Comparing VQ-based semantic tokens with $k$-means-based semantic tokens, VQ-based tokens further improve speech reconstruction performance

Ablation Study
- The Base model shows only a small performance drop compared to the Large model
