[Paper Review] Metis: A Foundation Speech Generation Model with Masked Generative Pre-training

feVeRin 2025. 11. 17. 13:06

Metis: A Foundation Speech Generation Model with Masked Generative Pre-training

Masked generative modeling can be used to build a speech foundation model that is fine-tuned to a variety of speech generation tasks
Metis
- Uses two discrete speech representations: Self-Supervised Learning (SSL) tokens and acoustic tokens
- Performs masked generative pre-training on 300K hours of speech data without any additional condition
Paper (NeurIPS 2025) : Paper Link

1. Introduction

Foundation models built on large-scale self-supervised pre-training are used across many downstream tasks
- In particular, a unified speech model can serve speech generation tasks such as Text-to-Speech (TTS), Voice Conversion, and Speech Enhancement
- BUT, autoregressive language models such as UniAudio and SpeechX need a large amount of paired training data for each task
  - In addition, the autoregressive approach is inefficient and suboptimal

-> The paper therefore proposes Metis, a unified speech generation framework that leverages large-scale unlabeled data

Metis
- Introduces a 2-stage process that first generates Self-Supervised Learning (SSL) tokens from a task-specific condition, then generates the acoustic representation from those SSL tokens
- Uses a generative pre-training mechanism so that the pre-trained model generalizes effectively to diverse downstream tasks, which simplifies adaptation

< Overview of Metis >

A speech foundation model built on generative pre-training
As a result, it outperforms prior methods on a variety of downstream speech generation tasks

2. Method

- Background: Masked Generative Models

Consider a discrete sequence $\mathbf{x}=[y_{1},y_{2},...,y_{n}]$ of length $n$
- Let $\mathbf{x}_{t}=\mathbf{x}\odot \mathbf{m}_{t}$ denote masking a subset of the tokens of $\mathbf{x}$ with a binary mask $\mathbf{m}_{t}=[m_{t,1},m_{t,2},...,m_{t,n}]$
  1. The operation replaces $y_{i}$ with a special $\texttt{[MASK]}$ token if $m_{t,i}=1$ and leaves $y_{i}$ unmasked if $m_{t,i}=0$
  2. Each $m_{t,i}$ is drawn independently and identically from a Bernoulli distribution with parameter $\gamma(t)$
    - $\gamma(t)\in (0,1]$ : mask schedule function
- Writing $\mathbf{x}=\mathbf{x}_{0}$, a masked generative model predicts the complete sequence (i.e., the masked tokens) from the observed (unmasked) tokens and a condition $\mathbf{c}$
- This is modeled as $p_{\theta}(\mathbf{x}_{0}|\mathbf{x}_{t},\mathbf{c})$, and the model parameters $\theta$ are trained to optimize the marginal cross-entropy over the unobserved tokens:
  (Eq. 1) $\mathcal{L}_{mask}=-\mathbb{E}_{\mathbf{x},t,\mathbf{m}_{t}}\sum_{i=1}^{n}m_{t,i}\cdot \log p_{\theta}(y_{i}|\mathbf{x}_{t},\mathbf{c})$
  - In the unconditional pre-training stage, $\mathbf{c}$ may be empty
- At inference, the masked generative model starts from a fully masked sequence $\mathbf{x}_{T}$ and generates tokens in parallel through iterative decoding
  1. With $S$ total decoding steps, for each step $j$ from $1$ to $S$, $\hat{\mathbf{x}}_{0}$ is sampled from $p_{\theta}(\mathbf{x}_{0}|\mathbf{x}_{T-(j-1)\cdot \frac{T}{S}}, \mathbf{c})$
  2. Then $\left\lfloor n\cdot \gamma \left(T-j\cdot \frac{T}{S}\right)\right\rfloor$ tokens are remasked based on their confidence scores to obtain $\mathbf{x}_{T-j\cdot \frac{T}{S}}$
    - $n$ : sequence length of $\mathbf{x}$
  3. The confidence score of $\hat{y}_{i}$ in $\hat{\mathbf{x}}_{0}$ is set to $p_{\theta}(y_{i}|\mathbf{x}_{T-(j-1)\cdot \frac{T}{S}}, \mathbf{c})$ if $y_{T-(j-1)\cdot \frac{T}{S},i}$ is a $\texttt{[MASK]}$ token
    - Otherwise, the confidence score of $\hat{y}_{i}$ is set to $1$ so that tokens already unmasked in $\mathbf{x}_{T-(j-1)\cdot\frac{T}{S}}$ are not remasked
  4. Specifically, the paper masks the $\left\lfloor n\cdot \gamma \left(T-j\cdot\frac{T}{S}\right)\right\rfloor$ tokens with the lowest confidence scores
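
The iterative decoding loop described above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: `toy_model` is a hypothetical stand-in that returns random distributions in place of $p_{\theta}$, the sine form of `gamma` is an assumed schedule (the notes above only state that $\gamma(t)\in(0,1]$), and `MASK = -1` is an arbitrary id.

```python
import math
import random

MASK = -1  # hypothetical id for the [MASK] token


def gamma(t, T=1.0):
    """Assumed sine mask schedule; gamma(T) = 1, and gamma(0) = 0 ends decoding."""
    return math.sin(math.pi * t / (2.0 * T))


def toy_model(x_t, vocab_size, rng):
    """Hypothetical stand-in for p_theta(x_0 | x_t, c): a random distribution per position."""
    probs = []
    for _ in x_t:
        w = [rng.random() + 1e-6 for _ in range(vocab_size)]
        z = sum(w)
        probs.append([v / z for v in w])
    return probs


def iterative_decode(n, vocab_size, steps, seed=0, T=1.0):
    rng = random.Random(seed)
    x = [MASK] * n  # x_T: fully masked sequence
    remask_counts = []
    for j in range(1, steps + 1):
        probs = toy_model(x, vocab_size, rng)
        x_hat, conf = [], []
        for i, tok in enumerate(x):
            if tok == MASK:
                # Sample a token for a masked position; its probability is the confidence.
                y = rng.choices(range(vocab_size), weights=probs[i])[0]
                x_hat.append(y)
                conf.append(probs[i][y])
            else:
                # Already-unmasked tokens get confidence 1 so they are never remasked.
                x_hat.append(tok)
                conf.append(1.0)
        # Remask the floor(n * gamma(T - j*T/S)) lowest-confidence tokens.
        k = math.floor(n * gamma(T - j * T / steps, T))
        lowest = sorted(range(n), key=lambda i: conf[i])[:k]
        x = list(x_hat)
        for i in lowest:
            x[i] = MASK
        remask_counts.append(k)
    return x, remask_counts


tokens, counts = iterative_decode(n=16, vocab_size=8, steps=4)
print(counts)  # number of tokens remasked after each step
```

The remask count shrinks at every step and reaches zero at step $S$, so the final sequence contains no $\texttt{[MASK]}$ tokens.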

- Discrete Representations for Two-Stage Generation

Metis uses two discrete speech representations for 2-stage generation
- SSL Token
  1. SSL features derived from large-scale speech SSL models such as HuBERT and W2V-BERT encapsulate both semantic and prosodic information, which makes them useful for conditional generation
  2. To minimize information loss, the paper quantizes the SSL features into discrete tokens with a Vector Quantization (VQ) model
- Acoustic Token
  1. Acoustic tokens are obtained by applying VQ to the waveform
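
As a rough illustration of what quantizing features into discrete tokens means, the lookup step of VQ can be sketched as a nearest-codebook search. The codebook and feature values below are toy numbers chosen for the example; the paper's VQ models are learned, and their architecture is not covered by this sketch.

```python
def quantize(features, codebook):
    """Map each feature vector to the index of its nearest codebook entry (squared L2)."""
    tokens = []
    for f in features:
        dists = [sum((a - b) ** 2 for a, b in zip(f, code)) for code in codebook]
        tokens.append(dists.index(min(dists)))
    return tokens


# Toy 2-dimensional codebook with three entries (illustrative values only).
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]]
# Toy continuous features, one vector per frame.
features = [[0.1, -0.2], [0.9, 1.2], [-0.8, 1.7], [0.7, 0.6]]

tokens = quantize(features, codebook)
print(tokens)  # [0, 1, 2, 1]
```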

- Masked Generative Pre-training with SSL Tokens

Speech generation tasks generalize to two stages: Condition-to-SSL token and SSL token-to-Acoustic token
- For pre-training, the paper applies an unconditional masked generative model to the SSL tokens
- The SSL token sequence $\mathbf{x}^{ssl}$ is randomly masked and the model predicts the masked tokens
  1. To improve the model's in-context learning, a prompt sequence is introduced with probability $p$
  2. With that probability, a prefix $\mathbf{x}^{ssl}_{prompt}$ of the SSL token sequence is used as the prompt and left unmasked
    - This lets the model exploit prompt information and adapt better to downstream tasks such as zero-shot TTS and target speaker extraction
  3. The resulting pre-training objective is $p_{\theta}(\mathbf{x}_{0}^{ssl}|\mathbf{x}_{t}^{ssl},\mathbf{x}_{prompt}^{ssl})$
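
A minimal sketch of how one pre-training example might be built under this scheme is given below. Several details are assumptions made for illustration: the prompt length is drawn uniformly, `mask_ratio` plays the role of $\gamma(t)$ from the Background section, and `MASK = -1` is an arbitrary id. `masked_cross_entropy` evaluates the per-sample sum inside Eq. 1 given log-probabilities from some model.

```python
import math
import random

MASK = -1  # hypothetical id for the [MASK] token


def mask_with_prompt(x_ssl, mask_ratio, prompt_prob, rng):
    """Return (x_t, m, prompt_len); requires len(x_ssl) >= 2."""
    n = len(x_ssl)
    prompt_len = 0
    if rng.random() < prompt_prob:
        # Assumption: prefix length is uniform in [1, n - 1].
        prompt_len = rng.randint(1, n - 1)
    m = [0] * n  # prompt positions always stay unmasked
    for i in range(prompt_len, n):
        m[i] = 1 if rng.random() < mask_ratio else 0  # Bernoulli(mask_ratio)
    x_t = [MASK if m[i] else x_ssl[i] for i in range(n)]
    return x_t, m, prompt_len


def masked_cross_entropy(log_probs, targets, m):
    """Sum of -log p(y_i | x_t) over masked positions only (inner sum of Eq. 1)."""
    return -sum(m[i] * log_probs[i][targets[i]] for i in range(len(targets)))


rng = random.Random(0)
x_ssl = list(range(10, 30))  # toy SSL token ids
x_t, m, prompt_len = mask_with_prompt(x_ssl, mask_ratio=0.5, prompt_prob=1.0, rng=rng)

# With a uniform predictor over V tokens, the loss is (#masked) * log(V).
V = 32
uniform = [[-math.log(V)] * V for _ in x_ssl]
loss = masked_cross_entropy(uniform, x_ssl, m)
```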

- Efficient Adaptation to Various Generation Tasks

To apply the pre-trained model to various speech generation tasks, the task conditions are first divided into non-frame-level and frame-level conditions
- Non-frame-level: as in TTS, the model has to implicitly learn the alignment between the condition (text, phoneme sequence) and the SSL token sequence
  - Frame-level: as in Voice Conversion, the target SSL token sequence can be aligned with the condition (source speech) at the frame level
- During fine-tuning, a non-frame-level condition is concatenated with the input sequence along the time dimension
  - A frame-level condition is interpolated to align with the time dimension of the input sequence, passed through an MLP-based adapter, and then added to the input
- The fine-tuned model is then trained to learn $p_{\theta}(\mathbf{x}_{0}^{ssl}|\mathbf{x}_{t}^{ssl},\mathbf{x}_{prompt}^{ssl}, \mathbf{c})$ for the task-specific condition $\mathbf{c}$
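
The two ways of injecting a condition can be sketched on plain `[time][dim]` lists as below. This is only an illustration under assumptions: linear interpolation is assumed (the notes only say "interpolation"), placing the condition before the input when concatenating is an assumed ordering, and `adapter` is a hypothetical single ReLU layer standing in for the MLP-based adapter.

```python
def concat_condition(cond, x):
    """Non-frame-level: concatenate along the time axis (condition first is an assumption)."""
    return cond + x


def interpolate(cond, target_len):
    """Linearly resample a [time][dim] sequence to target_len frames."""
    src_len = len(cond)
    if target_len == 1 or src_len == 1:
        return [list(cond[0]) for _ in range(target_len)]
    out = []
    for i in range(target_len):
        pos = i * (src_len - 1) / (target_len - 1)
        lo = int(pos)
        hi = min(lo + 1, src_len - 1)
        w = pos - lo
        out.append([(1 - w) * a + w * b for a, b in zip(cond[lo], cond[hi])])
    return out


def adapter(frame, weight, bias):
    """Hypothetical one-layer stand-in for the MLP adapter: ReLU(W f + b)."""
    return [max(0.0, sum(w * f for w, f in zip(row, frame)) + b)
            for row, b in zip(weight, bias)]


def add_frame_condition(cond, x, weight, bias):
    """Frame-level: interpolate to len(x), pass through the adapter, add to the input."""
    aligned = interpolate(cond, len(x))
    return [[xi + ai for xi, ai in zip(x_f, adapter(c_f, weight, bias))]
            for x_f, c_f in zip(x, aligned)]


x = [[1.0], [1.0], [1.0]]        # toy input embeddings, 3 frames
text_cond = [[0.5], [0.5]]       # toy non-frame-level condition, 2 steps
speech_cond = [[0.0], [2.0]]     # toy frame-level condition, 2 frames

print(len(concat_condition(text_cond, x)))                   # 5
print(add_frame_condition(speech_cond, x, [[1.0]], [0.0]))   # [[1.0], [2.0], [3.0]]
```

Concatenation leaves alignment to the model's attention, while the additive path relies on the condition already being frame-synchronous with the target.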

- Masked Generative Acoustic Decoder

The paper trains an SSL-to-acoustic model based on masked generative modeling and uses it as a unified acoustic decoder for speech generation tasks
- The model recovers the masked tokens in the masked acoustic token sequence $\mathbf{x}_{t}^{a}$, conditioned on the SSL tokens $\mathbf{x}^{ssl}$ and the prompt acoustic tokens $\mathbf{x}_{prompt}^{a}$
  - That is, $p_{\theta}(\mathbf{x}_{0}^{a}|\mathbf{x}_{t}^{a},\mathbf{x}_{prompt}^{a}, \mathbf{x}^{ssl})$
- During training, one layer of the multi-layer acoustic tokens is randomly selected for masking, and the lower-layer tokens are left unmasked and used as conditional input to the model
  - At inference, acoustic tokens are generated layer by layer
- In the end, speech is generated by first predicting the SSL tokens and then generating the acoustic tokens
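
The layer-wise scheme can be sketched as follows. This is a hedged illustration rather than the paper's code: the masked layer is assumed to be chosen uniformly, layers above the selected one are assumed to be withheld from the model, and `decode_layer` is a hypothetical callable standing in for the iterative masked decoding of a single layer.

```python
import random

MASK = -1  # hypothetical id for the [MASK] token


def mask_acoustic_layers(tokens, mask_ratio, rng):
    """tokens: [num_layers][time]. Pick one layer to mask; lower layers stay visible."""
    k = rng.randrange(len(tokens))                   # layer selected for masking
    visible = [list(layer) for layer in tokens[:k]]  # unmasked lower layers (condition)
    target_mask = [1 if rng.random() < mask_ratio else 0 for _ in tokens[k]]
    masked_layer = [MASK if m else t for t, m in zip(tokens[k], target_mask)]
    return k, visible, masked_layer, target_mask


def generate_layer_by_layer(num_layers, decode_layer):
    """Inference: generate layer k conditioned on the already generated layers 0..k-1."""
    generated = []
    for k in range(num_layers):
        generated.append(decode_layer(k, generated))
    return generated


rng = random.Random(0)
acoustic = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]  # toy 3-layer tokens
k, visible, masked_layer, target_mask = mask_acoustic_layers(acoustic, 0.5, rng)

# Toy decoder: emits the layer index for each of 4 frames.
layers = generate_layer_by_layer(3, lambda k, lower: [k] * 4)
```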

3. Experiments

- Settings

Dataset : Emilia
Comparisons
- Zero-Shot TTS : VALL-E, VoiceCraft, XTTS, CosyVoice, MaskGCT
- Voice Conversion : HierSpeech++, LM-VC, UniAudio, Vevo
- Target Speaker Extraction : VoiceFilter, WeSep, TSELM
- Speech Enhancement : TF-GridNet, VoiceFixer, SELM, MaskSR

- Results

On Zero-Shot TTS, Metis achieves the best performance

It likewise performs strongly on Voice Conversion

It also achieves strong performance on the Target Speaker Extraction task

It achieves the best performance on Speech Enhancement as well

It also performs well on Lip-to-Speech

Multi-Task Fine-Tuning
- The multi-task model Metis-Omni can be fine-tuned on multiple tasks and used across them
