[Paper Review] MELA-TTS: Joint Transformer-Diffusion Model with Representation Alignment for Speech Synthesis

feVeRin 2026. 3. 27. 11:10

MELA-TTS: Joint Transformer-Diffusion Model with Representation Alignment for Speech Synthesis

A joint Transformer-diffusion framework can be used for end-to-end Text-to-Speech
MELA-TTS
- Autoregressively generates continuous mel-spectrograms from linguistic and speaker conditions
- Introduces a representation alignment module that aligns the Transformer decoder's output representations with the semantic embeddings of a pre-trained ASR encoder
Paper (ICASSP 2026) : Paper Link

1. Introduction

Discrete-token-based autoregressive models such as VALL-E and CosyVoice3 perform well in Text-to-Speech (TTS)
- BUT, discretization causes information loss, and the 2-stage framework adds system complexity
  - End-to-end frameworks such as DiTAR can be considered to address this
- Such end-to-end frameworks still have the following limitations:
  1. Content consistency is worse than that of existing discrete-token-based models
  2. Autoregressive modeling of continuous features needs many training iterations to converge

-> The paper therefore proposes MELA-TTS, which addresses these limitations of end-to-end autoregressive modeling

MELA-TTS
- Combines a Transformer and a diffusion model in a joint framework to autoregressively generate mel-spectrograms
- For content consistency, introduces a Representation Alignment module that aligns intermediate representations with pre-trained Automatic Speech Recognition (ASR) encoder features

< Overview of MELA-TTS >

An end-to-end autoregressive TTS model built on a Diffusion Transformer and a representation alignment module
As a result, it outperforms the existing models it is compared against

2. Method

MELA-TTS consists of an autoregressive Transformer decoder and a diffusion module
- The autoregressive Transformer decoder sequentially generates the continuous vectors $\mathbf{h}$
  1. The diffusion module runs the denoising process on a noisy mel-spectrogram chunk, using the speaker and utterance embeddings as conditional inputs
  2. Once the mel-spectrogram is generated, a neural vocoder converts it into the speech waveform
- In particular, the paper introduces a Representation Alignment module to align the continuous vector $\mathbf{h}$ with the output representation of a pre-trained ASR encoder
  - This module encourages $\mathbf{h}$ to be semantically informative, which improves content consistency
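
The interaction between the two components can be sketched as a chunk-by-chunk generation loop. This is only an illustrative outline: `decoder_step`, `diffusion_sample`, and `should_stop` are hypothetical placeholders for the Transformer decoder, the diffusion sampler, and the stop predictor, and checking the stop decision before generating a chunk is an assumption, not something the paper is known to specify. The vocoder step is omitted.

```python
from typing import Any, Callable, List, Optional


def generate(
    decoder_step: Callable[[List[Any]], Any],
    diffusion_sample: Callable[[Optional[Any], Any, Optional[Any]], Any],
    should_stop: Callable[[Any], bool],
    max_chunks: int = 50,
) -> List[Any]:
    """Generate mel chunks one at a time (all callables are hypothetical)."""
    chunks: List[Any] = []   # mel-spectrogram chunks generated so far
    hiddens: List[Any] = []  # decoder outputs h used so far
    for _ in range(max_chunks):
        # The decoder conditions on the mel history and emits the next h.
        h = decoder_step(chunks)
        # Assumption: stop is checked before the chunk for this h is generated.
        if should_stop(h):
            break
        # The diffusion sampler sees [h_{i-1}, h_i] and the previous chunk.
        prev_h = hiddens[-1] if hiddens else None
        prev_chunk = chunks[-1] if chunks else None
        chunks.append(diffusion_sample(prev_h, h, prev_chunk))
        hiddens.append(h)
    return chunks
```

With toy callables, the loop runs until either the stop predictor fires or `max_chunks` is reached; the cap is a safety measure added for this sketch.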

- Transformer Decoder for Autoregressive Modeling

The Transformer decoder autoregressively generates the continuous vector $\mathbf{h}$, conditioned on the utterance embedding, speaker embedding, tokenized text, and mel-spectrogram history $\mathbf{X}=[x_{1},x_{2},...,x_{L}]$
- During training, the utterance embedding is extracted from a randomly cropped segment of the input speech by a Transformer encoder
  1. The Transformer encoder outputs features that are pooled into an utterance embedding vector, and it is jointly optimized with the Transformer decoder and the diffusion model
    - The speaker embedding is captured from the input speech by a pre-trained speaker encoder
  2. At inference, both the utterance and speaker embeddings are derived from the prompt speech
- The text input is tokenized into BPE tokens with the Qwen2 tokenizer and converted into embeddings with the Qwen2 text embedding layer
  1. The $i$-th mel-spectrogram chunk $\mathbf{X}^{(i)}=[x_{i\times N+1}, ...,x_{(i+1)\times N}]\in\mathbb{R}^{N\times D_{mel}}$ is downsampled and projected to shape $[1,D_{trans}]$ by a strided convolution layer, then fed into the Transformer decoder
    - $N$ : chunk size, $D_{mel}$ : mel-spectrogram dimension, $D_{trans}$ : Transformer decoder dimension
  2. The output $\mathbf{h}$ of the final Transformer decoder layer is used as the condition for the diffusion model
- Unlike discrete-token-based TTS models, which terminate generation by predicting a special End-of-Sequence (EOS) token, the paper uses a stop prediction module to decide when synthesis ends
  1. The stop prediction module is a binary classifier that takes the continuous hidden representation sequence $\mathbf{h}$ as input and outputs a binary decision at each step
    - $0$ : continuation, $1$ : synthesis termination
  2. The module is trained with a Binary Cross-Entropy (BCE) loss $\mathcal{L}_{stop}$
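
The chunking and the stop loss described above can be illustrated with a small pure-Python sketch. The function names are hypothetical, dropping an incomplete final chunk is an assumption, and labeling only the last step as $1$ is also an assumption about how the stop targets are built.

```python
import math
from typing import List, Sequence


def split_into_chunks(frames: Sequence[Sequence[float]], chunk_size: int) -> List[list]:
    """Split L mel frames into chunks X^(i) of N frames each.

    Chunk i covers frames i*N+1 .. (i+1)*N in 1-based indexing.
    Assumption: an incomplete trailing chunk is dropped.
    """
    n_chunks = len(frames) // chunk_size
    return [list(frames[i * chunk_size:(i + 1) * chunk_size]) for i in range(n_chunks)]


def stop_targets(n_steps: int) -> List[float]:
    """Assumed labels: 0 (continue) at every step, 1 (terminate) at the last."""
    if n_steps <= 0:
        return []
    return [0.0] * (n_steps - 1) + [1.0]


def bce_loss(probs: Sequence[float], targets: Sequence[float], eps: float = 1e-7) -> float:
    """Mean binary cross-entropy between predicted stop probabilities and labels."""
    total = 0.0
    for p, y in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))
    return total / len(probs)
```

For example, 7 frames with a chunk size of 3 give 2 chunks under these assumptions, and a predictor that always outputs 0.5 has a loss of $\ln 2$.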

- Diffusion for Mel-Spectrogram Generation

The diffusion module is built on a Diffusion Transformer
- Specifically, it predicts the mel-spectrogram $\mathbf{X}_{0}^{(i)}:=\mathbf{X}^{(i)}$ from $[h_{i-1},h_{i}]$, the speaker embedding $\mathbf{v}$, the utterance embedding $\mathbf{u}$, and the noisy mel-spectrogram chunk prefixed with the previous chunk, $[\mathbf{X}_{0}^{(i-1)},\mathbf{X}_{t}^{(i)}]$:
  (Eq. 1) $\hat{\mathbf{X}}_{0}^{(i)}=\text{DiT}\left(\Psi_{i},\left[\mathbf{X}_{0}^{(i-1)},\mathbf{X}_{t}^{(i)} \right]\right) = \text{DiT}\left(\left[h_{i-1},h_{i}\right],\mathbf{v},\mathbf{u},\left[\mathbf{X}^{(i-1)}_{0}, \mathbf{X}^{(i)}_{t}\right]\right)$
  - $\Psi_{i}=\{[h_{i-1},h_{i}],\mathbf{v},\mathbf{u}\}$ : condition, $\mathbf{X}_{t}^{(i)}=\alpha_{t}\mathbf{X}_{0}^{(i)}+\sigma_{t}\epsilon$ : diffusion forward process, $\epsilon$ : Gaussian noise
  - In addition, following the Variance Preserving (VP) formulation, $\alpha_{t}=\cos \left(\frac{\pi t}{2}\right)$ and $\sigma_{t}=\sin\left(\frac{\pi t}{2}\right)$, so that $\alpha_{t}^{2}+\sigma_{t}^{2}=1$
- The previous continuous vector $h_{i-1}$ and mel-spectrogram chunk $\mathbf{X}_{0}^{(i-1)}$ are given to the diffusion model as prefix context, and the output for the prefix part is discarded
- The loss is the $L2$ distance between the prediction and the ground-truth mel-spectrogram:
  (Eq. 2) $\mathcal{L}_{diff}=\sum_{t}\left(\hat{\mathbf{X}}_{0}^{(i)}-\mathbf{X}_{0}^{(i)}\right)^{2}$
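
A minimal sketch of the forward process and the squared-error term, assuming $t\in[0,1]$ and representing a chunk as a plain list of frames. The function names are illustrative, and `x0_loss` evaluates the squared error for a single $t$ only, whereas Eq. 2 sums over $t$.

```python
import math
import random
from typing import List

Chunk = List[List[float]]  # N frames, each with D_mel values


def vp_schedule(t: float) -> tuple:
    """Return (alpha_t, sigma_t) = (cos(pi*t/2), sin(pi*t/2))."""
    return math.cos(math.pi * t / 2), math.sin(math.pi * t / 2)


def forward_noise(x0: Chunk, t: float, rng: random.Random) -> Chunk:
    """Sample X_t = alpha_t * X_0 + sigma_t * eps with eps ~ N(0, 1)."""
    alpha, sigma = vp_schedule(t)
    return [[alpha * v + sigma * rng.gauss(0.0, 1.0) for v in frame] for frame in x0]


def x0_loss(pred: Chunk, target: Chunk) -> float:
    """Sum of squared differences between predicted and true clean chunk."""
    return sum(
        (p - q) ** 2
        for pred_frame, target_frame in zip(pred, target)
        for p, q in zip(pred_frame, target_frame)
    )
```

At $t=0$ the chunk is left untouched ($\alpha_{0}=1$, $\sigma_{0}=0$), while at $t=1$ the signal coefficient is numerically zero and the chunk is essentially pure noise.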

- Representation Alignment Module

Because an end-to-end model directly predicts mel-spectrograms or continuous representations, it is not explicitly guided to produce semantically enriched intermediate representations
- BUT, a lack of intermediate semantic guidance can lead to poor content consistency and slow convergence during training
- The paper therefore introduces the Representation Alignment module to address this
  1. Specifically, a cosine-similarity loss is added to align the autoregressive Transformer output $\mathbf{h}$ with the pre-trained semantic representation $\mathbf{h}_{asr}$ of the ASR encoder:
    (Eq. 3) $\mathcal{L}_{align}=\text{CosineSimilarity}\left(\text{TAM}(\mathbf{h}),\mathbf{h}_{asr}\right)$
    - $\text{TAM}$ : time alignment module that resolves the temporal resolution mismatch between $\mathbf{h}$ and $\mathbf{h}_{asr}$
  2. The overall training loss is then:
    (Eq. 4) $\mathcal{L}=\mathcal{L}_{diff}+\mathcal{L}_{stop}+\mathcal{L}_{align}$
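
The alignment term can be illustrated roughly as follows. Several parts are assumptions made for this sketch: the TAM is replaced by a simple nearest-neighbor resampling (`nearest_resample`, a hypothetical stand-in; the actual module may well be learned), both sequences are assumed to share the same feature dimension, and the loss is written as one minus the mean cosine similarity so that minimizing it increases similarity.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector, eps: float = 1e-8) -> float:
    """Cosine similarity between two vectors, guarded against zero norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / max(norm_a * norm_b, eps)


def nearest_resample(seq: Sequence[Vector], target_len: int) -> List[Vector]:
    """Hypothetical stand-in for TAM: repeat or skip vectors to reach target_len."""
    src_len = len(seq)
    return [seq[min(src_len - 1, (j * src_len) // target_len)] for j in range(target_len)]


def alignment_loss(h: Sequence[Vector], h_asr: Sequence[Vector]) -> float:
    """Assumed loss form: 1 - mean cosine similarity over ASR time steps."""
    aligned = nearest_resample(h, len(h_asr))
    sims = [cosine_similarity(a, b) for a, b in zip(aligned, h_asr)]
    return 1.0 - sum(sims) / len(sims)
```

Because cosine similarity ignores vector magnitude, this term only constrains the direction of $\mathbf{h}$, which leaves the decoder free to carry additional information in its scale.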

- Streaming Synthesis

For streaming synthesis, text tokens and the continuous conv-downsampled mel-spectrogram are interleaved at an $n:m$ ratio to support incremental speech synthesis
- This allows $m$ downsampled mel-spectrogram vectors to be generated for every $n$ text tokens received
- MELA-TTS is trained simultaneously on interleaved and non-interleaved sequences, so it supports both streaming and non-streaming synthesis
  1. The Turn-of-Speech token marks the end of the text input, while filling tokens only mark positions and are excluded from target prediction and loss calculation
  2. Termination of speech generation is handled by the binary classification module
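
The $n:m$ interleaving can be sketched as below. The exact sequence layout is an assumption: blocks of $n$ text tokens alternate with blocks of $m$ speech units, and once the text is exhausted a Turn-of-Speech marker is appended, followed by the remaining speech units. The marker string `<turn_of_speech>` is a hypothetical name, and filling tokens are left out of this sketch.

```python
from typing import Any, List, Sequence


def interleave(
    text_tokens: Sequence[Any],
    speech_units: Sequence[Any],
    n: int,
    m: int,
    turn_token: Any = "<turn_of_speech>",  # hypothetical marker name
) -> List[Any]:
    """Interleave text tokens and speech units at an n:m ratio (assumed layout)."""
    seq: List[Any] = []
    ti = si = 0
    while ti < len(text_tokens) and si < len(speech_units):
        # Emit the next block of up to n text tokens.
        seq.extend(text_tokens[ti:ti + n])
        ti += n
        if ti >= len(text_tokens):
            break  # text is exhausted; the tail is handled below
        # Emit the next block of up to m speech units.
        seq.extend(speech_units[si:si + m])
        si += m
    # Any text left over (speech ran out first) still precedes the marker.
    seq.extend(text_tokens[ti:])
    seq.append(turn_token)
    # All remaining speech units follow the end-of-text marker.
    seq.extend(speech_units[si:])
    return seq
```

With 5 text tokens, 8 speech units, and a 2:3 ratio, this layout yields two full text/speech blocks, the last text token, the marker, and then the final 2 speech units.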

3. Experiments

- Settings

Dataset : LibriTTS
Comparisons : F5-TTS, MaskGCT, DiTAR, CosyVoice, CosyVoice2, CosyVoice3, Seed-TTS

- Results

Overall, MELA-TTS achieves the best performance among the compared models

MELA-TTS is also preferred in the A/B test

Ablation Study
- Each component contributes to the performance improvement

Using representation alignment yields a better WER
