[Paper Review] DiTAR: Diffusion Transformer Autoregressive Modeling for Speech Generation

feVeRin 2025. 6. 29. 09:05

DiTAR: Diffusion Transformer Autoregressive Modeling for Speech Generation

Combining diffusion models with autoregressive models leads to a heavy computational load and suboptimal outcomes
DiTAR
- Introduces a divide-and-conquer strategy for patch generation
- A language model processes aggregated patch embeddings, and a diffusion Transformer then generates the next patch
- At inference, the time point at which noise is introduced during the reverse diffusion ODE is defined as the temperature, which balances diversity and determinism
Paper (ICML 2025) : Paper Link

1. Introduction

In autoregressive speech Language Models (LMs), discrete tokenization struggles to reconstruct complex modalities with high fidelity because of its bitrate limitation
- In particular, for zero-shot Text-to-Speech (TTS), existing methods such as CosyVoice rely on a coarse-to-fine pipeline
  - BUT, this cascaded design hinders the scalability of the LM because of error accumulation
- Alternatively, an LM can first generate lossy discrete tokens, followed by token-based diffusion for detail enrichment
  1. BUT, although diffusion models are effective at modeling continuous representations, they demand substantial computation
  2. Having the LM predict the final features directly would simplify the process

-> The paper therefore proposes DiTAR, which seamlessly combines a diffusion Transformer with a language model

DiTAR
- Introduces a divide-and-conquer strategy that breaks the continuous tokens into multiple patches
  - The LM handles inter-patch prediction, and the diffusion Transformer handles intra-patch prediction
- Additionally, LocDiT, a Diffusion Transformer (DiT) with bidirectional attention, predicts localized patches, and LM guidance is applied to improve LocDiT's generative capability
- At inference, temperature-based sampling balances diversity and determinism

< Overview of DiTAR >

A zero-shot TTS language model built on divide-and-conquer patchification and DiT
As a result, it outperforms existing methods

2. Method

DiTAR is a patch-based autoregressive model over continuous representations that combines a causal-attention AR model with a bidirectional-attention Transformer diffusion model

- Overview

Formulation
- DiTAR is an autoregressive model based on next-token prediction
- Given a continuous token sequence $x=(x_{1},x_{2},...,x_{N})$, the joint distribution of the sequence factorizes by the chain rule:
  (Eq. 1) $p_{\theta}(x_{1},x_{2},...,x_{N})=\prod_{i=1}^{N}p_{\theta}(x_{i}|x_{1},x_{2},...,x_{i-1})$
  - $\theta$ : parameters of the AR generative model
- Because adjacent continuous tokens are highly similar, bidirectional dependencies can be assumed to exist within a local region
  - Based on this, the paper aggregates local $x_{i}$ into patches of size $P$ and uses bidirectional attention to model the tokens within each patch
- The model then splits into two parts, $\theta_{a}$ and $\theta_{b}$:
  1. $\theta_{a}$ is the autoregressive model responsible for long-context learning via $p_{\theta_{a}}(h_{i}|x_{1},x_{2},...,x_{i})$
  2. $\theta_{b}$ is the bidirectional-attention diffusion Transformer that performs next-patch prediction via $p_{\theta_{b}}(x_{i+1},...,x_{i+P}|h_{i})$
    - $h_{i}$ : the language model output, which serves as the diffusion condition
- Treating zero-shot TTS as a conditional continuation task for the AR model makes DiTAR directly applicable
  - That is, the target text and the prompting speech are concatenated and passed to the model as prefix context, and the target speech is then generated autoregressively from that context
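The patch grouping described above can be sketched as follows. This is a minimal illustration on toy scalar tokens, not the paper's code; the helper name `patchify` and the requirement that the length be a multiple of $P$ are assumptions made for the example.

```python
from typing import List, Sequence


def patchify(tokens: Sequence[float], patch_size: int) -> List[List[float]]:
    """Split a token sequence into consecutive, non-overlapping patches.

    Assumption: the sequence length is a multiple of patch_size; a real
    system would pad the last patch instead of raising.
    """
    if patch_size <= 0:
        raise ValueError("patch_size must be positive")
    if len(tokens) % patch_size != 0:
        raise ValueError("sequence length must be a multiple of patch_size")
    return [list(tokens[i:i + patch_size])
            for i in range(0, len(tokens), patch_size)]


tokens = [0.1 * n for n in range(8)]       # N = 8 toy continuous tokens
patches = patchify(tokens, patch_size=4)   # P = 4
num_ar_steps = len(patches)                # the AR model emits one patch per step
```

With $N=8$ and $P=4$ the autoregressive part takes 2 steps instead of 8, and the tokens inside each patch are left to the bidirectional decoder.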
Overall Architecture
- Combining a causal-attention autoregressive model with a diffusion loss performs worse than full attention
- To address this, the paper introduces a divide-and-conquer strategy that divides the long sequence of continuous tokens into multiple patches
  1. The language model handles inter-patch prediction, and the diffusion Transformer handles intra-patch prediction
  2. Because the DiTAR backbone is a causal-attention Transformer for next-token prediction, each patch of continuous tokens is processed into a single vector by an aggregation encoder
    - The vector is then fed to the AR model to produce the output embedding $h_{t}$, and $h_{t}$ serves as the condition of the diffusion decoder, LocDiT
  3. Additionally, the paper trains with a diffusion loss on the output continuous tokens
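The data flow above (patch -> aggregation encoder -> causal LM -> LocDiT -> next patch) can be sketched with toy stand-ins. All three functions below are hypothetical placeholders chosen so that the loop runs; none of them is a Transformer, and they only illustrate which component sees what.

```python
from typing import List

Patch = List[float]


def aggregate(patch: Patch) -> float:
    """Stand-in for the aggregation encoder: one patch -> one embedding."""
    return sum(patch) / len(patch)


def causal_lm(embeddings: List[float]) -> float:
    """Stand-in for the causal LM: the output depends only on past embeddings."""
    return sum(embeddings) / len(embeddings)


def decode_patch(h: float, patch_size: int) -> Patch:
    """Stand-in for LocDiT: expand the condition h into a patch of tokens."""
    return [h] * patch_size


def generate(prefix: List[Patch], num_new_patches: int) -> List[Patch]:
    """Autoregressively append patches: aggregate -> LM -> decode, repeated."""
    patch_size = len(prefix[0])
    patches = [list(p) for p in prefix]
    for _ in range(num_new_patches):
        embeddings = [aggregate(p) for p in patches]  # one vector per patch
        h = causal_lm(embeddings)                     # inter-patch prediction
        patches.append(decode_patch(h, patch_size))   # intra-patch generation
    return patches


out = generate(prefix=[[1.0, 1.0], [3.0, 3.0]], num_new_patches=2)
```

The point of the sketch is the interface: the LM only ever sees one embedding per patch, and the decoder only receives the LM output as its condition.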

- LocDiT: Next-Patch Bidirectional Modeling

A Diffusion Transformer (DiT) generates an entire sample using the full receptive field of a bidirectional Transformer
- The paper uses a Local Diffusion Transformer (LocDiT) to generate localized patches of continuous tokens
- To capitalize on DiT's context-learning potential, it adopts a context-aware diffusion approach
  - Historical patches of tokens are used as prefix input to LocDiT, which aligns the task closely with outpainting and substantially improves generation performance
- Meanwhile, existing methods such as CosyVoice explicitly delineate coarse and fine features
  1. BUT, such multi-stage methods are prone to cumulative errors
  2. DiTAR therefore condenses the tokens of each patch into an implicit lower-dimensional feature space and then expands them into high-fidelity continuous tokens with LocDiT, operating end to end

- LM Guidance

Classifier-Free Guidance (CFG) is used to improve a generative model's adherence to its condition
- In diffusion, the unconditional and conditional models share parameters and are trained jointly by intermittently omitting the condition during training
  1. At inference, the two model outputs are merged with a parameter $w$ that trades off diversity against fidelity
  2. This is equivalent to sampling from the distribution $\tilde{p}_{\theta}(z_{t}|c)\propto p_{\theta}(z_{t}|c)p_{\theta}(c|z_{t})^{w}$
    - $\theta$ : model parameters, $c$ : condition, $z_{t}$ : noisy sample at time $t$
- Continuous-token LMs typically incorporate a class label into the diffusion head and apply CFG there, but that approach lacks generality
  1. DiTAR therefore places all conditions in the LM prefix input and applies CFG in the diffusion decoder through the on-the-fly output of the AR model
    - Here the $i$-th AR output $h_{i}$ represents all historical inputs $(x_{0},x_{1},...,x_{i})$
  2. During training, $h_{i}$ is randomly replaced with a null embedding, and at inference samples are drawn from the distribution $p_{\theta}(z_{i,t}|x_{1},...,x_{i-1})p_{\theta}(x_{1},...,x_{i-1}|z_{i,t})^{w}$:
    (Eq. 2) $ \tilde{\epsilon}_{\theta}(z_{i,t},h_{i})=(1+w)\epsilon_{\theta}(z_{i,t},h_{i})-w \epsilon_{\theta}(z_{i,t})$
    - $z_{i,t}$ : the $i$-th noisy sample of the sequence at diffusion time $t$, $\epsilon_{\theta}$ : score estimated by LocDiT
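Eq. 2 is a simple linear combination of two decoder outputs, which can be sketched as below. The function name `lm_guidance` and the toy vectors are assumptions for illustration; in practice the two inputs would be LocDiT outputs with and without the LM condition.

```python
from typing import List


def lm_guidance(eps_cond: List[float], eps_uncond: List[float], w: float) -> List[float]:
    """Eq. 2: (1 + w) * eps(z, h) - w * eps(z), applied element-wise."""
    if len(eps_cond) != len(eps_uncond):
        raise ValueError("conditional and unconditional outputs must match in length")
    return [(1.0 + w) * c - w * u for c, u in zip(eps_cond, eps_uncond)]


eps_cond = [1.0, 2.0]     # toy decoder output given the LM output h_i
eps_uncond = [0.0, 1.0]   # toy decoder output given the null embedding
guided = lm_guidance(eps_cond, eps_uncond, w=2.0)
```

Setting `w=0.0` returns the conditional output unchanged, and larger `w` pushes the estimate further away from the unconditional one.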

- Temperature for Continuous-Valued LMs

Temperature-based sampling is far less established for continuous-valued LMs than for discrete-valued LMs
- The paper therefore builds a sampling method whose definition of temperature is compatible with ODE solvers
  1. Define the temperature $\tau \in[0,1]$ as the time point at which noise is introduced while solving the reverse ODE of the diffusion
  2. Given the per-patch Gaussian diffusion forward process $x_{t}=\alpha_{t}x_{0}+\sigma_{t}\epsilon$, $\alpha_{t}$ and $\sigma_{t}$ collectively define the flow path
    - $x_{0}\sim q_{data}(x_{0}),\epsilon \sim\mathcal{N}(0,I),t\in [0,1]$
  3. At $\tau=1$, sampling is equivalent to the standard ODE sampling process, in which the reverse ODE $dx_{t}=v_{\theta}(x_{t},t)dt$ is solved from $t=1$ to $t=0$
    - $x_{1}\sim\mathcal{N}(0,I)$
  4. At $\tau=0$, no random noise is introduced, so the process is completely deterministic
    - Since $0$ is the value with the highest likelihood under the standard Gaussian distribution, the paper defines sampling with $x_{1}\equiv 0$ as greedy sampling, which guarantees determinism
- For $0<\tau<1$, random noise is introduced at time $\tau$ by using the forward process to diffuse the estimated $x_{0}$
- The iterative process is:
  (Eq. 3) $x_{1}\sim \left\{\begin{matrix}
  \mathcal{N}(0,I), & \text{if}\,\,\tau=1 \\
  0, & \text{if}\,\,0\leq \tau <1 \\
  \end{matrix}\right.$
  (Eq. 4) $x_{t}=\left\{\begin{matrix}
  x_{t+\Delta t}-v_{\theta}(x_{t+\Delta t},t+\Delta t)\Delta t, & \text{if}\,\, t\neq \tau \\
  \alpha_{t}x_{\theta}(x_{t+\Delta t},t+\Delta t)+\sigma_{t}\epsilon, & \text{if}\,\,t=\tau \\
  \end{matrix}\right.$
  - $x_{\theta}$ : estimated $x_{0}$
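Eq. 3 and Eq. 4 can be sketched as a short sampling loop. Several things here are assumptions for the sake of a runnable example: $\tau$ is snapped to the nearest point of a uniform time grid, the schedule is $\alpha_{t}=\cos(\pi t/2)$, $\sigma_{t}=\sin(\pi t/2)$, $x_{\theta}$ is recovered from the velocity under that schedule, and `toy_velocity` is a hypothetical stand-in for LocDiT that is exact only when every data point equals `mu`.

```python
import math
import random
from typing import Callable, List

Velocity = Callable[[List[float], float], List[float]]


def alpha(t: float) -> float:
    return math.cos(0.5 * math.pi * t)


def sigma(t: float) -> float:
    return math.sin(0.5 * math.pi * t)


def sample_patch(v_theta: Velocity, tau: float, num_steps: int,
                 dim: int, rng: random.Random) -> List[float]:
    """Reverse-ODE sampling with noise injected at time tau (Eq. 3 and Eq. 4)."""
    # Eq. 3: Gaussian start only for tau == 1, otherwise the all-zero start.
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)] if tau == 1.0 else [0.0] * dim
    dt = 1.0 / num_steps
    noise_step = round(tau * num_steps)  # assumption: snap tau to the time grid
    for k in range(num_steps - 1, -1, -1):
        t_from, t_to = (k + 1) * dt, k * dt  # step from t + dt down to t
        v = v_theta(x, t_from)
        if k == noise_step:
            # Eq. 4, t == tau: estimate x_0, then re-noise it with the forward process.
            # x_0 = alpha * x - (2 / pi) * sigma * v holds for the cos/sin schedule.
            x0_hat = [alpha(t_from) * xi - (2.0 / math.pi) * sigma(t_from) * vi
                      for xi, vi in zip(x, v)]
            x = [alpha(t_to) * x0 + sigma(t_to) * rng.gauss(0.0, 1.0) for x0 in x0_hat]
        else:
            # Eq. 4, t != tau: plain Euler step of the reverse ODE.
            x = [xi - vi * dt for xi, vi in zip(x, v)]
    return x


def toy_velocity(mu: float) -> Velocity:
    """Exact velocity field when every data point equals mu (toy stand-in for LocDiT)."""
    def v(x: List[float], t: float) -> List[float]:
        a, s = alpha(t), sigma(t)
        # eps = (x - a * mu) / s, and v = d(alpha)/dt * mu + d(sigma)/dt * eps.
        return [0.5 * math.pi * (-s * mu + a * (xi - a * mu) / s) for xi in x]
    return v


greedy_a = sample_patch(toy_velocity(2.0), tau=0.0, num_steps=10, dim=3, rng=random.Random(0))
greedy_b = sample_patch(toy_velocity(2.0), tau=0.0, num_steps=10, dim=3, rng=random.Random(1))
mixed = sample_patch(toy_velocity(2.0), tau=0.5, num_steps=10, dim=3, rng=random.Random(0))
```

With `tau=0.0` the result does not depend on the random seed, which matches the greedy-sampling behaviour described above; with `tau=1.0` the noise-injection branch is never reached and the loop reduces to ordinary Euler sampling from a Gaussian start.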

- Implementations

Continuous Speech Tokenization
- The paper uses a Variational AutoEncoder (VAE) to convert the waveform into a distribution over the latent $z$
- The VAE encoder is a multi-layer convolutional network, and the decoder follows BigVGAN
  - A Multi-Period Discriminator (MPD) and a Multi-Scale Discriminator (MSD) are additionally used as discriminators
- As a result, a 24000Hz waveform is compressed into a 40Hz latent with 64 dimensions
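The compression figures above imply some simple arithmetic, worked out below. The sample rate, latent rate, and latent dimension come from the text; the 10-second clip and `patch_size=4` are illustrative choices, not values taken from the paper.

```python
SAMPLE_RATE_HZ = 24_000   # waveform sample rate from the text
LATENT_RATE_HZ = 40       # latent frame rate from the text
LATENT_DIM = 64           # latent dimension from the text

# Number of waveform samples summarized by one latent frame.
samples_per_frame = SAMPLE_RATE_HZ // LATENT_RATE_HZ


def latent_shape(duration_s: float) -> tuple:
    """Return (num_frames, latent_dim) for a clip of the given duration."""
    return (int(duration_s * LATENT_RATE_HZ), LATENT_DIM)


def ar_steps(duration_s: float, patch_size: int) -> int:
    """Number of LM steps when each step emits one patch (ceiling division)."""
    num_frames, _ = latent_shape(duration_s)
    return -(-num_frames // patch_size)


frames_10s = latent_shape(10.0)
steps_10s = ar_steps(10.0, patch_size=4)  # patch_size=4 is an illustrative choice
```

Each latent frame covers 600 waveform samples, so a 10-second clip becomes 400 frames of 64 dimensions, and patching by 4 would leave 100 autoregressive steps.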
Model
- DiTAR consists of a Transformer-based aggregation encoder, an LM, and a decoder (LocDiT)
  1. The encoder and decoder use a bidirectional attention mask, and the LM uses a causal attention mask
  2. All Transformers use the PreNorm architecture with RMSNorm and RoPE
- Each patch of continuous tokens is fed to the aggregation encoder together with a learnable special token placed at the beginning of the sequence
  1. The output at the special-token position is used as the aggregation embedding, and the aggregation embeddings from different patches form the input sequence of the LM
  2. The LM output is then added to the time embedding and, together with the historical context patches and the noisy target tokens, forms the LocDiT input sequence
  3. The loss is computed only on the outputs at the positions of the noisy target tokens
- During training, the LM output is randomly replaced with an all-zero vector with probability $0.1$ to enable LM guidance for LocDiT
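The LocDiT input assembly and the condition dropout can be sketched as follows. The ordering `[condition, history, noisy target]`, the shared vector width, and the function name `build_locdit_input` are assumptions; the text only states which pieces are combined and that the loss covers the noisy-target positions.

```python
import random
from typing import List, Tuple

Vector = List[float]


def build_locdit_input(h: Vector, time_emb: Vector, history: List[Vector],
                       noisy_target: List[Vector], rng: random.Random,
                       drop_prob: float = 0.1) -> Tuple[List[Vector], List[bool]]:
    """Return the LocDiT input sequence and a mask of positions used in the loss."""
    # LM guidance: with probability drop_prob, replace the LM output by zeros.
    if rng.random() < drop_prob:
        h = [0.0] * len(h)
    cond = [hi + ti for hi, ti in zip(h, time_emb)]  # LM output + time embedding
    sequence = [cond] + history + noisy_target
    # Only the noisy-target positions contribute to the diffusion loss.
    loss_mask = [False] * (1 + len(history)) + [True] * len(noisy_target)
    return sequence, loss_mask


h = [1.0, 2.0]
time_emb = [0.5, 0.5]
history = [[0.0, 0.0]]
noisy_target = [[9.0, 9.0], [8.0, 8.0]]

seq_kept, mask = build_locdit_input(h, time_emb, history, noisy_target,
                                    random.Random(0), drop_prob=0.0)
seq_dropped, _ = build_locdit_input(h, time_emb, history, noisy_target,
                                    random.Random(0), drop_prob=1.0)
```

When the condition is dropped, the first position carries only the time embedding, which is the unconditional branch that Eq. 2 relies on at inference.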
Diffusion Formulation
- The paper considers a variance-preserving diffusion process:
  (Eq. 5) $x_{t}=\alpha_{t}x_{0}+\sigma_{t}\epsilon$
  (Eq. 6) $\,\,\,\,\,\,\,=\cos\left(\frac{\pi t}{2}\right)x_{0}+\sin\left(\frac{\pi t}{2}\right)\epsilon$
  - $x_{0}\sim q(x_{0})$ : data, $\epsilon\sim\mathcal{N}(0,I)$ : standard Gaussian noise, $t\in [0,1]$
- The conditional flow-matching loss is:
  (Eq. 7) $\mathcal{L}_{diff}=\mathbb{E}_{t,x_{0},\epsilon}\left[\left|\left| v_{\theta}(x_{t},t)-v(x_{t},t)\right|\right|_{2}^{2}\right]$
  - The velocity is defined as $v(x_{t},t)=\dot{x}_{t}=\dot{\alpha}_{t}x_{0}+\dot{\sigma}_{t}\epsilon$
- At inference, the paper uses the DDIM sampler, which is equivalent to an Euler ODE sampler with respect to the signal-to-noise ratio $\frac{\alpha_{t}^{2}}{\sigma_{t}^{2}}$ rather than $t$
  - This aligns better with the semi-linear property of the diffusion ODE
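The schedule in Eq. 5 and Eq. 6 and the velocity target can be checked numerically with scalars. The `ddim_step` function is the standard deterministic DDIM update written for a velocity-predicting model under this cos/sin schedule; treating the decoder as a velocity predictor and the function names are assumptions of this sketch.

```python
import math


def alpha(t: float) -> float:
    return math.cos(0.5 * math.pi * t)


def sigma(t: float) -> float:
    return math.sin(0.5 * math.pi * t)


def forward(x0: float, eps: float, t: float) -> float:
    """Eq. 5 and Eq. 6: x_t = alpha_t * x_0 + sigma_t * eps."""
    return alpha(t) * x0 + sigma(t) * eps


def velocity(x0: float, eps: float, t: float) -> float:
    """Time derivative of x_t: d(alpha)/dt * x_0 + d(sigma)/dt * eps."""
    return 0.5 * math.pi * (-sigma(t) * x0 + alpha(t) * eps)


def ddim_step(x_t: float, v: float, t: float, s: float) -> float:
    """Deterministic DDIM update from time t to time s given a velocity estimate."""
    x0_hat = alpha(t) * x_t - (2.0 / math.pi) * sigma(t) * v
    eps_hat = sigma(t) * x_t + (2.0 / math.pi) * alpha(t) * v
    return alpha(s) * x0_hat + sigma(s) * eps_hat


x0, eps, t, s = 1.5, -0.7, 0.8, 0.3
x_t = forward(x0, eps, t)
x_s = ddim_step(x_t, velocity(x0, eps, t), t, s)

# Central finite difference of x_t in t, to compare against velocity().
step = 1e-5
finite_diff = (forward(x0, eps, t + step) - forward(x0, eps, t - step)) / (2 * step)
```

Because $\alpha_{t}^{2}+\sigma_{t}^{2}=1$, a DDIM step with the exact velocity lands exactly on the forward-process point at time $s$, whereas a plain Euler step in $t$ would only approximate it.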
Zero-Shot TTS System
- The text sequence is converted to phonemes and mapped to text embeddings through a lookup table; the speech tokens are turned into speech embeddings by the aggregation encoder and concatenated with the text embeddings
  1. The resulting embedding sequence is used as the input to the LM
  2. Additionally, a binary classifier consisting of a fully-connected layer is applied to the LM output to predict when to stop
- The loss function for zero-shot TTS is therefore:
  (Eq. 8) $\mathcal{L}=\mathcal{L}_{diff}+\mathcal{L}_{stop}$
- At inference, the prompt audio, text, and target text are given to the LM as prefix input, and the target audio is generated autoregressively

3. Experiments

- Settings

Dataset : LibriLight, Emilia
Comparisons : VALL-E, Mega-TTS2, NaturalSpeech2, NaturalSpeech3, VoiceBox, MaskGCT, E2-TTS, F5-TTS

- Results

Overall, DiTAR achieves strong performance

It also performs well in terms of MOS

Scaling Behaviors
- DiTAR's performance improves as the training data and the number of model parameters increase

Enlarging the encoder has little effect on performance compared with enlarging the language model and LocDiT

Patch Size
- Performance degrades when the LocDiT patch size is too large or too small

LM Guidance
- Using LM guidance improves the inference process of the diffusion decoder

Impact of Temperature
- Speaker diversity increases as the temperature $\tau$ increases

Efficiency
- As the batch size increases, DiTAR's throughput rises rapidly, giving better efficiency than NAR models
