[Paper Review] TaDiCodec: Text-aware Diffusion Speech Tokenizer for Speech Language Modeling

feVeRin 2025. 11. 11. 13:02

TaDiCodec: Text-aware Diffusion Speech Tokenizer for Speech Language Modeling

Existing speech tokenizers are limited by high frame rates, dependence on auxiliary pre-trained models, and complex training processes
TaDiCodec
- Uses a Diffusion AutoEncoder to optimize quantization and reconstruction end-to-end
- Integrates text guidance into the diffusion decoder to achieve optimal compression
Paper (NeurIPS 2025) : Paper Link

1. Introduction

Large Language Model (LLM)-based speech models such as VALL-E, CosyVoice, and SPEAR-TTS use a speech tokenizer to convert continuous speech signals into discrete token sequences
- BUT, existing speech tokenizers are suboptimal for speech language modeling
  1. EnCodec, SoundStream, DAC, and similar codecs target speech signal compression/transmission, and their multi-layer Residual Vector Quantization (RVQ) and high bitrates make them inefficient for language modeling
  2. WavTokenizer, in contrast, has the advantage of being a single-layer tokenizer, but its reconstruction quality is lower than that of RVQ-based tokenizers
- In addition, most speech tokenizers perform acoustic-level reconstruction, so they produce discrete representations that lack semantic richness, which causes a reconstruction-generation gap
  - That is, for language modeling, speech tokens need both a low frame rate and semantic richness
- To address this, one can extract features from a Self-Supervised Learning (SSL) model and decompose them into semantic/acoustic tokens, as in X-Codec, DualCodec, and SpeechTokenizer
  1. This leads to a two-stage design: quantize the SSL-based features, then train a diffusion model that reconstructs speech conditioned on those tokens
  2. BUT, this approach suffers from the complexity of two-stage training, dependence on a pre-trained SSL model, and difficulty handling ultra-low token rates

-> The paper therefore proposes TaDiCodec, a single-codebook tokenizer that supports a low frame rate, high-fidelity reconstruction, and robust speech language modeling

TaDiCodec
- Unifies quantization and reconstruction inside an end-to-end diffusion autoencoder, removing the need for semantic distillation or for any training loss other than the diffusion loss
- Incorporates text prompt guidance into the diffusion decoder to improve reconstruction quality

< Overview of TaDiCodec >

A low-bitrate speech tokenizer built on an end-to-end diffusion autoencoder with text prompt guidance
As a result, it outperforms existing speech tokenizers

2. Method

- Speech Tokenization with Diffusion Transformer AutoEncoder

TaDiCodec uses the mel-spectrogram as both the input and the reconstruction target
- Let the input mel-spectrogram be $x\in \mathbb{R}^{T\times d}$, where $T$ is the number of frames and $d$ the number of mel bins
  1. The tokenizer encoder $\mathcal{E}$ transforms $x$ into a latent embedding sequence $\mathcal{E}(x)$
  2. The Vector Quantization (VQ) module $\mathcal{Q}$ quantizes this embedding into a discrete token sequence $q=\mathcal{Q}(\mathcal{E}(x))\in\mathbb{Z}^{T_{q}\times 1}$
    - $T_{q}$ : token sequence length, equal to $T$ divided by a pre-defined downsampling factor
  3. For $i\in [0,T_{q})$, each token $q_{i}$ corresponds to a codebook index, and the decoder $\mathcal{D}$ reconstructs the mel-spectrogram as $\hat{x}=\mathcal{D}(q)$
- Architecturally, TaDiCodec is built on Transformers and, for reconstruction training, adopts a Flow Matching-based decoder with a diffusion loss, which gives stable optimization
- During training, a noise level $t\in [0,1]$ is sampled at random, Gaussian noise $\epsilon$ is sampled, and the noisy target $x_{t}$ is formed by the linear interpolation $x_{t}=tx+(1-t)\epsilon$
  1. The model is then trained to predict the velocity field $v$, defined as the derivative of $x_{t}$ with respect to $t$
  2. That is, $v=\frac{dx_{t}}{dt}=x-\epsilon$
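
The interpolation and velocity target above can be sketched in a few lines. This is only an illustrative sketch in plain Python lists, not the paper's implementation: the helper name `make_flow_matching_pair`, the toy 2x3 "mel-spectrogram", and the fixed seed are all assumptions made for the example.

```python
import random

def make_flow_matching_pair(x, t, rng):
    """Build the noisy input x_t and the velocity target v for one clean target x.

    x is a list of frames (each a list of floats); t is a noise level in [0, 1].
    Hypothetical helper for illustration only.
    """
    # epsilon ~ N(0, 1), sampled independently for every element of x
    eps = [[rng.gauss(0.0, 1.0) for _ in row] for row in x]
    # Linear interpolation: x_t = t * x + (1 - t) * epsilon
    x_t = [[t * a + (1.0 - t) * b for a, b in zip(xr, er)] for xr, er in zip(x, eps)]
    # Velocity target: v = dx_t/dt = x - epsilon (does not depend on t)
    v = [[a - b for a, b in zip(xr, er)] for xr, er in zip(x, eps)]
    return x_t, v, eps

rng = random.Random(0)                       # fixed seed, only for reproducibility
x = [[0.5, -1.0, 2.0], [0.0, 1.5, -0.5]]     # toy target with T=2 frames, d=3 bins
t = rng.random()                             # randomly sampled noise level in [0, 1)
x_t, v, eps = make_flow_matching_pair(x, t, rng)
```

With this convention $t=1$ gives the clean target and $t=0$ gives pure noise, so following the velocity from $x_{t}$ for the remaining time $(1-t)$ lands exactly on $x$: $x_{t}+(1-t)v=x$.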

- Binary Spherical Quantization

For quantization, the paper adopts Binary Spherical Quantization (BSQ), which does not use an explicit learnable codebook
- The encoder output $\mathcal{E}(x)$ is first downsampled, then linearly projected to obtain a low-dimensional latent sequence
  1. That is, $h=\text{Linear}(\text{Downsample}(\mathcal{E}(x)))\in\mathbb{R}^{T_{q}\times L}$
    - $T_{q}$ : number of quantized frames, $L$ : latent dimension
  2. Each vector $h_{t}\in\mathbb{R}^{L}$ of $h$ is projected onto the unit sphere: $u_{t}=\frac{h_{t}}{|| h_{t}||}$ (here $t$ indexes quantized frames, not the noise level)
  3. Binary quantization is applied independently to each dimension: $\hat{u}_{t}=\frac{1}{\sqrt{L}}\text{sign}(u_{t})$
    - $\text{sign}(x)$ : element-wise sign function
  4. To let gradients flow through the quantization step, the Straight-Through Estimator (STE) $\text{sign}_{STE}(x)=\text{sg}(\text{sign}(x)-x)+x$ is applied
    - $\text{sg}(\cdot)$ : stop-gradient operation
  5. The quantized latent sequence $\hat{u}\in\mathbb{R}^{T_{q}\times L}$ is then mapped back to the $d$-dimensional space and upsampled to the original temporal resolution: $\text{Upsample}(\text{Linear}(\hat{u}))\in \mathbb{R}^{T\times d}$
- Each latent vector $h_{t}$ then yields a discrete token index, computed from the signs of its components (normalization does not change signs, so $\text{sign}(h_{t})=\text{sign}(u_{t})$):
  (Eq. 1) $ k_{t}=\sum_{i=1}^{L}1_{[h_{t,i}>0]}\cdot 2^{i-1}$
  - $1_{[\cdot]}$ : indicator function
- Because the quantization error of BSQ is theoretically bounded, no commitment loss is needed, so the whole system can be trained end-to-end with the diffusion loss alone
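
The BSQ steps for a single latent vector can be sketched as follows. This is a minimal forward-pass illustration, not the paper's code: the function name `bsq_quantize` and the 4-dimensional example vector are assumptions, a zero component is treated as negative to match the indicator $h_{t,i}>0$ in (Eq. 1), and the STE is left out because it only matters under automatic differentiation.

```python
import math

def bsq_quantize(h_t):
    """Quantize one non-zero latent vector h_t of length L with BSQ.

    Returns (u_hat, k): the quantized unit vector and the token index of Eq. 1.
    Hypothetical helper for illustration only.
    """
    L = len(h_t)
    norm = math.sqrt(sum(v * v for v in h_t))
    u_t = [v / norm for v in h_t]                       # project onto the unit sphere
    # Element-wise sign; 0 is mapped to -1 here (an assumption, see lead-in)
    signs = [1.0 if v > 0 else -1.0 for v in u_t]
    u_hat = [s / math.sqrt(L) for s in signs]           # scaled so that ||u_hat|| = 1
    # Eq. 1: dimension i (1-indexed) contributes 2^(i-1) when h_{t,i} > 0
    k = sum(1 << i for i, v in enumerate(h_t) if v > 0)
    return u_hat, k

u_hat, k = bsq_quantize([0.3, -1.2, 0.8, 2.0])
print(u_hat)  # [0.5, -0.5, 0.5, 0.5]
print(k)      # 13  (bits 0, 2 and 3 set: 1 + 4 + 8)
```

Since every dimension carries one bit, the implicit codebook has $2^{L}$ entries and each token costs $L$ bits, without any codebook embedding table to learn.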

- Text-aware De-Tokenization

Existing speech tokenizers use only speech features for reconstruction, but in speech language modeling the text of the utterance is also available
- For example, in Text-to-Speech (TTS) the target text is always known
- The paper therefore introduces Text-aware De-Tokenization, which conditions the decoder on the text sequence $x_{text}$
  1. To improve reconstruction quality, especially in the extreme-compression (very low token rate) setting, a prompt mechanism is applied following MaskGCT, VoiceBox, F5-TTS, and E2-TTS
  2. During training, for a mel-spectrogram with $L$ frames in total (this $L$ is the frame count, not the BSQ latent dimension), a segment length $l\sim \text{Uniform}(0,0.25L)$ is sampled at random and a segment of that length is taken from the input mel-spectrogram as the prefix $x_{prompt}$
    - The prefix is kept clean, with no added noise, and the loss is computed only on the noisy portion of the sequence
  3. In practice, removing the text conditioning causes a substantial performance drop at extremely low token rates
- Unlike existing methods that follow a two-stage pipeline, TaDiCodec learns feature quantization and reconstruction jointly, end-to-end
- The overall training objective is then:
  (Eq. 2) $ \mathcal{L}_{diff}=\mathbb{E}_{(x,x_{text}),\epsilon,t}\left[\left|\left| (x-\epsilon)-\mathcal{D}_{\phi}\left(\mathcal{Q}(\mathcal{E}_{\theta}(x)),x_{t},t,x_{text}\right)\right|\right|\right]$
  - $\mathcal{E}_{\theta}, \mathcal{D}_{\phi}$ : encoder and decoder, parameterized by $\theta$ and $\phi$ respectively
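
The prompt-prefix sampling and the "loss only on the noisy portion" rule can be sketched as below. Everything here is an assumption made for illustration: the helper names, rounding the sampled length down to whole frames, and averaging the per-frame L2 norm over the noisy frames (the text does not say how the norm in (Eq. 2) is reduced). The encoder, quantizer, and decoder themselves are not modeled.

```python
import math
import random

def sample_prompt_length(num_frames, rng, max_ratio=0.25):
    """Draw l ~ Uniform(0, max_ratio * num_frames), rounded down to whole frames."""
    return int(rng.uniform(0.0, max_ratio * num_frames))

def build_decoder_input(x, eps, t, prompt_len):
    """Keep the first prompt_len frames clean; noise the rest as t*x + (1-t)*eps."""
    mixed = []
    for i, (x_row, e_row) in enumerate(zip(x, eps)):
        if i < prompt_len:
            mixed.append(list(x_row))                    # clean prompt frame
        else:
            mixed.append([t * a + (1.0 - t) * b for a, b in zip(x_row, e_row)])
    loss_mask = [0 if i < prompt_len else 1 for i in range(len(x))]
    return mixed, loss_mask

def masked_velocity_loss(v_pred, x, eps, loss_mask):
    """Mean per-frame L2 norm of (x - eps) - v_pred, over noisy frames only."""
    total, count = 0.0, 0
    for vp_row, x_row, e_row, m in zip(v_pred, x, eps, loss_mask):
        if m:
            sq = sum(((a - b) - p) ** 2 for a, b, p in zip(x_row, e_row, vp_row))
            total += math.sqrt(sq)
            count += 1
    return total / max(count, 1)

rng = random.Random(0)                                   # fixed seed for the example
T, d = 40, 4                                             # toy sizes
x = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(T)]
eps = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(T)]
t = rng.random()
prompt_len = sample_prompt_length(T, rng)
decoder_input, loss_mask = build_decoder_input(x, eps, t, prompt_len)

# Stand-in for the decoder output: a perfect prediction gives zero loss
v_true = [[a - b for a, b in zip(xr, er)] for xr, er in zip(x, eps)]
loss = masked_velocity_loss(v_true, x, eps, loss_mask)
```

In a real training step `v_true` would be replaced by the decoder's prediction given the quantized tokens, the partly noised input, $t$, and $x_{text}$.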

- Speech Language Modeling with TaDiCodec

The paper builds an AR+Diffusion pipeline that applies the TaDiCodec tokenizer to large-scale multilingual zero-shot TTS
- That is, an autoregressive model predicts the speech tokens $q$ from the text $x_{text}$, and these tokens together with the text are passed to TaDiCodec's diffusion decoder to generate speech
- The AR model, parameterized by $\psi$, is optimized to minimize the negative log-likelihood of the target token sequence, conditioned on the input text and the previously predicted tokens:
  (Eq. 3) $ \mathcal{L}_{AR}=-\mathbb{E}_{(q,x_{text})}\sum_{i=1}^{T_{q}}\log p\left(q_{i}|q_{<i}, x_{text};\psi\right)$
  - $q_{i}$ : the $i$-th token of $q$
- In addition, the paper also models the speech tokens with non-autoregressive Masked Generative Modeling (MGM)
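
For a single utterance, the sum inside (Eq. 3) is just the negative log-probability the model assigns to each target token, added up over the sequence. A toy sketch, assuming the per-step distributions are already given (the name `ar_nll`, the 4-token vocabulary, and the probabilities are made up for illustration; no actual AR model is involved):

```python
import math

def ar_nll(step_probs, target_tokens):
    """Negative log-likelihood of one token sequence.

    step_probs[i] is the distribution p(. | q_<i, x_text) over the vocabulary
    at step i; target_tokens[i] is the ground-truth token q_i.
    Hypothetical helper for illustration only.
    """
    return -sum(math.log(p[tok]) for p, tok in zip(step_probs, target_tokens))

# Made-up model outputs for a 3-token sequence over a 4-token vocabulary
step_probs = [
    [0.70, 0.10, 0.10, 0.10],
    [0.25, 0.25, 0.25, 0.25],
    [0.10, 0.10, 0.10, 0.70],
]
q = [0, 2, 3]
loss = ar_nll(step_probs, q)   # -(log 0.7 + log 0.25 + log 0.7)
```

The expectation in (Eq. 3) then corresponds to averaging this quantity over (token sequence, text) pairs from the training set.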

3. Experiments

- Settings

Dataset : Emilia
Comparisons : EnCodec, DAC, SpeechTokenizer, X-Codec, WavTokenizer, DualCodec, BigCodec, TAAE, Mimi, BiCodec

- Results

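Overall, TaDiCodec achieves the best performance among the compared tokenizers

It also performs well in multilingual speech reconstruction

It also achieves strong MOS results

Ablation Study
- Each component contributes to the performance gain

Zero-Shot TTS
- Using TaDiCodec also yields better zero-shot TTS performance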
