[Paper Review] STACodec: Semantic Token Assignment for Balancing Acoustic Fidelity and Semantic Information in Audio Codecs

feVeRin 2026. 5. 7. 10:39

STACodec: Semantic Token Assignment for Balancing Acoustic Fidelity and Semantic Information in Audio Codecs

Existing neural codecs do not preserve semantic information effectively
STACodec
- Integrates semantic information from a Self-Supervised Learning model into the first layer of Residual Vector Quantization through Semantic Token Assignment
- Additionally uses a Semantic Pre-Distillation module to remove the dependency on a semantic tokenizer
Paper (ICASSP 2026) : Paper Link

1. Introduction

Token-based models need an audio tokenizer in order to work with speech and audio
- Discrete audio tokens can be divided into acoustic tokens and semantic tokens
  1. Acoustic tokens, such as those from EnCodec and DAC, lack semantic awareness relative to their reconstruction quality, which makes them hard to use for language modeling and semantic-related tasks
  2. Semantic tokens, such as those from HuBERT and WavLM, are mainly obtained from pre-trained Self-Supervised Learning (SSL) models
- Recent codecs such as SpeechTokenizer and X-Codec apply distillation to the Residual Vector Quantizer (RVQ) layers to improve the semantic capability of the audio codec
  - BUT, the mismatch between semantic and acoustic representations in the RVQ layers creates a trade-off

-> STACodec is therefore proposed to incorporate semantic information into an audio codec effectively

STACodec
- Performs token assignment at the first RVQ layer to guarantee semantic token alignment
- Introduces a Semantic Pre-Distillation (SPD) module to remove the dependency on an external tokenizer

< Overview of STACodec >

An acoustic-semantic tokenizer built on Semantic Token Assignment
As a result, it achieves better performance than existing codecs

2. Method

- STACodec with Semantic Token Assignment

STACodec directly integrates semantic tokens into the first layer of the Residual Vector Quantizer (RVQ) through Semantic Token Assignment (STA)
Overall Pipeline
- First, the raw audio $\mathbf{x}$ is encoded into semantic tokens $\mathbf{c}_{s}$ by a semantic tokenizer, such as $K$-means over SSL features, and into a latent acoustic feature $\mathbf{z}$ by an acoustic encoder followed by a Transformer bottleneck:
  (Eq. 1) $\mathbf{c}_{s}=\text{SemanticTokenizer}(\mathbf{x})$
  (Eq. 2) $\mathbf{e}=\text{AcousticEncoder}(\mathbf{x})$
  (Eq. 3) $\mathbf{z}=\text{TransformerBottleneck}(\mathbf{e})$
  - $\mathbf{c}_{s}\in[V]^{T}$ : semantic token sequence with vocabulary size $V$ and length $T$, $\mathbf{e}$ : acoustic encoder output, $\mathbf{z}$ : latent acoustic feature
- $\mathbf{z}$ is then quantized by the RVQ with STA:
  (Eq. 4) $\hat{\mathbf{z}}=\text{RVQ-STA}(\mathbf{z},\mathbf{c}_{s})$
- The final reconstructed audio $\hat{\mathbf{x}}$ is obtained by decoding the quantized output $\hat{\mathbf{z}}$:
  (Eq. 5) $\hat{\mathbf{x}}=\text{AcousticDecoder}(\hat{\mathbf{z}})$
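
As a rough illustration of the semantic tokenizer in (Eq. 1), the sketch below shows only the lookup step of a $K$-means tokenizer: each frame is mapped to the index of its nearest centroid. The function names, the toy centroids, and the 2-dimensional features are all hypothetical stand-ins; in practice the features would come from an SSL model and the centroids from a fitted $K$-means.

```python
# Minimal sketch: nearest-centroid assignment, i.e. the lookup step of a
# K-means semantic tokenizer. Centroids and features are toy values
# (assumptions for illustration, not taken from the paper).

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def semantic_tokenize(features, centroids):
    """Map each frame-level feature vector to its nearest centroid index.

    features  : list of T vectors (stand-in for SSL features)
    centroids : list of V vectors (stand-in for fitted K-means centers)
    returns   : list of T integers in [0, V), playing the role of c_s
    """
    tokens = []
    for frame in features:
        distances = [squared_distance(frame, c) for c in centroids]
        tokens.append(distances.index(min(distances)))
    return tokens


# Toy example: V = 3 centroids, T = 4 frames.
toy_centroids = [[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]]
toy_features = [[0.1, -0.1], [0.9, 1.2], [-0.8, 1.9], [0.4, 0.3]]
c_s = semantic_tokenize(toy_features, toy_centroids)
print(c_s)  # [0, 1, 2, 0]
```

The output has one token per frame, matching $\mathbf{c}_{s}\in[V]^{T}$.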
RVQ with Semantic Token Assignment
- In RVQ-STA, the first-layer code index at time step $t$ is assigned to the semantic token $c_{s,t}\in[V]$:
  (Eq. 6) $c_{1,t}=c_{s,t}$
- The first-layer quantized output $\hat{\mathbf{z}}_{1,t}$ is obtained by looking up code index $c_{1,t}$ in codebook $\mathbf{C}_{1}$, after which the residual is computed:
  (Eq. 7) $\hat{\mathbf{z}}_{1,t}=\mathbf{C}_{1}[c_{1,t}]$
  (Eq. 8) $\mathbf{r}_{1,t}=\mathbf{z}_{t}-\hat{\mathbf{z}}_{1,t}$
- Standard RVQ is applied to the remaining layers ($i=2,...,N_{q}$):
  (Eq. 9) $\hat{\mathbf{z}}_{i,t},\,c_{i,t}=\text{VQ}(\mathbf{r}_{i-1,t};\mathbf{C}_{i})$
  (Eq. 10) $\mathbf{r}_{i,t}=\mathbf{r}_{i-1,t}-\hat{\mathbf{z}}_{i,t}$
- The final quantized vector is obtained by summing the outputs of all $N_{q}$ layers:
  (Eq. 11) $\hat{\mathbf{z}}_{t}=\sum_{i=1}^{N_{q}}\hat{\mathbf{z}}_{i,t}$
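
The per-frame procedure in (Eq. 6) to (Eq. 11) can be sketched as follows. This is a minimal pure-Python illustration under stated assumptions, not the authors' implementation: the helper names, the toy codebooks, and the use of squared Euclidean distance for the nearest-code search are all hypothetical choices.

```python
# Minimal sketch of RVQ with Semantic Token Assignment for one frame.
# Codebooks are toy values; distance metric is an assumption.

def nearest_code(vector, codebook):
    """Index of the codebook entry closest to `vector` (squared L2)."""
    distances = [sum((v - c) ** 2 for v, c in zip(vector, code))
                 for code in codebook]
    return distances.index(min(distances))


def rvq_sta_frame(z_t, c_s_t, codebooks):
    """Quantize one latent frame z_t given its semantic token c_s_t.

    Returns (z_hat_t, indices) where indices[0] == c_s_t.
    """
    # Layer 1: the index is assigned, not searched (Eq. 6, 7).
    indices = [c_s_t]
    z_hat = list(codebooks[0][c_s_t])
    residual = [z - q for z, q in zip(z_t, z_hat)]        # Eq. 8

    # Layers 2..N_q: standard nearest-neighbor RVQ (Eq. 9, 10).
    for codebook in codebooks[1:]:
        idx = nearest_code(residual, codebook)
        code = codebook[idx]
        indices.append(idx)
        z_hat = [a + b for a, b in zip(z_hat, code)]      # running sum, Eq. 11
        residual = [r - q for r, q in zip(residual, code)]
    return z_hat, indices


# Toy example with N_q = 3 layers and 2-dimensional latents.
toy_codebooks = [
    [[0.0, 0.0], [0.5, 1.5], [2.0, 2.0]],    # C_1, indexed by semantic token
    [[-1.0, 0.0], [0.5, 0.5], [0.0, -0.5]],  # C_2
    [[0.25, 0.25], [0.0, 0.0]],              # C_3
]
z_hat, indices = rvq_sta_frame([1.0, 2.0], 2, toy_codebooks)
print(indices, z_hat)  # [2, 0, 1] [1.0, 2.0]
```

In this toy case the nearest entry of $\mathbf{C}_{1}$ to $\mathbf{z}_{t}$ would be index 1, but the assigned semantic token forces index 2, and the later layers absorb the larger residual.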

- Semantic Pre-Distillation

To remove the SSL-based semantic tokenizer and improve inference efficiency, the paper introduces a Transformer-based Semantic Pre-Distillation (SPD) module that predicts the semantic tokens to be assigned to the first RVQ layer
- Unlike SpeechTokenizer and X-Codec, which perform semantic distillation after quantization, SPD applies distillation before quantization, which alleviates the negative impact on the acoustic decoder input
- To mitigate overfitting, masking along the temporal and feature dimensions is applied to the SPD module input
  1. In each dimension, contiguous feature segments or feature channels are randomly masked with a specified probability
  2. The distilled semantic tokens $\hat{\mathbf{c}}_{s}$ are then:
    (Eq. 12) $\hat{\mathbf{c}}_{s}=\text{SPD}(\text{Mask}(\mathbf{e}))$
  3. During the quantization in (Eq. 4), the original semantic tokens $\mathbf{c}_{s}$ are replaced by the distilled tokens $\hat{\mathbf{c}}_{s}$
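
The $\text{Mask}(\cdot)$ operation in (Eq. 12) could look roughly like the sketch below. The review does not give the masking probabilities, the span length, or the fill value, so `time_prob`, `span`, `channel_prob`, and zero-filling are all assumptions chosen only for illustration.

```python
import random

# Minimal sketch of temporal/feature masking on the SPD input.
# All hyperparameters and the zero fill value are hypothetical.

def mask_features(e, time_prob=0.1, span=2, channel_prob=0.1, seed=None):
    """Return a masked copy of e (a list of T frames, each with D floats).

    Temporal masking : each frame starts a zeroed span of `span` frames
                       with probability `time_prob`.
    Feature masking  : each channel is zeroed across all frames with
                       probability `channel_prob`.
    """
    rng = random.Random(seed)
    num_frames = len(e)
    dim = len(e[0]) if num_frames else 0
    masked = [list(frame) for frame in e]  # copy so the input is untouched

    for t in range(num_frames):
        if rng.random() < time_prob:
            for u in range(t, min(t + span, num_frames)):
                masked[u] = [0.0] * dim

    for d in range(dim):
        if rng.random() < channel_prob:
            for t in range(num_frames):
                masked[t][d] = 0.0
    return masked


# Toy encoder output with T = 5 frames and D = 3 channels.
toy_e = [[float(t + d) + 1.0 for d in range(3)] for t in range(5)]
masked_e = mask_features(toy_e, time_prob=0.3, channel_prob=0.3, seed=0)
print(masked_e)
```

Masking is a training-time regularizer here; at inference the SPD module would presumably see the unmasked $\mathbf{e}$.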

- Training Objective

Following EnCodec, STACodec uses a training objective $\mathcal{L}_{codec}$ composed of a reconstruction loss, a perceptual loss from the discriminators, and an RVQ commitment loss
- For SPD, a Cross-Entropy loss guides the semantic token prediction:
  (Eq. 13) $\mathcal{L}_{spd}=\text{CrossEntropy}(\hat{\mathbf{c}}_{s},\mathbf{c}_{s})$
  - $\hat{\mathbf{c}}_{s}$ : semantic tokens predicted by SPD, $\mathbf{c}_{s}$ : ground-truth tokens
- The overall objective is then:
  (Eq. 14) $\mathcal{L}=\mathcal{L}_{codec}+\lambda\mathcal{L}_{spd}$
  - $\lambda$ : weight
- For stable optimization, STACodec is trained in 2 stages
  - First, only $\mathcal{L}_{codec}$ is used to establish the reconstruction ability, then $\mathcal{L}_{codec}$ and $\mathcal{L}_{spd}$ are learned jointly
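
A small sketch of (Eq. 13) and (Eq. 14) with the 2-stage schedule is given below. It assumes the SPD module outputs per-frame logits over the $V$ semantic tokens, which the review does not state explicitly, and the function names and the `stage` argument are hypothetical.

```python
import math

# Minimal sketch of the SPD cross-entropy and the staged total loss.
# Assumes SPD emits per-frame logits of size V (an assumption).

def cross_entropy(logits, targets):
    """Mean cross-entropy over T frames.

    logits  : list of T lists, each with V unnormalized scores
    targets : list of T ground-truth token indices (c_s)
    """
    total = 0.0
    for frame_logits, target in zip(logits, targets):
        peak = max(frame_logits)  # subtract the max for numerical stability
        log_sum_exp = peak + math.log(
            sum(math.exp(x - peak) for x in frame_logits))
        total += log_sum_exp - frame_logits[target]
    return total / len(targets)


def total_loss(loss_codec, loss_spd, lam, stage):
    """Stage 1 uses only L_codec; stage 2 adds lam * L_spd (Eq. 14)."""
    if stage == 1:
        return loss_codec
    return loss_codec + lam * loss_spd


# Toy check: uniform logits over V = 4 tokens give a loss of log(4).
toy_logits = [[0.0, 0.0, 0.0, 0.0] for _ in range(3)]
toy_targets = [0, 1, 2]
loss_spd = cross_entropy(toy_logits, toy_targets)
print(loss_spd)
print(total_loss(1.0, loss_spd, lam=0.5, stage=1))
print(total_loss(1.0, loss_spd, lam=0.5, stage=2))
```

The values of `loss_codec` and `lam` above are placeholders; the review does not report the weight used in the paper.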

3. Experiments

- Settings

Dataset : LibriSpeech
Comparisons : SpeechTokenizer, X-Codec, PAST, HARSD

- Results

Overall, STACodec achieves the best performance among the compared models

STACodec shows balanced codebook utilization across all layers
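
The review does not say how codebook utilization is measured. One common way to quantify it, shown here purely as an assumption, is the fraction of codes that are used at least once together with the entropy of the usage distribution normalized by $\log V$ (1.0 means perfectly balanced usage). The function name is hypothetical.

```python
import math
from collections import Counter

# Minimal sketch of two codebook-utilization statistics for one RVQ layer.
# These metrics are an assumption, not necessarily those used in the paper.

def codebook_utilization(indices, codebook_size):
    """Return (used_fraction, normalized_entropy) for a list of code indices."""
    counts = Counter(indices)
    total = len(indices)
    used_fraction = len(counts) / codebook_size
    entropy = -sum((n / total) * math.log(n / total) for n in counts.values())
    if codebook_size > 1:
        normalized_entropy = entropy / math.log(codebook_size)
    else:
        normalized_entropy = 0.0
    return used_fraction, normalized_entropy


balanced = codebook_utilization([0, 1, 2, 3], codebook_size=4)
collapsed = codebook_utilization([0, 0, 0, 0], codebook_size=4)
print(balanced)
print(collapsed)
```

Running this per layer on the indices emitted by the quantizer would give one pair of numbers per RVQ layer to compare.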

Ablation Study
- Each component contributes to the performance improvement
