[Paper Review] SACodec: Asymmetric Quantization with Semantic Anchoring for Low-Bitrate High-Fidelity Neural Speech Codecs

feVeRin 2026. 4. 1. 13:01

SACodec: Asymmetric Quantization with Semantic Anchoring for Low-Bitrate High-Fidelity Neural Speech Codecs

Neural speech codecs face a fundamental trade-off at low bitrates between injecting semantics and modeling acoustics
SACodec
- Introduces an asymmetric dual quantizer built on a Semantic Anchoring mechanism
- Decouples the quantization of semantic and acoustic detail to ensure codebook utilization and fine-grained information reconstruction
Paper (AAAI 2026) : Paper Link

1. Introduction

Neural speech codecs convert a high-dimensional signal into a low-dimensional symbol sequence
- BUT, existing neural codecs such as EnCodec and DAC rely on multi-layer Residual Vector Quantization (RVQ), so a fundamental trade-off appears at low bitrates
  - In particular, for Speech Language Models, quantization error can lead to audible artifacts and added modeling complexity
- A single-codebook codec such as WavTokenizer is one alternative, but its acoustic-oriented optimization leaves it without explicit semantic structure
  - Conversely, infusing semantics into a codec either causes codebook collapse or requires a complex multi-layer RVQ backend to preserve reconstruction quality

-> The paper therefore proposes SACodec, which balances the trade-off between semantic injection and acoustic modeling

SACodec
- A semantic anchoring mechanism built on a fixed, large-scale mHuBERT codebook injects a strong semantic prior and prevents codebook collapse in the semantic layer
- An asymmetric dual quantizer architecture quantizes distinct semantic and acoustic information, compensating for fine-grained acoustic detail

< Overview of SACodec >

A neural codec that uses a semantic anchoring mechanism and an asymmetric dual quantizer
As a result, it outperforms existing codecs

2. Method

SACodec adds the asymmetric dual quantizer's embeddings $\mathbf{e}_{1}, \mathbf{e}_{2}$ element-wise to obtain the final representation $\mathbf{e}_{final}=\mathbf{e}_{1}+\mathbf{e}_{2}$
- This representation is then passed to the decoder to reconstruct the high-fidelity waveform $\hat{\mathbf{x}}$

- Overall Framework

SACodec is a GAN-based end-to-end framework with three components: encoder, asymmetric dual quantizer, and decoder
- The encoder takes the speech waveform $\mathbf{x}\in\mathbb{R}^{L}$ as input
  1. Structurally, it follows EnCodec: a convolutional stack with ELU activations and a 2-layer LSTM
    - That is, strided convolutions downsample the 24kHz waveform by $320\times$, giving a 75Hz frame rate
  2. A final linear layer projects the features to the target dimension $D$, producing the continuous latent representation $\mathbf{h}\in\mathbb{R}^{T\times D}$
- The latent representation $\mathbf{h}$ is then passed to the asymmetric dual quantizer
  1. First, the semantic anchoring module $\mathbf{Q}_{1}$ quantizes $\mathbf{h}$ against the projected mHuBERT codebook to extract the core semantic content and produce the embedding $\mathbf{e}_{1}$
  2. The resulting acoustic residual is quantized by the residual activation module $\mathbf{Q}_{2}$, which uses the SimVQ technique so that the embedding $\mathbf{e}_{2}$ captures fine-grained acoustic detail
- The quantized embeddings of the two modules are fused by element-wise addition $\mathbf{e}_{final}=\mathbf{e}_{1}+\mathbf{e}_{2}$ and passed to the decoder
  1. The decoder follows Vocos, decoupling feature processing from signal synthesis
    - A ConvNeXt-attention backbone models local/global dependencies in the feature sequence
  2. The features are then projected to a complex spectrogram and converted to the output waveform $\hat{\mathbf{x}}$ via iSTFT
- The encoder-quantizer-decoder is trained adversarially with multi-scale and multi-period discriminators
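
The frame-rate figure above can be checked with a few lines of arithmetic. The sketch below also derives a rough bitrate under the assumption (not stated in this post) that each frame carries exactly one index per quantizer with no entropy coding; the codebook sizes $K_{1}=1000$ and $K_{2}=1024$ are the ones given in the quantizer section, and the function names are illustrative only.

```python
import math

SAMPLE_RATE_HZ = 24_000  # input sample rate from the text
DOWNSAMPLE = 320         # total stride of the encoder from the text
K1, K2 = 1000, 1024      # codebook sizes of Q1 and Q2 from the text


def frame_rate(sample_rate_hz, downsample):
    """Frames per second after strided downsampling."""
    return sample_rate_hz / downsample


def bitrate_bps(frame_rate_hz, codebook_sizes):
    """Bits per second, assuming one index per codebook per frame."""
    bits_per_frame = sum(math.log2(k) for k in codebook_sizes)
    return frame_rate_hz * bits_per_frame


fps = frame_rate(SAMPLE_RATE_HZ, DOWNSAMPLE)
print(fps)                                # 75.0
print(round(bitrate_bps(fps, [K1, K2])))  # 1497
```

Under these assumptions the two indices cost a little under 20 bits per frame, which puts the stream at roughly 1.5 kbps.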

- Asymmetric Dual Quantizer

Semantic Anchoring Module
- To prevent the codebook collapse of a conventional learnable VQ and to inject a strong semantic prior directly, the semantic anchoring quantizer $\mathbf{Q}_{1}$ is built on fixed external knowledge
  - Specifically, it uses a publicly available semantic codebook $\mathbf{C}_{sem}\in\mathbb{R}^{K_{1}\times D_{s}}$, obtained by clustering mHuBERT features into $K_{1}=1000$ centroids
- A codebook-space projection strategy bridges the distributional gap between the encoder's acoustic representation $\mathbf{h}$ and the fixed semantic space
  1. A lightweight linear projector $\mathbf{P}_{sem}$ is learned that dynamically adapts the entire frozen codebook $\mathbf{C}_{sem}$, transforming it into the effective codebook $\mathcal{C}_{1}$:
    (Eq. 1) $\mathcal{C}_{1}=\mathbf{P}_{sem}(\mathbf{C}_{sem})$
    - $\mathbf{P}_{sem}$ : maps the source codebook to the encoder latent space dimension $D$
  2. For each frame $\mathbf{h}_{t}$, the quantization index $i_{t}$ and the embedding $\mathbf{e}_{1,t}=\mathbf{c}_{1,i_{t}}$ are found by nearest-neighbor lookup in the adapted codebook:
    (Eq. 2) $ i_{t}=\arg\min_{k}||\mathbf{h}_{t}-\mathbf{c}_{1,k}||_{2}^{2},\,\,\,\text{where}\,\,\mathbf{c}_{1,k} \in\mathcal{C}_{1}$
    - This global transformation improves full codebook utilization and reconstruction quality
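
Eq. 1 and Eq. 2 can be sketched in plain Python as follows. This is a toy illustration, not the authors' implementation: the sizes are tiny, the frozen codebook is random rather than real mHuBERT centroids, and the projector is assumed to be a bias-free matrix filled with random values where training would supply learned ones. The helper names `project` and `nearest` are hypothetical.

```python
import random


def project(codebook, weight):
    """Apply a linear map (Ds x D matrix, no bias) to every codebook row."""
    return [[sum(row[i] * weight[i][j] for i in range(len(row)))
             for j in range(len(weight[0]))]
            for row in codebook]


def nearest(vec, codebook):
    """Eq. 2: index and entry with the smallest squared L2 distance."""
    dists = [sum((a - b) ** 2 for a, b in zip(vec, entry))
             for entry in codebook]
    idx = dists.index(min(dists))
    return idx, codebook[idx]


rng = random.Random(0)
K1, Ds, D = 8, 4, 3  # toy sizes; the text uses K1 = 1000

# Frozen semantic codebook C_sem (random stand-in for mHuBERT centroids).
C_sem = [[rng.gauss(0, 1) for _ in range(Ds)] for _ in range(K1)]
# Projector P_sem: learned during training, random here for illustration.
P_sem = [[rng.gauss(0, 1) for _ in range(D)] for _ in range(Ds)]

C1 = project(C_sem, P_sem)        # Eq. 1: effective codebook, K1 x D
h_t = [c + 0.01 for c in C1[5]]   # hypothetical frame near entry 5
i_t, e1_t = nearest(h_t, C1)      # Eq. 2: index and semantic embedding
print(i_t, e1_t)
```

Because only the projector is trainable, every update moves all $K_{1}$ projected entries together while the underlying centroids stay fixed, which is the sense in which the transformation is global.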
Residual Activation Module
- The semantic embedding $\mathbf{e}_{1,t}$ captures content but discards perceptual acoustic detail
- Let the acoustic residual $\mathbf{r}_{t}$ denote what remains, i.e. paralinguistic attributes such as speaker timbre, prosodic rhythm, and speaking style:
  (Eq. 3) $\mathbf{r}_{t}=\mathbf{h}_{t}-\mathbf{e}_{1,t}$
- To quantize this residual, the second quantizer module $\mathbf{Q}_{2}$ adopts a single-layer vector quantizer enhanced with SimVQ
  1. Rather than learning the residual codebook directly, SimVQ reparameterizes it as the product of a frozen, randomly initialized coefficient matrix $\mathbf{C}_{coeff}\in\mathbb{R}^{K_{2}\times d}$ and a learnable linear latent basis $\mathbf{W}_{basis}\in\mathbb{R}^{d\times D}$:
    (Eq. 4) $\mathcal{C}_{2}=\mathbf{C}_{coeff}\times \mathbf{W}_{basis}$
  2. During training only $\mathbf{W}_{basis}$ is updated; gradients flow back to the shared basis, so the entire residual codebook $\mathcal{C}_{2}$ is updated globally
    - This ensures full codebook activation across the $K_{2}=1024$ entries
  3. The quantized residual embedding $\mathbf{e}_{2,t}$ is then found by nearest-neighbor lookup in $\mathcal{C}_{2}$
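
The residual path (Eq. 3, Eq. 4) and the final fusion can be sketched the same way. Again this is a toy illustration under stated assumptions: the sizes are tiny, `h_t` and `e1_t` are made-up vectors, and a uniform nudge of $\mathbf{W}_{basis}$ stands in for a real gradient step. It only aims to show why changing the shared basis moves every codebook entry at once.

```python
import random


def matmul(a, b):
    """Plain-Python matrix product of a (n x m) and b (m x p)."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def nearest(vec, codebook):
    """Index and entry with the smallest squared L2 distance to vec."""
    dists = [sum((x - y) ** 2 for x, y in zip(vec, entry))
             for entry in codebook]
    idx = dists.index(min(dists))
    return idx, codebook[idx]


rng = random.Random(0)
K2, d, D = 16, 2, 3  # toy sizes; the text uses K2 = 1024

C_coeff = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(K2)]  # frozen
W_basis = [[rng.gauss(0, 1) for _ in range(D)] for _ in range(d)]   # learnable
C2 = matmul(C_coeff, W_basis)                    # Eq. 4: K2 x D codebook

h_t = [0.9, -0.4, 0.3]    # hypothetical encoder frame
e1_t = [1.0, -0.5, 0.0]   # hypothetical semantic embedding from Q1
r_t = [h - e for h, e in zip(h_t, e1_t)]         # Eq. 3: acoustic residual
j_t, e2_t = nearest(r_t, C2)                     # residual quantization
e_final_t = [a + b for a, b in zip(e1_t, e2_t)]  # fused decoder input

# Stand-in for one update of W_basis: every row of C2 shifts together.
W_nudged = [[w + 0.1 for w in row] for row in W_basis]
C2_nudged = matmul(C_coeff, W_nudged)
moved = sum(old != new for old, new in zip(C2, C2_nudged))
print(moved, "of", K2, "entries moved")
```

A directly learned codebook would only update the entries that were actually selected in a batch; here unselected entries move too, which is the property the text credits for full codebook activation.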

- Training Objective

SACodec is trained end-to-end within a GAN framework
- The generator $G$ is optimized with a composite loss covering reconstruction, perceptual quality, and quantization stability, while the discriminator set $\{D_{k}\}$ is trained to distinguish real from generated audio
- The overall generator loss $\mathcal{L}_{G}$ is:
  (Eq. 5) $\mathcal{L}_{G}=\lambda_{rec}\mathcal{L}_{rec}+\lambda_{adv}\mathcal{L}_{adv}+\lambda_{feat}\mathcal{L}_{feat}+\lambda_{c1}\mathcal{L}_{com,1}+\lambda_{c2}\mathcal{L}_{com,2}$
  - $\mathcal{L}_{rec}$ : multi-scale mel-spectrogram reconstruction loss
  - $\mathcal{L}_{adv}$ : adversarial loss
  - $\mathcal{L}_{feat}$ : feature matching loss
  - $\mathcal{L}_{com,1}$, $\mathcal{L}_{com,2}$ : commitment losses for the semantic and residual quantizers
  - $\lambda_{rec}=45.0, \lambda_{adv}=1.0, \lambda_{feat}=1.0, \lambda_{c1}=25.0, \lambda_{c2}=5.0$ : weights
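
Eq. 5 is a plain weighted sum, which the sketch below spells out with the weights listed above. The individual loss values in `example` are made up to show the relative scale of each term; the dictionary keys and the function name are illustrative and not taken from the paper's code.

```python
# Loss weights from Eq. 5 (the keys are illustrative names).
WEIGHTS = {"rec": 45.0, "adv": 1.0, "feat": 1.0, "com1": 25.0, "com2": 5.0}


def generator_loss(terms, weights=WEIGHTS):
    """Weighted sum of the individual generator loss terms (Eq. 5)."""
    missing = set(weights) - set(terms)
    if missing:
        raise KeyError(f"missing loss terms: {sorted(missing)}")
    return sum(weights[name] * terms[name] for name in weights)


# Hypothetical per-term values, only to show how the weights scale them.
example = {"rec": 0.5, "adv": 1.2, "feat": 2.0, "com1": 0.01, "com2": 0.02}
print(generator_loss(example))
```

With these made-up values the reconstruction term contributes 22.5 of roughly 26 in total, so the mel-spectrogram loss dominates the objective unless the other terms grow large.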

3. Experiments

- Settings

Dataset : LibriTTS
Comparisons : DAC, EnCodec, SpeechTokenizer, WavTokenizer, FACodec

- Results

Overall, SACodec achieves the best performance among the compared codecs

It also achieves the best MUSHRA score

Semantic Representation Richness
- It also shows strong results on the ARCH benchmark

Ablation Study
- Each component contributes to the performance gain

Using semantic anchoring greatly improves codebook utilization
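
This post does not say how codebook utilization is measured. One common definition, assumed here, is the fraction of codebook entries selected at least once over an evaluation set. The function name and the index list below are hypothetical.

```python
def codebook_utilization(indices, codebook_size):
    """Fraction of codebook entries that appear at least once in indices."""
    if codebook_size <= 0:
        raise ValueError("codebook_size must be positive")
    used = {i for i in indices if 0 <= i < codebook_size}
    return len(used) / codebook_size


# Hypothetical quantization indices for a codebook with 8 entries.
indices = [0, 3, 3, 7, 0, 1]
print(codebook_utilization(indices, 8))  # 0.5
```

A collapsed codebook shows up as a value far below 1.0, since most entries are never chosen.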
