[Paper Review] SwitchCodec: Adaptive Residual-Expert Sparse Quantization for High-Fidelity Neural Audio Coding

feVeRin 2026. 3. 10. 10:45

SwitchCodec: Adaptive Residual-Expert Sparse Quantization for High-Fidelity Neural Audio Coding

Neural codecs are suboptimal because their codebooks are fixed
SwitchCodec
- Builds on Residual Experts Vector Quantization to decouple bitrate from codebook capacity and improve the utilization of each quantizer
- Uses a variable-bitrate mechanism that adjusts the expert quantizers, supporting multi-bitrate operation at inference without re-training
Paper (ICASSP 2026) : Paper Link

1. Introduction

Neural audio codecs are mostly built on the Vector Quantization Variational AutoEncoder (VQ-VAE): they learn a data-driven discrete audio representation and optimize compression and perceptual quality end-to-end
- In particular, SoundStream, DAC, and HiFi-Codec use Residual Vector Quantization (RVQ) for high-fidelity audio compression
  - BUT these neural codecs are limited by their fixed quantization structure
- Mixture-of-Expert Quantization (MoE-VQ) and Adaptive RVQ could relax the rigidity of a fixed quantization structure, but they have problems with order decoupling and data-driven expert allocation

-> The paper therefore proposes SwitchCodec, which addresses the fixed quantization problem of existing neural codecs

SwitchCodec
- Uses Residual Experts Vector Quantization (REVQ) to decouple bitrate from codebook capacity
- Additionally introduces a Variable-Bitrate (VBR) mechanism that adjusts the number of active experts at inference

< Overview of SwitchCodec >

An adaptive neural codec built on REVQ and the VBR mechanism
As a result, it outperforms existing codecs

2. Method

- Encoder-Decoder Architecture

SwitchCodec is based on the hierarchical convolutional backbone of DAC
- The encoder consists of a $7\times 1$ front-end, 4 downsampling blocks, and a $3\times 1$ projection that produces a 1024-dimensional latent
- The decoder mirrors the encoder and reconstructs the waveform through transposed convolutions and a final $7\times 1$ layer with Tanh

- Residual Experts Vector Quantization

Standard RVQ processes the latent representation with a fixed quantizer sequence
- At low bitrates, the limited codebook therefore struggles to represent the diverse latent $Z$, which causes quantization error
  - An adaptive strategy that selects the best-suited quantizers can instead reconstruct accurately, so the paper introduces Residual Experts Vector Quantization (REVQ)
- REVQ augments the shared quantizers with sparsely activated routed quantizers selected by a router, reducing quantization error while preserving the residual hierarchy
  1. To assign routed quantizers to each audio segment, it follows DeepSeek-V3 and introduces a gating network (router) with a bias-free learnable matrix $U^{\top}\in\mathbb{R}^{D\times N_{r}}$ for computing affinity scores
  2. With the transposed encoder output denoted $Z'\in\mathbb{R}^{T\times D}$, the affinity score $S$ and $mask_{i}$ are:
    (Eq. 1) $S=\frac{1}{T}\sum_{t=1}^{T}Z'_{t}U^{\top}$
    (Eq. 2) $mask_{i}=\left\{\begin{matrix} 1, & S_{i}\in\text{TopK}(\{S_{j}\mid 1\leq j\leq N_{r}\}, k_{r}) \\ 0, & \text{otherwise} \\ \end{matrix}\right.$
    - $N_{r}$ : number of routed quantizers, $T$ : number of frames, $D$ : latent dimension, $Z'_{t}$ : $t$-th row of $Z'$
    - $\text{TopK}(S,k)$ : operation that selects the top-$k$ scores
- The resulting $mask_{i}$ selects the routed quantizers in both the encoder and the decoder, so the identity of the $k_{r}$ routed quantizers selected per window must be transmitted
  1. This costs $\left\lceil \log_{2}\binom{N_{r}}{k_{r}}\right \rceil$ bits of overhead per window
  2. For a $W$-s window, the overhead is amortized to $\left\lceil \log_{2}\binom{N_{r}}{k_{r}}\right \rceil/W$ bps
    - e.g.) With $N_{r}=7$ routed quantizers, $k_{r}=2$, and a $2$-s window, $\log_{2}\binom{7}{2}/2\approx 2.2$ bps is added (2.5 bps once the value is rounded up to 5 bits), which is under $0.1\%$ of 2.67 kbps
- REVQ decouples quantizer selection from application order
  1. Specifically, the subset of $k_{r}$ routed quantizers is selected adaptively from the routing scores, but it is applied in a pre-determined sequence
    - That is, the chosen quantizers are applied in the fixed order of their original ascending indices, not in order of selection score
  2. This strict, index-based application ensures that, within the selected group, lower-indexed quantizers model the higher-energy components first
  3. As a result, the routing mechanism learns to map high-energy latents to lower-index quantizers, which improves interpretability
    - It also assigns each routed quantizer a specialized, non-overlapping role, which improves training stability
- Because the mask is non-differentiable in the quantization process, a Straight-Through estimator is applied for backpropagation:
  (Eq. 3) $mask=S+\text{sg}(mask-S)$
  - $\text{sg}$ : stop-gradient operation, so the forward value equals the hard mask while the gradient flows through $S$
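
The routing in Eq. 1 and Eq. 2 and the index-ordered residual pass can be sketched in NumPy as below. This is a minimal illustration, not the paper's implementation: the function names (`route`, `nearest`, `revq`), the codebook size `K`, the number of shared quantizers `N_s`, the nearest-neighbour lookup, and the choice to apply shared quantizers before routed ones are all assumptions made for the example.

```python
import numpy as np

def route(z_t, U, k_r):
    """Pick k_r routed quantizers for one segment.

    z_t: (T, D) transposed encoder output Z'
    U:   (N_r, D) routing matrix, so U.T has shape (D, N_r)
    """
    scores = (z_t @ U.T).mean(axis=0)                # Eq. 1: average affinity over frames
    top = np.argsort(-scores, kind="stable")[:k_r]   # indices of the top-k scores
    mask = np.zeros(U.shape[0], dtype=int)
    mask[top] = 1                                    # Eq. 2: hard 0/1 mask
    selected = np.flatnonzero(mask)                  # ascending index order, not score order
    return mask, selected, scores

def nearest(codebook, x):
    """Replace each row of x (T, D) with its nearest row of codebook (K, D)."""
    dist = ((x[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    idx = dist.argmin(axis=1)
    return codebook[idx], idx

def revq(z_t, shared, routed, U, k_r):
    """Residual quantization: shared codebooks (assumed first), then selected routed ones."""
    mask, selected, _ = route(z_t, U, k_r)
    residual = z_t.copy()
    quantized = np.zeros_like(z_t)
    codes = []
    for codebook in list(shared) + [routed[i] for i in selected]:
        q, idx = nearest(codebook, residual)
        quantized += q
        residual -= q            # the next quantizer models what is left over
        codes.append(idx)
    return quantized, codes, mask

# Toy sizes, chosen only for illustration.
rng = np.random.default_rng(0)
T, D, K, N_s, N_r, k_r = 8, 4, 16, 1, 7, 2
z_t = rng.normal(size=(T, D))
U = rng.normal(size=(N_r, D))
shared = rng.normal(size=(N_s, K, D))
routed = rng.normal(size=(N_r, K, D))
quantized, codes, mask = revq(z_t, shared, routed, U, k_r)
```

Because `selected` comes from `np.flatnonzero(mask)`, the chosen quantizers are always visited in ascending index order, which is the decoupling of selection from application order described above.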

- Variable Bitrate Support

The paper performs variable-bitrate operation by sampling the number of active routed quantizers per input and selecting the top-$k$ by affinity
- That is, it exploits the content-aware nature of the gating network, which computes affinity scores from latent features, so that the selected quantizers match the audio complexity
- SwitchCodec can thus directly control bits per second while preserving the REVQ architecture
  1. The encoder-decoder and codebooks stay unchanged regardless of bitrate
  2. The auxiliary data is the routing mask that identifies the active quantizers
    - This overhead is amortized to under $0.1\%$ of the total bitrate through second-level windowing
- A single model supports a broad bitrate range from 0.89 kbps to 8 kbps by adjusting $k_{r}$ at inference
  - Since the network weights are fixed regardless of the target bitrate, the memory and latency overhead of multi-model VBR schemes is eliminated
- The routing mask transmitted per window uses combinatorial coding: with $k_{r}=2$ of $N_{r}=7$ routed quantizers, there are $\binom{7}{2}=21$ combinations, which are represented with 5 bits
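
The mask overhead above can be checked with a short sketch. The source only states that combinatorial coding is used; indexing the combinations in lexicographic order, and the helper names `mask_bits`, `encode_mask`, and `decode_mask`, are assumptions made for this example.

```python
from itertools import combinations
from math import comb

def mask_bits(n_r, k_r):
    """Bits needed to index one of C(n_r, k_r) masks, i.e. ceil(log2(C(n_r, k_r)))."""
    return (comb(n_r, k_r) - 1).bit_length()

def encode_mask(selected, n_r):
    """Map a set of selected quantizer indices to its position in lexicographic order."""
    table = list(combinations(range(n_r), len(selected)))
    return table.index(tuple(sorted(selected)))

def decode_mask(code, n_r, k_r):
    """Inverse of encode_mask: recover the selected indices from the code."""
    return list(combinations(range(n_r), k_r))[code]

N_r, k_r, W = 7, 2, 2.0               # routed quantizers, active experts, window length in seconds
bits = mask_bits(N_r, k_r)            # 21 combinations fit in 5 bits
overhead_bps = bits / W               # amortized over the window
share = overhead_bps / 2670.0         # fraction of a 2.67 kbps stream

code = encode_mask((1, 4), N_r)
assert decode_mask(code, N_r, k_r) == (1, 4)
```

With these values the sketch gives 5 bits per window and 2.5 bps, which stays under $0.1\%$ of 2.67 kbps.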

3. Experiments

- Settings

Dataset : VCTK, LibriTTS, FMA, CommonVoice
Comparisons : EnCodec, DAC

- Results

Overall, SwitchCodec achieves the best performance

It also reconstructs spectrograms well

Quantizer Analysis
- Using $N_{r}=9$ routed quantizers gives the best performance
