[Paper Review] CodecSlime: Temporal Redundancy Compression of Neural Speech Codec via Dynamic Frame Rate

feVeRin 2026. 3. 19. 12:53

CodecSlime: Temporal Redundancy Compression of Neural Speech Codec via Dynamic Frame Rate

Most neural codecs operate at a fixed frame rate, which creates a temporal mismatch with speech
CodecSlime
- Compresses temporal redundancy in a neural codec using a Schedulable Dynamic Frame Rate
- Introduces Melt-and-Cool training to improve adaptation
Paper (ICASSP 2026) : Paper Link

1. Introduction

A neural speech codec aims to reconstruct the speech signal at the best possible quality while using the lowest achievable frame rate
- BUT, existing neural codecs operate at a Fixed Frame Rate (FFR), which mismatches the inherently non-uniform temporal information density of speech
- A Dynamic Frame Rate (DFR) approach can address this, but existing DFR methods struggle to produce general-purpose acoustic tokens and need complex training/inference pipelines

-> The paper therefore proposes CodecSlime, which extends an FFR codec backbone into a DFR codec in a plugin style

CodecSlime
- For a low frame rate, introduces Schedulable Dynamic Frame Rate (ScheDFR), which adaptively selects the downsampling scheme by optimizing a feature-space distortion metric
- For a backbone-agnostic, general-purpose codec, uses Melt-and-Cool training

< Overview of CodecSlime >

A low frame rate, high-quality DFR codec built on ScheDFR and Melt-and-Cool training
As a result, it achieves better performance than existing codecs

2. Method

CodecSlime supports high-quality reconstruction at a low frame rate
- Schedulable Dynamic Frame Rate (ScheDFR) aggregates temporally similar features at inference time, enabling low-loss compression
- Melt-and-Cool training adapts the FFR backbone model through a 2-stage process

- Preliminary

Architecture
- The backbone model uses a VQ-GAN architecture composed of an encoder, quantizer, decoder, and discriminator
  1. The encoder $f_{E}(\cdot)$ consists of CNN and LSTM layers, and the decoder $f_{D}(\cdot)$ mirrors the encoder
  2. The discriminators follow DAC and HiFi-GAN, using MPD and MS-STFT discriminators
- Additionally, to verify the backbone-agnostic property, the following 2 quantizers are considered:
  1. Vector Quantizer (VQ) maps a continuous feature $h\in \mathbb{R}^{d_{h}}$ to the nearest codebook entry
  2. Finite Scalar Quantizer (FSQ) quantizes each element of the vector independently
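The difference between the two quantizers can be illustrated with a minimal pure-Python sketch. This is not the paper's implementation: the function names, the `tanh` bounding, and the restriction to odd level counts are assumptions made only for illustration.

```python
import math

def vq_quantize(h, codebook):
    """Return (index, entry) of the codebook vector nearest to h (squared L2)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    idx = min(range(len(codebook)), key=lambda k: sq_dist(h, codebook[k]))
    return idx, codebook[idx]

def fsq_quantize(h, levels):
    """Quantize each element independently (assumed scheme, odd level counts only).

    Each element is squashed into (-1, 1) with tanh and rounded to one of
    levels[i] uniformly spaced points in [-1, 1].
    """
    out = []
    for x, n in zip(h, levels):
        half = (n - 1) / 2          # e.g. n = 5 -> grid {-2, -1, 0, 1, 2} / 2
        out.append(round(math.tanh(x) * half) / half)
    return out

# VQ picks one entry for the whole vector
idx, entry = vq_quantize([0.9, 0.8], [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]])
# FSQ snaps every dimension to its own small grid, no learned codebook needed
codes = fsq_quantize([0.1, 2.0, -0.4], levels=[5, 5, 3])
```

VQ searches a learned codebook for the whole vector, while FSQ needs no codebook lookup because each dimension is rounded on its own fixed grid.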
Training Objective
- The training objective consists of a reconstruction loss and a GAN loss
- The reconstruction loss is a multi-scale mel-spectrogram loss that computes the $L1$ distance at multiple resolution scales
- The GAN loss uses a least-squares GAN objective together with an $L1$ feature matching loss
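For reference, these losses are commonly written as follows. This is the standard textbook form (as used in HiFi-GAN-style training), not copied from the paper: the symbols $\mathrm{Mel}_{k}$ (mel-spectrogram at scale $k$), $D$ (a discriminator), and $D^{(l)}$ (its $l$-th layer feature map) are assumed notation, and loss weights and normalization constants are omitted.

```latex
\mathcal{L}_{rec} = \sum_{k} \left\| \mathrm{Mel}_{k}(\mathbf{x}) - \mathrm{Mel}_{k}(\hat{\mathbf{x}}) \right\|_{1}

\mathcal{L}_{D} = \mathbb{E}\left[ (D(\mathbf{x}) - 1)^{2} \right] + \mathbb{E}\left[ D(\hat{\mathbf{x}})^{2} \right]

\mathcal{L}_{G} = \mathbb{E}\left[ (D(\hat{\mathbf{x}}) - 1)^{2} \right]

\mathcal{L}_{FM} = \sum_{l} \left\| D^{(l)}(\mathbf{x}) - D^{(l)}(\hat{\mathbf{x}}) \right\|_{1}
```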

- Schedulable Dynamic Frame Rate (ScheDFR)

Motivation
- Speech is a sequence of phonemes with different durations and paralinguistic features, so its information density is non-uniform over time
  - BUT, existing FFR codecs allocate many frames to regions with little temporal variation, such as silence and sustained vowels
- The paper therefore introduces Schedulable Dynamic Frame Rate (ScheDFR), which compresses an FFR speech feature sequence to a lower target frame rate while preserving reconstruction quality
Problem Formalization
- ScheDFR inserts a schedulable downsampling module between the encoder and the quantizer
- The module takes the encoder output $\mathbf{h}=f_{E}(\mathbf{x})\in\mathbb{R}^{T\times d_{h}}$ and the target downsampling ratio $R_{S}$ as input, and outputs a segmentation scheme $\mathbf{s}^{*}=(s_{1},...,s_{T'})$
  - $T'=\lceil T/R_{S}\rceil$
- Downsampling
  1. Suppose a segment length sequence $\mathbf{s}=(s_{1},...,s_{T'})$ is given with $\sum_{i=1}^{T'}s_{i}=T$, $1\leq s_{i}\leq U$, where $U$ is the maximum segment length
    - The start indices are $\sigma_{1}=1$ and $\sigma_{i+1}=\sigma_{i}+s_{i}$
  2. Then the frame-averaged downsampling function $f_{down}:\mathbb{R}^{T\times d_{h}}\times \mathbb{N}^{T'}\rightarrow \mathbb{R}^{T\times d_{h}}$ is:
    (Eq. 1) $\forall i\in [1,T'],\forall t\in[\sigma_{i},\sigma_{i}+s_{i}-1],\,\,\, h'_{t}=\frac{1}{s_{i}}\sum_{j=\sigma_{i}}^{\sigma_{i}+s_{i}-1}h_{j}$
  3. The output $\mathbf{h}'=f_{down}(\mathbf{h},\mathbf{s})$ has the same temporal length $T$
    - This is equivalent to compressing to $T'$ frames and then upsampling back
  4. To preserve duration, each merged frame additionally stores $\lceil \log_{2}U\rceil$ bits, which decouples content from duration
- Scheduling
  1. Let $\mathcal{S}=\{\mathbf{s}\mid\sum_{i=1}^{T'}s_{i}=T,1\leq s_{i}\leq U\}$ be the set of feasible segmentations
  2. The optimal segmentation is:
    (Eq. 2) $\mathbf{s}^{*}=\arg\max_{\mathbf{s}\in\mathcal{S}}\mathcal{J}(\hat{\mathbf{x}}',\mathbf{x}), \,\,\, \text{where}\,\,\hat{\mathbf{x}}'=f_{D}(\mathbf{h}')$
    - $\mathcal{J}$ : evaluates the reconstruction quality
  3. Directly optimizing a reconstruction quality metric is non-differentiable or intractable, so the paper uses a surrogate objective instead
- Surrogate Objective
  1. The surrogate objective $\mathcal{J}_{h}(\mathbf{h},\mathbf{s})=-\sum_{t=1}^{T}|| h_{t}-h'_{t}||_{2}$ measures the quality of a segmentation $\mathbf{s}$ as the negative $L2$ distance between the original and downsampled features
  2. The optimal segmentation is then obtained as $\mathbf{s}^{*}=\arg\max_{\mathbf{s}\in\mathcal{S}}\mathcal{J}_{h}(\mathbf{h},\mathbf{s})$
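The downsampling function of (Eq. 1) and the surrogate objective $\mathcal{J}_{h}$ can be sketched in a few lines of pure Python. The function names are hypothetical, features are plain lists of floats, and indices are 0-based here (the formulas above are 1-based); this is an illustrative sketch, not the paper's code.

```python
import math

def frame_average_downsample(h, s):
    """Eq. 1: replace every frame of a segment by the segment mean.

    h: list of T feature vectors, s: segment lengths with sum(s) == T.
    The output keeps length T (compress to T' frames, then repeat each one).
    """
    assert sum(s) == len(h), "segment lengths must sum to T"
    out, start = [], 0
    for seg_len in s:
        seg = h[start:start + seg_len]
        mean = [sum(col) / seg_len for col in zip(*seg)]
        out.extend(list(mean) for _ in range(seg_len))
        start += seg_len
    return out

def surrogate_objective(h, s):
    """J_h(h, s): negative sum of per-frame L2 distances between h and h'."""
    h_down = frame_average_downsample(h, s)
    return -sum(math.dist(a, b) for a, b in zip(h, h_down))

def duration_bits(U):
    """Extra bits per merged frame needed to store its duration (1..U)."""
    return math.ceil(math.log2(U))

h = [[1.0, 0.0], [3.0, 2.0], [5.0, 5.0]]
print(frame_average_downsample(h, [2, 1]))  # [[2.0, 1.0], [2.0, 1.0], [5.0, 5.0]]

x = [[0.0], [0.0], [4.0], [4.0]]
# Merging identical frames costs nothing; merging across the jump is penalized
print(surrogate_objective(x, [2, 2]) > surrogate_objective(x, [1, 3]))  # True
```

In the toy sequence `x`, the segmentation `[2, 2]` merges only identical frames and therefore scores higher than `[1, 3]`, which is the behavior the scheduler exploits on silence and sustained vowels.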
DP-based Downsample Scheduler
- The optimization problem for the surrogate objective $\mathcal{J}_{h}$ is solved with Dynamic Programming (DP)
- First, let $d[j,i]$ be the maximum objective value when the first $j$ frames are downsampled into $i$ frames, and let $L(j,s)=\sum_{t=j-s+1}^{j}\left|\left|h_{t}-\frac{1}{s}\sum_{k=j-s+1}^{j}h_{k}\right|\right|_{2}$ be the distortion of a length-$s$ segment ending at frame $j$
  1. Then the DP maximizes $\mathcal{J}_{h}$ following (Eq. 3):
    (Eq. 3) $ d[0,0]=0, \,\,\,\,\,d[j,i]=\max_{1\leq s\leq U}\left\{ d[j-s, i-1]-L(j,s)\right\}$
    - $L$ can be pre-computed, and the optimum is given by $d[T,T']$
  2. As a result, ScheDFR uses the downsample scheduler to adaptively identify and compress the temporal redundancy of the speech signal, supporting high-quality reconstruction at a low frame rate
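The recurrence in (Eq. 3) can be sketched as a small pure-Python routine. The names `segment_cost` and `schedule` are hypothetical, the backtracking table is an addition for recovering $\mathbf{s}^{*}$, and unreachable states are set to $-\infty$ (an assumption the recurrence needs when $j-s<0$). For brevity $L$ is recomputed on the fly rather than pre-computed.

```python
import math

def segment_cost(h, start, end):
    """L for frames h[start:end]: sum of L2 distances to the segment mean."""
    seg = h[start:end]
    mean = [sum(col) / len(seg) for col in zip(*seg)]
    return sum(math.dist(frame, mean) for frame in seg)

def schedule(h, T_prime, U):
    """Maximize J_h over segmentations of len(h) frames into T_prime segments
    of length 1..U. Returns (optimal objective, list of segment lengths)."""
    T = len(h)
    NEG = float("-inf")
    # d[j][i]: best objective when the first j frames form i segments
    d = [[NEG] * (T_prime + 1) for _ in range(T + 1)]
    back = [[0] * (T_prime + 1) for _ in range(T + 1)]
    d[0][0] = 0.0
    for i in range(1, T_prime + 1):
        for j in range(i, T + 1):
            for s in range(1, min(U, j) + 1):
                prev = d[j - s][i - 1]
                if prev == NEG:
                    continue  # the first j - s frames cannot form i - 1 segments
                val = prev - segment_cost(h, j - s, j)
                if val > d[j][i]:
                    d[j][i] = val
                    back[j][i] = s
    if d[T][T_prime] == NEG:
        raise ValueError("infeasible: need T' <= T <= T' * U")
    # Walk the backpointers from d[T][T'] to recover the segment lengths
    lengths, j = [], T
    for i in range(T_prime, 0, -1):
        lengths.append(back[j][i])
        j -= back[j][i]
    return d[T][T_prime], lengths[::-1]

h = [[0.0], [0.0], [4.0], [4.0], [9.0]]
best, s_star = schedule(h, T_prime=3, U=2)
print(s_star)  # [2, 2, 1]
```

The table has $(T+1)(T'+1)$ entries and each entry tries at most $U$ segment lengths, so with pre-computed $L$ the search costs $O(T\cdot T'\cdot U)$ lookups.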

- Melt-and-Cool Training Recipe

Motivation
- Applying ScheDFR directly to an FFR backbone can result in a high WER
  - This is because the backbone was never trained to encode and decode merged frames
- The paper therefore introduces Melt-and-Cool, a 2-stage training recipe
  1. Melt post-trains the model with random downsampling so that it is exposed to diverse downsampled inputs
  2. Cool fine-tunes the model with the optimal ScheDFR scheme for a specific $R_{S}, U$
- Both the Melt and Cool stages use fixed-length, randomly cropped speech segments
Post-Training (Melt)
- First, the encoded features of the pre-trained FFR model are randomly downsampled
- A melt manager controls the proportion of segments of each length, increasing the downsampling strength over time
  - This yields a DFR foundation model that is robust to many different downsampling schemes
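One simple way to draw a random downsampling scheme for the Melt stage is sketched below. The post does not specify how the melt manager samples segments, so this procedure, the function name `random_segmentation`, and the idea of lowering `T_prime` as training progresses are all assumptions for illustration only.

```python
import random

def random_segmentation(T, T_prime, U, rng=random):
    """Sample segment lengths s with len(s) == T_prime, sum(s) == T, 1 <= s_i <= U.

    Hypothetical sampler: start from all-ones and hand out the remaining
    T - T_prime frames one at a time to segments that are still shorter than U.
    """
    if not (T_prime <= T <= T_prime * U):
        raise ValueError("infeasible: need T' <= T <= T' * U")
    s = [1] * T_prime
    for _ in range(T - T_prime):
        growable = [k for k, length in enumerate(s) if length < U]
        s[rng.choice(growable)] += 1
    return s

# A smaller T_prime means stronger downsampling; a melt schedule could
# decrease it gradually during post-training (assumed, not from the paper).
rng = random.Random(0)
mild = random_segmentation(T=12, T_prime=10, U=4, rng=rng)
strong = random_segmentation(T=12, T_prime=4, U=4, rng=rng)
```

Each sampled scheme can be fed to the same frame-averaging step used by ScheDFR, so the backbone sees merged frames of many different lengths before the scheduler is switched on.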
Fine-Tuning (Cool)
- The foundation model is fine-tuned under ScheDFR for the target $R_{S}, U$
- The DP-based scheduler runs inside the forward pass to obtain the optimal scheme for each training segment
  - The encoder is frozen and only the decoder is updated
- As a result, the fine-tuned DFR model is tailored to ScheDFR while retaining the robustness gained in Melt

3. Experiments

- Settings

Dataset : LibriSpeech
Comparisons : EnCodec, VARSTok, SNAC, BigCodec, LLM-Codec, TFC-Fine

- Results

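Overall, CodecSlime achieves the best performance among the compared codecs

It also performs well in terms of WER and PESQ

Ablation Study
- Both ScheDFR and Melt-and-Cool contribute to the performance improvement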
