[Paper Review] VoCodec: An Efficient Lightweight Low-Bitrate Speech Codec

feVeRin 2026. 5. 18. 12:54

VoCodec: An Efficient Lightweight Low-Bitrate Speech Codec

A neural codec with low complexity and low latency is needed
VoCodec
- Uses the Vocos vocoder as the backbone to reduce complexity
- Cascades a lightweight neural network at the front end to add speech enhancement capability
Paper (ICASSP 2026) : Paper Link

1. Introduction

A neural codec consists of encoder, decoder, and quantizer modules
- The encoder compresses speech into a latent representation, the decoder reconstructs the waveform from the quantized vectors, and the quantizer is trained end-to-end together with the encoder and decoder
- In particular, neural codecs use discrete tokens for compression and reconstruction
  - Structurally, they build on the VQ-GAN architecture and introduce a discriminator to improve perceptual quality
- BUT, existing neural codecs are limited for real-time communication because of their high complexity and non-causality

-> Hence VoCodec, a neural codec with low computational complexity, is proposed

VoCodec
- Operates the speech codec directly in the time-frequency domain, based on the Vocos architecture
- Cascades a lightweight speech enhancement model at the front end to perform dereverberation

< Overview of VoCodec >

A low complexity, low latency neural codec based on the Vocos architecture
As a result, it outperforms the existing baseline

2. Method

- Generator

VoCodec operates in the time-frequency domain and adopts Vocos as the encoder-decoder backbone
- Reason: speech has a highly pronounced harmonic structure in the time-frequency domain, and STFT/iSTFT perform downsampling/upsampling in a single step
- Given a speech signal $x\in\mathbb{R}^{L}$, the STFT first transforms it into a complex spectrum $X\in \mathbb{C}^{F\times T}$ over the frequency and frame axes $F, T$
  1. The logarithmic magnitude and phase of the complex spectrum $X$ are then extracted and concatenated along the frequency axis to form the input feature $Z_{in}$:
    (Eq. 1) $Z_{in}=\text{Concat}\left(\log (|X|), \text{angle}\left(X_{i},X_{r}\right)\right)\in \mathbb{R}^{2F\times T}$
    - $X_{r},X_{i}$ : real/imaginary part, $|\cdot|$ : norm (modulus) of a complex value, $\text{Concat}(\cdot)$ : concatenation operation
  2. To reduce complexity, $Z_{in}$ is projected into a low-dimensional space by a fully-connected layer
- Following WavTokenizer, the encoder consists of $M$ stacked ConvNeXt blocks and an attention module
  1. Each ConvNeXt block applies a depthwise convolution followed by an inverted bottleneck, in which two pointwise convolutions project the features to a higher dimensionality and then back
  2. To improve the sequence modeling of VoCodec, the attention module incorporates $N$ basic ResNet blocks and adds a self-attention block
- The quantizer is a Residual Vector Quantizer (RVQ)
  - In particular, factorized codes and $L2$-normalization are applied, following DAC
- The decoder mirrors the encoder; to reduce computational complexity, it removes the inverted bottleneck design of ConvNeXt and applies group convolution in the ResNet blocks
  - The network generates complex spectral coefficients, and the speech is reconstructed through the iSTFT
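
The input feature of Eq. 1 and the subsequent linear projection can be sketched in NumPy as follows. This is only a minimal illustration, not the paper's implementation: the naive `stft` helper, `n_fft=64`, `hop=16`, the `eps` inside the logarithm, the projection width `d_model=16`, and the random weight `W` are all assumptions chosen for the example.

```python
import numpy as np

def stft(x, n_fft=64, hop=16):
    """Naive un-padded STFT; returns a complex spectrum of shape (F, T), F = n_fft // 2 + 1."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack(
        [x[t * hop : t * hop + n_fft] * window for t in range(n_frames)], axis=1
    )
    return np.fft.rfft(frames, axis=0)

def input_features(X, eps=1e-5):
    """Eq. 1: concatenate log-magnitude and phase along the frequency axis."""
    log_mag = np.log(np.abs(X) + eps)      # eps is an added assumption to avoid log(0)
    phase = np.arctan2(X.imag, X.real)     # angle(X_i, X_r)
    return np.concatenate([log_mag, phase], axis=0)   # shape (2F, T)

rng = np.random.default_rng(0)
x = rng.standard_normal(1024)              # stand-in for a speech signal
X = stft(x)
Z_in = input_features(X)

# Hypothetical fully-connected projection to a low-dimensional space.
d_model = 16
W = rng.standard_normal((d_model, Z_in.shape[0])) / np.sqrt(Z_in.shape[0])
H = W @ Z_in                               # shape (d_model, T)
print(X.shape, Z_in.shape, H.shape)
```

With these toy settings ($F=33$, $T=61$), $Z_{in}$ has shape $(66, 61)$. Since the magnitude and phase together determine $X$, the feature loses no information before the projection.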

- Discriminator

Since VoCodec operates in the time-frequency domain, a multi-scale STFT discriminator can be applied
- The window lengths are $[128, 256, 512, 1024, 2048]$, and the hop size is fixed to $\texttt{window length}/4$
- Multi-scale discriminators, multi-period discriminators, etc. are not used
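
As a quick check of the resolutions above, the hop size of each discriminator scale can be derived from its window length. The sketch below assumes the FFT size equals the window length and that the STFT is un-padded; the 16000-sample input length is a hypothetical value used only to show how the frame count changes per scale.

```python
# Window lengths stated in the text; the hop size is window_length / 4.
WINDOW_LENGTHS = [128, 256, 512, 1024, 2048]

def stft_resolutions(window_lengths):
    """Return (window_length, hop_size, n_freq_bins) for each discriminator scale.

    n_freq_bins assumes n_fft == window_length (an assumption, not from the paper).
    """
    return [(w, w // 4, w // 2 + 1) for w in window_lengths]

def num_frames(n_samples, window_length, hop_size):
    """Frame count of an un-padded STFT (the padding policy is an assumption)."""
    return 1 + (n_samples - window_length) // hop_size

resolutions = stft_resolutions(WINDOW_LENGTHS)
for w, hop, n_bins in resolutions:
    print(w, hop, n_bins, num_frames(16000, w, hop))
```

Short windows give fine time resolution with few frequency bins, and long windows give the opposite, which is why several scales are combined.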

- Combined Enhancement and Compression

A lightweight speech enhancement model based on time-frequency domain masking can be integrated into the front end of the codec to reduce noise interference and reverberation
- For this, the paper cascades the UL-UNAS model with VoCodec
  - The speech signal $x$ first passes through UL-UNAS to produce the enhanced spectrum $X_{enh}$, which is then preprocessed and passed to VoCodec
- The two models are first trained independently; the UL-UNAS parameters are then frozen and VoCodec is fine-tuned
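
The data flow of the cascade can be sketched as below. Note that `enhance` is only a toy stand-in for UL-UNAS (a crude median-based mask, not the actual network), `preprocess` assumes the same log-magnitude/phase feature as Eq. 1, and the `codec` argument is a hypothetical callable; the point is only that a real-valued time-frequency mask rescales the magnitude and leaves the phase untouched before the codec sees the spectrum.

```python
import numpy as np

def enhance(X, floor=0.1):
    """Toy stand-in for UL-UNAS: apply a real-valued T-F mask in [floor, 1]."""
    mag = np.abs(X)
    noise_est = np.median(mag, axis=1, keepdims=True)   # crude per-bin noise estimate
    mask = np.clip(1.0 - noise_est / (mag + 1e-8), floor, 1.0)
    return mask * X                                     # X_enh

def preprocess(X_enh, eps=1e-5):
    """Convert the enhanced spectrum into the codec input (log-magnitude + phase)."""
    return np.concatenate([np.log(np.abs(X_enh) + eps), np.angle(X_enh)], axis=0)

def cascade(X, codec):
    """Enhancement front end followed by the codec."""
    return codec(preprocess(enhance(X)))

rng = np.random.default_rng(0)
X = rng.standard_normal((33, 20)) + 1j * rng.standard_normal((33, 20))
Z = cascade(X, codec=lambda z: z)   # identity "codec", used only to check shapes
print(Z.shape)
```

In the fine-tuning stage described above, only the codec would receive gradient updates while the enhancement front end stays fixed.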

- Loss Functions

First, the loss function for UL-UNAS training:
- Consists of the negative Scale-Invariant SNR (SI-SNR) loss $\mathcal{L}_{SI\text{-}SNR}$ and the power-compressed spectrum losses $\mathcal{L}_{mag}$, $\mathcal{L}_{real/imag}$:
  (Eq. 2) $\mathcal{L}_{SI\text{-}SNR}(\hat{x},x)=-\log_{10}\left( \frac{||\hat{x}_{t}||_{2}^{2}}{||\hat{x}-\hat{x}_{t}||_{2}^{2}}\right); \,\,\, \hat{x}_{t}=\frac{\langle \hat{x},x\rangle x}{||x||_{2}^{2}}$
  (Eq. 3) $\mathcal{L}_{mag}(\hat{X},X)=\left|\left| |\hat{X}|^{0.3}-|X|^{0.3}\right|\right|_{2}^{2}$
  (Eq. 4) $\mathcal{L}_{real/imag}(\hat{X},X)=\left|\left| \frac{\hat{X}_{r/i}}{|\hat{X}|^{0.7}} -\frac{X_{r/i}}{|X|^{0.7}}\right|\right|_{2}^{2}$
  - $x,\hat{x}$ : clean/enhanced speech, $X,\hat{X}$ : clean/enhanced spectrogram
  - $r, i$ : real/imaginary part of the spectrogram, $\langle\cdot,\cdot\rangle$ : inner product operator
- The overall loss function $\mathcal{L}_{sc}$ is then:
  (Eq. 5) $\mathcal{L}_{sc}=\lambda_{1}\mathcal{L}_{SI\text{-}SNR}(\hat{x},x)+\lambda_{2}\mathcal{L}_{mag}(\hat{X},X)+\lambda_{3}\left(\mathcal{L}_{real}(\hat{X},X)+\mathcal{L}_{imag}(\hat{X},X)\right)$
  - $\lambda_{1},\lambda_{2},\lambda_{3}$ : weight
- In VoCodec, the generator loss $\mathcal{L}_{generator}$ is composed as follows
  1. Multi-scale mel-spectrogram loss as the reconstruction loss $\mathcal{L}_{rec}$:
    (Eq. 6) $\mathcal{L}_{rec}=\left|\left| \log\left( \mathcal{M}(x)\right)-\log \left(\mathcal{M}(\hat{x})\right)\right|\right|_{1}$
    - $x,\hat{x}$ : target, reconstructed speech, $\mathcal{M}(\cdot)$ : mel-spectrogram transform
  2. Adversarial loss $\mathcal{L}_{g}$:
    (Eq. 7) $\mathcal{L}_{g}=||1-D(\hat{x})||_{2}^{2}$
    - $D(\cdot)$ : discriminator output
  3. Feature matching loss $\mathcal{L}_{feat}$:
    (Eq. 8) $\mathcal{L}_{feat}=2\sum_{l}\left|\left|D^{l}(x)-D^{l}(\hat{x})\right|\right|_{1}$
    - $D^{l}(\cdot)$ : feature map of the $l$-th discriminator layer
  4. Finally, the generator loss $\mathcal{L}_{generator}$, including the codebook loss $\mathcal{L}_{code}$ and the commitment loss $\mathcal{L}_{c}$, is:
    (Eq. 9) $\mathcal{L}_{generator}=\lambda_{rec}\mathcal{L}_{rec}+\lambda_{g}\mathcal{L}_{g} +\lambda_{feat}\mathcal{L}_{feat} + \lambda_{code}\underset{\mathcal{L}_{code}}{\underbrace{\left|\left| \text{sg}[\mathbf{z}_{e}]-\mathbf{e}_{k}\right|\right|_{2}^{2}}}+\lambda_{c}\underset{\mathcal{L}_{c}}{\underbrace{\left|\left| \mathbf{z}_{e}-\text{sg}[\mathbf{e}_{k}]\right|\right|_{2}^{2}}}$
    - $\text{sg}[\cdot]$ : stop-gradient operation, $\mathbf{z}_{e}$ : encoder output, $\mathbf{e}_{k}$ : codebook vector
    - $\lambda_{rec},\lambda_{g},\lambda_{feat}, \lambda_{code},\lambda_{c}$ : weight
- The discriminator is trained separately with the adversarial loss $\mathcal{L}_{d}$:
  (Eq. 10) $\mathcal{L}_{d}=||1-D(x)||_{2}^{2}+||D(\hat{x})||_{2}^{2}$
- During training, the mel-spectrograms are computed with multiple window lengths $[32, 64, 128, 256, 512, 1024, 2048]$, and the hop size is fixed to $\texttt{window length}/4$
  - The mel bin sizes are $[5, 10, 20, 40, 80, 160, 320]$
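
A few of the terms above are simple enough to write down directly. The following NumPy sketch covers Eq. 2, Eq. 3, Eq. 7, and Eq. 10, plus the mel-spectrogram resolutions from the text. It is an illustration under assumptions: the `eps` terms are added for numerical safety, the discriminator outputs are passed in as plain arrays, and the random test signals are hypothetical.

```python
import numpy as np

def si_snr_loss(x_hat, x, eps=1e-8):
    """Eq. 2: negative log10 ratio between the target projection and the residual."""
    x_t = (np.dot(x_hat, x) / (np.dot(x, x) + eps)) * x
    return -np.log10(np.sum(x_t ** 2) / (np.sum((x_hat - x_t) ** 2) + eps))

def mag_loss(X_hat, X):
    """Eq. 3: squared L2 distance between 0.3-power-compressed magnitudes."""
    return np.sum((np.abs(X_hat) ** 0.3 - np.abs(X) ** 0.3) ** 2)

def generator_adv_loss(d_fake):
    """Eq. 7: least-squares generator loss ||1 - D(x_hat)||^2."""
    return np.sum((1.0 - d_fake) ** 2)

def discriminator_loss(d_real, d_fake):
    """Eq. 10: ||1 - D(x)||^2 + ||D(x_hat)||^2."""
    return np.sum((1.0 - d_real) ** 2) + np.sum(d_fake ** 2)

# Mel-spectrogram resolutions from the text: (window length, hop size, mel bins).
MEL_WINDOWS = [32, 64, 128, 256, 512, 1024, 2048]
MEL_BINS = [5, 10, 20, 40, 80, 160, 320]
mel_config = [(w, w // 4, m) for w, m in zip(MEL_WINDOWS, MEL_BINS)]

rng = np.random.default_rng(0)
x = rng.standard_normal(1000)              # stand-in for clean speech
noise = rng.standard_normal(1000)
loss_mild = si_snr_loss(x + 0.1 * noise, x)
loss_heavy = si_snr_loss(x + 1.0 * noise, x)
print(loss_mild, loss_heavy)
```

The SI-SNR loss decreases as the estimate gets closer to the target and is unchanged when the estimate is rescaled, which is the scale-invariance the name refers to. In `mel_config`, the number of mel bins grows in proportion to the window length (window length / 6.4 at every scale).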

3. Experiments

- Settings

Dataset : LRAC
Comparisons : WavTokenizer

- Results

Overall, VoCodec performs best among the compared models

It also performs well in the subjective evaluation
