[Paper Review] DualCodec: A Low-Frame-Rate, Semantically-Enhanced Neural Audio Codec for Speech Generation

feVeRin 2025. 7. 5. 07:35

DualCodec: A Low-Frame-Rate, Semantically-Enhanced Neural Audio Codec for Speech Generation

Neural audio codecs face a trade-off between frame rate and audio quality
DualCodec
- Integrates Self-Supervised Learning (SSL) representations with waveform representations
- Enhances the semantic information of the first-layer codec and operates at a low frame rate
Paper (INTERSPEECH 2025) : Paper Link

1. Introduction

Neural audio codecs aim to compress an audio signal into a series of discrete codes
- In Large Language Model (LLM)-based Text-to-Speech (TTS) systems such as VALL-E in particular, a neural codec such as EnCodec performs the discretization through Residual Vector Quantization (RVQ)
  - BUT, these LLM-based TTS systems suffer from problems such as inaccurate speech content and slow inference speed
- A practical speech generation-oriented neural codec should therefore satisfy the following:
  1. Semantic Enhancement : It should be able to exploit Self-Supervised Learning (SSL) features that carry rich pronunciation and semantic information, as SpeechTokenizer does
  2. Low Frame Rate : It should have a low frame rate to reduce TTS inference time
  3. Audio Quality : It should guarantee high reconstruction quality even at low bitrates
- BUT, as shown in the comparison table, existing codec models are either not semantically rich or operate only at high bitrates

-> Hence the paper proposes DualCodec, a semantically rich neural audio codec that operates at a low frame rate

DualCodec
- Unifies SSL and waveform representations in a dual encoding framework
- In particular, the framework semantically enhances the first-layer token

< Overview of DualCodec >

A neural audio codec that combines SSL and waveform representations
As a result, it outperforms existing codecs

2. Method

DualCodec consists of an SSL encoding stream and a waveform encoding stream
- The two streams work as follows:
  1. The SSL encoding stream directly encodes SSL features, so that the first-layer codec token captures semantically rich information
  2. The waveform encoding stream uses the DAC framework to encode/decode high-quality audio
    - The paper applies downsampling to both streams to reach a low frame rate, and optimizes the two streams jointly
- DualCodec then uses the two encoding streams to obtain a semantically rich RVQ-1 token, while the remaining layers (RVQ-rest) focus on the remaining acoustic aspects of the waveform feature
  - This disentanglement is obtained by subtracting the RVQ-1 feature from the waveform feature
- Finally, the RVQ-1 feature is summed back with the RVQ-rest codebook vectors to decode the audio
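
The subtract / quantize / re-sum flow described above can be sketched for a single frame as follows. This is a minimal illustration in plain Python, not the paper's implementation: the function names (`nearest`, `encode_frame`), the toy codebooks, and the use of plain squared Euclidean distance for the RVQ-rest lookup are all assumptions made for the example.

```python
from typing import List, Sequence, Tuple


def nearest(vec: Sequence[float], codebook: Sequence[Sequence[float]]) -> int:
    """Return the index of the codebook vector closest to `vec` (squared L2)."""
    def dist(code: Sequence[float]) -> float:
        return sum((a - b) ** 2 for a, b in zip(vec, code))
    return min(range(len(codebook)), key=lambda k: dist(codebook[k]))


def encode_frame(
    wave_feat: Sequence[float],
    rvq1_feat: Sequence[float],
    rest_codebooks: Sequence[Sequence[Sequence[float]]],
) -> Tuple[List[int], List[float]]:
    """Quantize one frame with RVQ-rest and rebuild the decoder input.

    Hypothetical helper: returns (RVQ-rest tokens, decoder input feature).
    """
    # Disentanglement: remove what the RVQ-1 feature already explains.
    residual = [w - s for w, s in zip(wave_feat, rvq1_feat)]
    # The re-sum starts from the RVQ-1 feature.
    decoder_input = list(rvq1_feat)
    tokens: List[int] = []
    for codebook in rest_codebooks:
        k = nearest(residual, codebook)
        tokens.append(k)
        chosen = codebook[k]
        # Add the selected codebook vector back onto the running sum ...
        decoder_input = [d + c for d, c in zip(decoder_input, chosen)]
        # ... and pass the remaining error on to the next layer.
        residual = [r - c for r, c in zip(residual, chosen)]
    return tokens, decoder_input


# Toy example with two RVQ-rest layers and 2-dimensional features.
tokens, decoder_input = encode_frame(
    wave_feat=[1.0, 2.0],
    rvq1_feat=[0.5, 0.5],
    rest_codebooks=[
        [[0.0, 0.0], [0.5, 1.0]],
        [[0.0, 0.5], [1.0, 1.0]],
    ],
)
print(tokens, decoder_input)
```

In this toy case the two layers happen to cancel the residual exactly, so the decoder input equals the original waveform feature; with real codebooks a quantization error remains.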

- SSL Encoding

The SSL encoding stream consists of a pre-trained SSL model, a ResNet encoder, a downsampler, a Vector Quantization (VQ) module, and a ResNet decoder
- This architecture follows RepCodec, which applies VQ-VAE to SSL feature discretization
- SSL Model
  1. The paper uses the normalized 16th-layer feature of W2V-BERT, which contains rich pronunciation and word-meaning information
  2. This model extracts 50Hz features from 16kHz waveforms with a 600M-parameter Transformer
    - The SSL model is frozen during both training and inference
- Downsampler
  1. The paper uses simple 1D average pooling to downsample the 50Hz features to the codec frame rate
    - $\text{kernel_size}=\text{stride_size}=\text{downsampling_factor}$
  2. $\text{downsampling_factor}=2$ for the 25Hz target frame rate, and $4$ for 12.5Hz
- ResNet Encoder and Decoder
  1. The ResNet encoder and decoder process the SSL features before and after the VQ module, respectively
    - This lets the VQ token capture complex semantic patterns
  2. Structurally, each network uses stacked ConvNeXt blocks and has 13M parameters
    - These ResNet modules contain no down-/up-sampling operations
- VQ Module
  1. The VQ module discretizes the ResNet encoder output $\mathbf{Z}_{ssl}\in\mathbb{R}^{H\times T}$ into a 1D token sequence $\text{RVQ}_{1}\in \mathbb{Z}^{1\times T}$
    - $H$ : hidden dimension, $T$ : feature length
  2. $\text{RVQ}_{1}$ is computed by finding the codebook vector closest to the projected input:
    (Eq. 1) $\text{RVQ}_{1}=\arg\min_{k}\left|\left| \ell_{2}\left(W_{in}\mathbf{Z}_{ssl}\right)-\ell_{2} (e_{k})\right|\right|_{2}$
    - $W_{in}\in \mathbb{R}^{D\times H}$ : input projection matrix with $D=8, H=1024$
    - $\ell_{2}$ : $L2$ normalization, $e_{1},e_{2},...,e_{K}$ : codebook vectors, $e_{k}\in\mathbb{R}^{D}$
  3. The RVQ-1 feature is obtained by passing the selected codebook vector through the ResNet decoder:
    (Eq. 2) $\text{RVQ}_{1\text{-}feat} = \text{ResNet}(e_{k})$
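
The downsampler and the lookup in (Eq. 1) can be illustrated with a small plain-Python sketch. It is only a toy under stated assumptions: features are lists of per-frame vectors, trailing frames that do not fill a pooling window are dropped, and the names `average_pool`, `l2_normalize`, `project`, and `quantize` are hypothetical helpers, not functions from the paper's code.

```python
import math
from typing import List, Sequence


def average_pool(frames: Sequence[Sequence[float]], factor: int) -> List[List[float]]:
    """1D average pooling over time with kernel_size = stride = factor.

    Assumption: leftover frames that do not fill a full window are dropped.
    """
    pooled: List[List[float]] = []
    for start in range(0, len(frames) - factor + 1, factor):
        window = frames[start:start + factor]
        dim = len(window[0])
        pooled.append([sum(f[d] for f in window) / factor for d in range(dim)])
    return pooled


def l2_normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit L2 norm (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def project(w_in: Sequence[Sequence[float]], frame: Sequence[float]) -> List[float]:
    """Apply the D x H input projection W_in to one H-dimensional frame."""
    return [sum(w * x for w, x in zip(row, frame)) for row in w_in]


def quantize(
    frames: Sequence[Sequence[float]],
    w_in: Sequence[Sequence[float]],
    codebook: Sequence[Sequence[float]],
) -> List[int]:
    """Per frame, pick argmin_k || l2(W_in z) - l2(e_k) ||_2 as in (Eq. 1)."""
    unit_codes = [l2_normalize(code) for code in codebook]
    tokens: List[int] = []
    for frame in frames:
        z = l2_normalize(project(w_in, frame))
        dists = [sum((a - b) ** 2 for a, b in zip(z, code)) for code in unit_codes]
        tokens.append(dists.index(min(dists)))
    return tokens


# Toy example: four 50Hz frames pooled by a factor of 2 (i.e. down to 25Hz).
ssl_frames = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0], [0.0, 4.0]]
pooled = average_pool(ssl_frames, factor=2)
identity = [[1.0, 0.0], [0.0, 1.0]]  # stand-in for W_in
codebook = [[1.0, 0.0], [0.0, 1.0]]  # stand-in codebook with two entries
print(pooled, quantize(pooled, identity, codebook))
```

Because both sides are L2-normalized, minimizing this distance is equivalent to maximizing cosine similarity, so only the direction of the projected feature decides the token.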

- Waveform Encoding

Waveform encoding uses an RVQ module and a codec encoder/decoder
- Codec Encoder and Decoder
  1. The codec encoder/decoder are CNNs with the snake activation function
  2. The encoder downsamples the waveform to the frame rate with strided convolutions, and the decoder mirrors the encoder by replacing the strided convolutions with upsampling convolutions
- Frame Rate
  1. DualCodec takes 24kHz waveforms as input
  2. To output tokens at a 25Hz frame rate, the codec encoder uses 4 CNN blocks with strides $(4,5,6,8)$
    - That is, $24000\text{Hz}\div (4\times 5\times 6\times 8)=25\text{Hz}$
  3. For 12.5Hz, it uses strides $(4,5,6,8,2)$
- RVQ Module
  1. The RVQ module consists of $N-1$ layers, and each VQ layer quantizes the residual error of the previous layer
    - Its input is the residual between the waveform feature and $\text{RVQ}_{1\text{-}feat}$, which it discretizes into $\text{RVQ}_{rest}\in\mathbb{Z}^{(N-1)\times T}$
  2. After the $\text{RVQ}_{rest}$ tokens are obtained, the selected codebook vectors $e_{k}$ are added to $\text{RVQ}_{1\text{-}feat}$
    - The resulting continuous feature combines the SSL encoding and the waveform encoding, and serves as the codec decoder input
  3. In particular, the paper introduces RVQ dropout during training: at each step only the first $q$ RVQ quantizers are used, with $q\in [0,N-1]$ chosen at random
    - When $q=0$, only the SSL encoding stream is used and only the RVQ-1 token is decoded
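
The frame-rate arithmetic and the RVQ dropout rule above are simple enough to check in a few lines. The sketch below is illustrative only: the function names are hypothetical, and drawing $q$ uniformly is an assumption, since the text only says that $q$ is chosen at random.

```python
import random
from math import prod
from typing import List, Sequence


def frame_rate(sample_rate_hz: int, strides: Sequence[int]) -> float:
    """Token frame rate after an encoder whose blocks use the given strides."""
    return sample_rate_hz / prod(strides)


def sample_num_quantizers(n_layers: int, rng: random.Random) -> int:
    """Draw q from {0, ..., N-1} for RVQ dropout (uniform draw is an assumption)."""
    return rng.randint(0, n_layers - 1)


def tokens_to_decode(rvq1_token: int, rest_tokens: Sequence[int], q: int) -> List[int]:
    """Keep the RVQ-1 token plus the first q RVQ-rest tokens of one frame."""
    return [rvq1_token] + list(rest_tokens[:q])


print(frame_rate(24000, (4, 5, 6, 8)))     # 25.0
print(frame_rate(24000, (4, 5, 6, 8, 2)))  # 12.5

rng = random.Random(0)
q = sample_num_quantizers(n_layers=8, rng=rng)  # N = 8 is a made-up example value
print(tokens_to_decode(rvq1_token=17, rest_tokens=[3, 9, 4, 1, 0, 6, 2], q=q))
```

With `q = 0` the helper returns only the RVQ-1 token, matching the case where decoding relies on the SSL encoding stream alone.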

- Training Objective

The dual encoding framework is trained end-to-end with the following losses:
- SSL Reconstruction Loss
  - An MSE loss between the reconstructed SSL feature and the input SSL feature, computed on the SSL feature downsampled to 12.5Hz/25Hz
- Spectrogram Reconstruction Loss
  - A multi-scale mel-spectrogram loss between the input and reconstructed audio
- Quantization Loss
  - The codebook is updated with an $L1$ loss between the quantized and unquantized features, together with a commitment loss and a straight-through estimator
- Adversarial Loss
  - As in EnCodec, a Multi-Period Discriminator (MPD) and a Multi-Scale STFT Discriminator (MS-STFTD) are used, and an $L1$ feature matching loss is applied on all intermediate layers for generated/ground-truth samples
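
Two of the simpler terms above, the SSL reconstruction MSE and the $L1$ feature matching loss, can be written out as a rough plain-Python sketch. The function names, the flat-list representation of discriminator feature maps, and the choice to average over elements and over layers are assumptions for illustration; the loss weights are not given in these notes, so no combined objective is shown.

```python
from typing import Sequence


def mse(pred: Sequence[Sequence[float]], target: Sequence[Sequence[float]]) -> float:
    """Mean squared error over all frames and feature dimensions."""
    total = 0.0
    count = 0
    for p_frame, t_frame in zip(pred, target):
        for p, t in zip(p_frame, t_frame):
            total += (p - t) ** 2
            count += 1
    return total / count


def l1_feature_matching(
    real_feats: Sequence[Sequence[float]],
    fake_feats: Sequence[Sequence[float]],
) -> float:
    """Mean absolute difference per discriminator layer, averaged over layers.

    Each entry is one intermediate layer's activations, flattened to a list.
    """
    per_layer = []
    for real, fake in zip(real_feats, fake_feats):
        per_layer.append(sum(abs(r - f) for r, f in zip(real, fake)) / len(real))
    return sum(per_layer) / len(per_layer)


# SSL reconstruction: compare the decoded SSL feature with the downsampled target.
print(mse(pred=[[1.0, 2.0]], target=[[1.0, 4.0]]))  # 2.0

# Feature matching across two hypothetical discriminator layers.
print(l1_feature_matching([[1.0, 2.0], [0.0]], [[2.0, 2.0], [1.0]]))  # 0.75
```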

3. Experiments

- Settings

Dataset : Emilia
Comparisons : DAC, EnCodec, SpeechTokenizer, WavTokenizer, Mimi

- Results

Overall, DualCodec achieves the best performance among the compared codecs

Semantic Content Analysis
- Using dual encoding substantially improves WER

TTS Analysis
- Using DualCodec with VALL-E and SoundStorm yields the best TTS performance
