[Paper Review] SPCodec: Split and Prediction for Neural Speech Codec

feVeRin 2025. 8. 29. 17:08

SPCodec: Split and Prediction for Neural Speech Codec

Existing neural codecs do not fully exploit the correlation between different frequency bands
SPCodec
- Introduces a group residual vector quantization module built on a latent split-and-prediction scheme
- Disentangles low-/high-frequency representations to reduce feature redundancy
Paper (INTERSPEECH 2025) : Paper Link

1. Introduction

A speech codec generally consists of an encoder, a quantizer, and a decoder
- In particular, SoundStream, EnCodec, and DAC use a Residual Vector Quantization (RVQ)-based convolutional encoder-decoder trained in an end-to-end manner
- BUT, these neural codecs quantize the entire latent embedding with a high-dimensional codebook, which makes it hard to interpret the relationship between the latent embedding and speech characteristics
  - In addition, they do not fully exploit the redundancy in the latent space
- To address this, one can divide the signal into multiple subbands as in MBD and LightCodec, or partition the latent embedding as in HiFi-Codec and SRCodec
  - BUT, these approaches still struggle to reflect spectral characteristics

-> The paper therefore proposes SPCodec, a neural codec that takes spectral characteristics into account

SPCodec
- Supports various frequency ranges with a single model
- Explicitly disentangles low-/high-frequency components through a latent feature split-and-prediction scheme, which aligns the latent with spectral characteristics and improves interpretability

< Overview of SPCodec >

A neural codec built on a latent feature split-and-prediction scheme
As a result, it outperforms existing neural codecs

2. Method

- SPCodec Framework

The paper considers a single-channel time-domain input signal $x$
- Structurally, an end-to-end (E2E) neural codec consists of an encoder $\text{Enc}_{t}$, a quantizer $Q$, and a decoder $\text{Dec}_{t}$
  1. First, the encoder transforms $x$ into a latent embedding sequence:
    (Eq. 1) $e=\text{Enc}_{t}(x)$
  2. The quantizer assigns each embedding to an entry of a finite codebook and compresses it using the codebook index:
    (Eq. 2) $\hat{e}=Q(e)$
  3. The decoder reconstructs the time-domain signal from the quantized latent embedding:
    (Eq. 3) $\hat{x}=\text{Dec}_{t}(\hat{e})=\text{Dec}_{t}(Q(\text{Enc}_{t}(x)))$
- Following SoundStream and DAC, SPCodec is built as a fully convolutional encoder-decoder architecture based on Residual Vector Quantization (RVQ)
- Because lower frequencies strongly affect subjective quality and higher frequencies can be predicted from lower frequencies:
  1. The paper divides the latent embedding into groups and applies a constraint that associates each group with a specific frequency band
  2. On the encoder side, a prediction module decorrelates the high-frequency embedding from the quantized low-frequency embedding to reduce redundancy
  3. On the decoder side, a prediction module reconstructs the high-frequency embedding from the quantized low-frequency embedding
- The latent embedding is divided into 2 groups as $\text{split}(e)=[e_{l},e_{h}]$
  1. Here $e_{l}$ contains the features for reconstructing the low-frequency portion, and $e_{h}$ contains the features for reconstructing the high-frequency portion
    - In particular, the high-frequency embedding can be predicted from the quantized low-frequency embedding
  2. The embedding $e_{hp}$ is obtained from the unquantized high-frequency embedding and the quantized low-frequency embedding:
    (Eq. 4) $e_{hp}=\text{Pred}_{e}(\hat{e}_{l},e_{h})$
  3. The paper uses 2 separate quantizers for the low-/high-frequency components:
    (Eq. 5) $\hat{e}_{l}=Q_{l}(e_{l}),\hat{e}_{hp}=Q_{h}(e_{hp})$
  4. The quantized high-frequency embedding that is passed to the decoder is obtained as:
    (Eq. 6) $\hat{e}_{h}=\text{Pred}_{d}(\hat{e}_{l},\hat{e}_{hp})$
  5. $\hat{e}_{l},\hat{e}_{h}$ are either concatenated to reconstruct the complete signal, or used with the decoder $\text{Dec}_{t}$ to reconstruct the low-/high-frequency portions:
    (Eq. 7) $\hat{x}=\text{Dec}_{t}\left([\hat{e}_{l},\hat{e}_{h}]\right)$
    (Eq. 8) $\hat{x}_{l}=\text{Dec}_{t}\left([\hat{e}_{l},0]\right)$
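
The data flow of Eq. 4-8 can be sketched for a single latent frame in plain Python. This is only an illustration under stated assumptions: `toy_predictor` is a hypothetical additive stand-in that depends only on $\hat{e}_{l}$ (the paper's $\text{Pred}_{e}$, $\text{Pred}_{d}$ are learned attention-based modules), the codebooks are hand-written rather than trained, and all function names are made up for this example.

```python
def nearest(codebook, v):
    """Return the codebook entry closest to v (squared Euclidean distance)."""
    return min(codebook, key=lambda c: sum((a - b) ** 2 for a, b in zip(c, v)))


def rvq(v, codebooks):
    """Residual VQ: each stage quantizes what the previous stages left over."""
    residual, out = list(v), [0.0] * len(v)
    for codebook in codebooks:
        q = nearest(codebook, residual)
        out = [o + x for o, x in zip(out, q)]
        residual = [r - x for r, x in zip(residual, q)]
    return out


def toy_predictor(e_l_hat, dim_h):
    """Hypothetical stand-in: guesses the high band as half of the low band."""
    return [0.5 * e_l_hat[i % len(e_l_hat)] for i in range(dim_h)]


def sp_quantize(e, dim_l, codebooks_l, codebooks_h):
    e_l, e_h = e[:dim_l], e[dim_l:]                     # split(e) = [e_l, e_h]
    e_l_hat = rvq(e_l, codebooks_l)                     # Eq. 5: Q_l
    pred = toy_predictor(e_l_hat, len(e_h))
    e_hp = [h - p for h, p in zip(e_h, pred)]           # Eq. 4 (toy Pred_e)
    e_hp_hat = rvq(e_hp, codebooks_h)                   # Eq. 5: Q_h
    e_h_hat = [q + p for q, p in zip(e_hp_hat, pred)]   # Eq. 6 (toy Pred_d)
    full = e_l_hat + e_h_hat                            # decoder input of Eq. 7
    low_only = e_l_hat + [0.0] * len(e_h)               # decoder input of Eq. 8
    return full, low_only


# Hand-written toy codebooks: two RVQ stages for the low band, one for the high band.
codebooks_l = [[[1.0, 2.0], [0.0, 0.0]], [[0.0, 0.0], [0.5, 0.5]]]
codebooks_h = [[[0.0, 0.0], [1.0, 1.0]]]
full, low_only = sp_quantize([1.0, 2.0, 0.5, 1.0], 2, codebooks_l, codebooks_h)
```

In this toy case the high band is exactly what the stand-in predictor guesses, so $e_{hp}$ is all zeros and the high-band quantizer has nothing left to code. That is the redundancy the split-and-prediction scheme is meant to remove.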

- Latent Split and Prediction (SP)

The latent embedding $e$ is split into $e_{l},e_{h}$ along the channel dimension
- SPCodec supervises the waveforms reconstructed from the split latents, which supports feature disentanglement and establishes a correspondence between latent features and spectral ranges
  - Because low-frequency features account for a large share of the content and resolution, the dimension of $e_{l}$ is set larger than that of $e_{h}$
- An attention-based module is used to support feature prediction from low-frequency features to high-frequency features
  1. The low-frequency feature input $\text{in}_{l}$ is used to generate the mask needed to predict the high-frequency feature $\text{Pred}_{h}$
  2. The predicted high-frequency feature is used either to eliminate feature redundancy in the encoder or to facilitate feature merging in the decoder
    - As a result, high-frequency features that are predictable from low-frequency features do not need to be included in the bitstream
- The Feature Transform (FT) linearly transforms the input high-frequency feature through a convolution layer, and the Mask Generator (MG) computes an attention mask using convolution layers and a non-linear activation:
  (Eq. 9) $\text{Pred}_{h}=\text{FT}(\text{in}_{h})\cdot \text{MG}(\text{in}_{l})$
  (Eq. 10) $\text{Pred}_{out}=\text{in}_{h}\pm\text{Pred}_{h}$
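
A per-frame sketch of Eq. 9-10 follows. Several things here are assumptions and not details from the paper: the convolutions are reduced to kernel-size-1 layers (a plain matrix multiply per frame), the non-linearity is a sigmoid, and the $\pm$ is read as subtraction on the encoder side (removing the predictable part) and addition on the decoder side (merging it back). The function and weight names are hypothetical.

```python
import math


def linear(W, v):
    """Apply weight matrix W (rows = output channels) to one frame v.

    A kernel-size-1 convolution acts like this on each time frame.
    """
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def feature_transform(W_ft, in_h):
    """FT: linear transform of the high-frequency input."""
    return linear(W_ft, in_h)


def mask_generator(W_mg, in_l):
    """MG: linear layer plus sigmoid, giving a mask in (0, 1) per channel."""
    return [1.0 / (1.0 + math.exp(-z)) for z in linear(W_mg, in_l)]


def predict(W_ft, W_mg, in_l, in_h, sign):
    """Eq. 9-10. sign=-1.0 for the encoder side, +1.0 for the decoder side (assumed)."""
    ft = feature_transform(W_ft, in_h)
    mask = mask_generator(W_mg, in_l)
    pred_h = [f * m for f, m in zip(ft, mask)]            # Eq. 9
    return [h + sign * p for h, p in zip(in_h, pred_h)]   # Eq. 10


# Toy weights: identity FT and an all-zero MG layer, so every mask value is sigmoid(0) = 0.5.
W_ft = [[1.0, 0.0], [0.0, 1.0]]
W_mg = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
enc_out = predict(W_ft, W_mg, [1.0, 1.0, 1.0], [2.0, 4.0], -1.0)
dec_out = predict(W_ft, W_mg, [1.0, 1.0, 1.0], [2.0, 4.0], +1.0)
```

Note that the mask depends only on the low-frequency input, which the decoder also has in quantized form, so both sides can compute the same mask without extra bits.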

- Training Paradigm

SPCodec is trained end-to-end together with discriminators, combining a reconstruction loss and an adversarial loss to improve perceptual quality
- The reconstruction loss uses the waveform $L1$ distance, a log-spectrogram loss, and a log-mel-spectrogram loss:
  (Eq. 11) $\mathcal{L}_{rec}=\lambda_{wav}\mathcal{L}_{wav}+\lambda_{spec}\mathcal{L}_{spec}+\lambda_{mel}\mathcal{L}_{mel}$
- For adversarial training, the paper uses 2 discriminators: a multi-period discriminator $\text{Dis}_{wav}$ and a complex multi-scale band-splitting STFT discriminator $\text{Dis}_{stft}$
  1. To improve perceptual quality, it introduces an adversarial loss $\mathcal{L}_{adv}$ formulated as an $L2$ loss on the discriminator logits, averaged over the multiple discriminators and over time frames
  2. The feature loss $\mathcal{L}_{feat}$ is the average absolute difference between the discriminator's internal layer outputs for the generated audio and those for the target audio
  3. In addition, a VQ loss is used to keep the quantized values close to the quantizer input:
    (Eq. 12) $\mathcal{L}_{vq}=\lambda_{commit}\mathcal{L}_{commit}+\lambda_{codebook}\mathcal{L}_{codebook}$
    - Here the paper follows EnCodec and adopts a commitment loss and a codebook loss
- The resulting loss for a reconstructed waveform is:
  (Eq. 13) $\mathcal{L}=\lambda_{rec}\mathcal{L}_{rec}+\lambda_{adv}\mathcal{L}_{adv}+\lambda_{feat}\mathcal{L}_{feat}+\lambda_{vq}\mathcal{L}_{vq}$
- The total loss of SPCodec is applied to both $\hat{x}$ and $\hat{x}_{l}$:
  (Eq. 14) $\mathcal{L}_{SPCodec}=\mathcal{L}_{\hat{x}}+\mathcal{L}_{\hat{x}_{l}}$
  - The explicit supervision on $\hat{x}_{l}$ ensures that the embedding $e_{l}$ contains only low-frequency-related features
  - $\lambda_{spec}=\lambda_{mel}=15, \lambda_{feat}=2, \lambda_{commit}=0.25$, and all other hyperparameters are set to $1.0$
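
To make the weighting concrete, the sketch below combines already-computed scalar loss terms according to Eq. 11-14 with the weights listed above. The individual terms (waveform, spectrogram, adversarial, and so on) are treated as given numbers. The dictionary layout and function names are assumptions made for this example, and applying every term of Eq. 13 to both waveforms simply follows Eq. 14 as written.

```python
# Weights as stated in the notes; every weight not listed there is 1.0.
WEIGHTS = {
    "wav": 1.0, "spec": 15.0, "mel": 15.0,   # Eq. 11
    "commit": 0.25, "codebook": 1.0,          # Eq. 12
    "rec": 1.0, "adv": 1.0, "feat": 2.0, "vq": 1.0,  # Eq. 13
}


def waveform_loss(t, w=WEIGHTS):
    """Eq. 11-13 for one reconstructed waveform; t maps term name -> scalar value."""
    l_rec = w["wav"] * t["wav"] + w["spec"] * t["spec"] + w["mel"] * t["mel"]
    l_vq = w["commit"] * t["commit"] + w["codebook"] * t["codebook"]
    return (w["rec"] * l_rec + w["adv"] * t["adv"]
            + w["feat"] * t["feat"] + w["vq"] * l_vq)


def spcodec_loss(terms_full, terms_low, w=WEIGHTS):
    """Eq. 14: the same loss applied to x_hat and to the low-band-only x_hat_l."""
    return waveform_loss(terms_full, w) + waveform_loss(terms_low, w)


# With every term equal to 1.0, the weights alone determine the total.
ones = {k: 1.0 for k in ("wav", "spec", "mel", "commit", "codebook", "adv", "feat")}
per_waveform = waveform_loss(ones)   # 31 + 1 + 2 + 1.25 = 35.25
total = spcodec_loss(ones, ones)     # 70.5
```

With equal-magnitude terms, the two spectral losses contribute 30 of the 35.25 units per waveform, which shows how strongly the stated weights favor spectral reconstruction.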

3. Experiments

- Settings

Dataset : VCTK, LibriTTS
Comparisons : SoundStream, DAC, HiFi-Codec, SRCodec

- Results

Overall, SPCodec achieves the best performance

SPCodec also achieves the best MOS at each bitrate

Ablation Study
- SPModule shows strong low+high frequency reconstruction performance

SPCodec also achieves the best reconstruction performance on the low-frequency portion
