[Paper Review] ComplexDec: A Domain-Robust High-Fidelity Neural Audio Codec with Complex Spectrum Modeling

feVeRin 2025. 3. 27. 20:17

ComplexDec: A Domain-Robust High-Fidelity Neural Audio Codec with Complex Spectrum Modeling

Existing neural audio codecs struggle to model out-of-domain audio
ComplexDec
- Poor out-of-domain robustness stems from the information loss caused by codec compression
- Uses complex spectral input/output to mitigate this information loss at a 24kbps bitrate
Paper (ICASSP 2025) : Paper Link

1. Introduction

Digital Signal Processing (DSP)-based audio codecs perform aggressive compression at the cost of audio quality
- More recently, neural codecs such as SoundStream, EnCodec, and AudioDec use an AutoEncoder (AE) with a Residual Vector Quantizer (RVQ) architecture to obtain discrete audio tokens
  - BUT, these data-driven neural codecs are more vulnerable to unseen audio than DSP codecs, and their errors propagate to downstream audio generation
- Improving out-of-domain robustness requires addressing the information loss caused by temporal/dimensional compression
  - This is because a typical encoder projects the high-temporal-resolution waveform into a low-temporal-resolution, high-dimensional space, and then also compresses the embedding dimension for codebook learning stability

-> The paper therefore proposes ComplexDec, a neural codec that mitigates the temporal/dimensional compression problem

ComplexDec
- Performs speech coding in the complex spectral domain
- Complex spectra already have a low temporal resolution (e.g., 150Hz), so a fully convolutional architecture without down/upsampling layers can bypass temporal compression
  - A low spectral dimension of 256 also avoids dimensionality reduction
- In addition, extracting the complex spectra with the STFT mitigates information loss

< Overview of ComplexDec >

A neural codec that avoids temporal/dimensional compression by working in the complex spectral domain
The reduced information loss yields better performance on out-of-domain data than existing codecs

2. Method

- Problem Formulation

Consider a standard waveform-domain RVQAE codec for 48kHz speech coding with a 150Hz frame rate and 16 10-bit codebooks, operating at 24kbps ($150\times 16\times 10$):
- The compression ratio is $\frac{48000}{150\times H}$
  - $H$ : code dimension
- With a fixed codebook size, the bitrate depends only on the temporal resolution (frame rate) of the codes, so a higher $H$ keeps the same bitrate while reducing the compression ratio and thus the information loss
  1. BUT, standard neural codecs such as AudioDec still compress the code dimension to avoid unstable codebook learning, which causes information loss
  2. DAC adopts handcrafted code factorization: high-dimensional code embeddings mitigate the information loss, while low-dimensional code lookup maintains codebook learning stability
    - BUT, high-dimensional representations increase audio modeling difficulty and memory requirements, making them unsuitable for regression-based audio generation
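
The arithmetic above can be checked with a few lines of Python. This is only a sketch of the two formulas in this section; the function names are made up for illustration, and the code dimensions 64, 256, and 1024 are the ones this post quotes for AudioDec, ComplexDec, and DAC.

```python
def bitrate_bps(frame_rate_hz, num_codebooks, bits_per_codebook):
    """Bits per second emitted by an RVQ codec with fixed-size codebooks."""
    return frame_rate_hz * num_codebooks * bits_per_codebook


def compression_ratio(sample_rate_hz, frame_rate_hz, code_dim):
    """Waveform samples per second divided by code values per second."""
    return sample_rate_hz / (frame_rate_hz * code_dim)


SAMPLE_RATE = 48_000  # 48kHz speech
FRAME_RATE = 150      # 150Hz code frame rate

print(bitrate_bps(FRAME_RATE, 16, 10))  # 24000, i.e. 24kbps

# The bitrate does not depend on the code dimension H, but the ratio does.
for code_dim in (64, 256, 1024):
    print(code_dim, compression_ratio(SAMPLE_RATE, FRAME_RATE, code_dim))
# 64 5.0
# 256 1.25
# 1024 0.3125
```

A ratio below 1 (the 1024-dimensional case) means the code sequence carries more values per second than the waveform itself, which is the high-dimension issue noted above.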

- Model Overview

ComplexDec aims to mitigate information loss while avoiding the high-dimension issue
- First, audio is coded in the complex spectral domain using a hop length of 320, an STFT size of 510, and a Hann window, which gives a 150Hz frame rate and a 256-dimensional embedding
- At 24kbps, ComplexDec uses an RVQAE architecture without down/upsampling layers and with 16 10-bit codebooks (8 real / 8 imaginary)
  1. Compared with the waveform-based AudioDec, which uses 64-dimensional codes and a compression ratio of 5, ComplexDec has a smaller compression ratio of 1.25
  2. Compared with 24kbps DAC, which uses 1024-dimensional codes and a compression ratio of 0.3125, ComplexDec uses 256-dimensional codes
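
To see where the 150Hz frame rate and the 256 dimensions come from, here is a minimal NumPy sketch of the STFT front end. It is not the paper's implementation: the helper `complex_stft`, the lack of padding, and the symmetric `np.hanning` window are assumptions made for illustration. Only the hop length (320), STFT size (510), and sampling rate (48kHz) come from the text above.

```python
import numpy as np

SAMPLE_RATE = 48_000
N_FFT = 510  # STFT size quoted in the post
HOP = 320    # hop length quoted in the post


def complex_stft(wave, n_fft=N_FFT, hop=HOP):
    """Return real and imaginary spectra, each shaped (frames, n_fft // 2 + 1)."""
    window = np.hanning(n_fft)  # symmetric Hann window (assumption)
    n_frames = 1 + (len(wave) - n_fft) // hop  # no padding, for simplicity
    frames = np.stack([wave[i * hop : i * hop + n_fft] for i in range(n_frames)])
    spec = np.fft.rfft(frames * window, axis=-1)  # one-sided spectrum
    return spec.real, spec.imag


one_second = np.random.default_rng(0).standard_normal(SAMPLE_RATE)
real, imag = complex_stft(one_second)

print(SAMPLE_RATE / HOP)       # 150.0 frames per second
print(real.shape, imag.shape)  # (149, 256) (149, 256)
```

An STFT size of 510 gives 510 / 2 + 1 = 256 frequency bins, so the real and imaginary parts each form a 256-dimensional frame. Without padding, one second yields 149 rather than 150 frames; a padded STFT would recover the last frame.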

- Model Architecture

ComplexDec consists of two RVQAEs, one each for the real and imaginary spectra
- The encoder/decoder are shared, but the codebooks are kept independent for reconstruction quality
  - In addition, the Score-based Post-Filter (SPF) from ScoreDec is adopted to refine the decoded complex spectra
- The encoder consists of a 1-dimensional convolution layer (Conv1D) with kernel size 7, 4 encoder blocks, and an additional Conv1D with kernel size 3
  1. All Conv1D layers have 256 channels
  2. Each encoder block consists of a Conv1D with kernel size 2 and 3 residual units
    - Each residual unit consists of 2 Exponential Linear Unit (ELU)-dilated Conv1D layers with a residual connection
    - Dilations are $[1,3,9]$ and the kernel size is 7
  3. Rather than strictly mirroring the encoder, the decoder replaces the dilated Conv1D layers with transposed Conv1D layers
- The RVQAE is trained with the spectral losses $\mathcal{L}_{MSE}, \mathcal{L}_{MAE}$, a multi-resolution mel-spectral loss $\mathcal{L}_{Mel}$, and a commitment VQ loss $\mathcal{L}_{VQ}$, and the codebooks are updated using an Exponential Moving Average (EMA) of the encoded/residual representations
  1. Given the input spectrum $x=x_{r}+ix_{i}$ and the reconstructed spectrum $\hat{x}=\hat{x}_{r}+i\hat{x}_{i}$, the $\mathcal{L}_{MSE}$ loss is:
    (Eq. 1) $\mathcal{L}_{MSE}=\mathbb{E}\left[\frac{||x_{r}-\hat{x}_{r}||_{2}+|| x_{i}-\hat{x}_{i}||_{2}}{2}\right]$
  2. The $\mathcal{L}_{MAE}$ loss is:
    (Eq. 2) $\mathcal{L}_{MAE}=\mathbb{E}\left[||x-\hat{x}||_{1}\right]$
- To compute $\mathcal{L}_{Mel}$ on 80-dimensional mel-spectra extracted with hop lengths $[50,120,240]$ and STFT sizes $[512, 1024, 2048]$, the paper introduces an additional inverse STFT module
  1. The SPF is trained with score matching, following ScoreDec
  2. That is, given a pair of natural and coded complex spectra $(x_{0},\hat{x})$, the Ornstein-Uhlenbeck Variance Exploding (OUVE) stochastic forward process of the SPF is:
    (Eq. 3) $\text{d}x_{t}=\underset{:=f(x_{t},\hat{x})}{\underbrace{\gamma(\hat{x}-x_{t})}}\text{d}t+ \underset{:=g(t)}{\underbrace{\left[\sigma_{\min}\left(\frac{\sigma_{\max}}{\sigma_{\min}}\right)^{t}\sqrt{2\log \left(\frac{\sigma_{\max}}{\sigma_{\min}}\right)}\right]}} \text{d}\mathbf{w}$
    - $t\in[0,T]$ : diffusion time step, $\mathbf{w}$ : standard Wiener process
    - $f(x_{t},\hat{x})$ : drift function, $g(t)$ : diffusion coefficient
    - $\gamma, (\sigma_{\min}, \sigma_{\max})$ : constant hyperparameters
  3. Let $\mathbf{s}$ denote the score function $\nabla_{x_{t}}\log p_{t}(x_{t})$. Given a time-reversed Wiener process $\bar{\mathbf{w}}$, the reverse Stochastic Differential Equation (SDE) is:
    (Eq. 4) $\text{d}x_{t}=\left[-f(x_{t},\hat{x})+g(t)^{2}\mathbf{s}\right]\text{d}t+g(t)\text{d}\bar{\mathbf{w}}$
  4. Under the Gaussian process of (Eq. 3), $x_{t}$ follows a normal distribution with mean $\mu(x_{0},\hat{x},t)$ and variance $\sigma(t)^{2}$, so $x_{t}$ can be obtained from sampled Gaussian noise $\mathbf{z}$:
    (Eq. 5) $x_{t}=\mu(x_{0},\hat{x},t)+\sigma(t)\mathbf{z}$
  5. The score function $\mathbf{s}$ is then:
    (Eq. 6) $\nabla_{x_{t}}\log p_{t}(x_{t}|x_{0},\hat{x})=-\frac{x_{t}-\mu(x_{0},\hat{x},t)}{\sigma(t)^{2}}$
  6. It can be estimated by a neural network $\mathbf{s}_{\theta}$ trained with the score-matching objective:
    (Eq. 7) $\arg\min_{\theta}\mathbb{E}_{x_{t}|(x_{0},\hat{x}),\hat{x},\mathbf{z},t}\left[\left|\left| \mathbf{s}_{\theta}(x_{t},\hat{x},t)+\frac{\mathbf{z}}{\sigma(t)}\right|\right|_{2}^{2}\right]$
- Given a well-trained $\mathbf{s}_{\theta}$, the SPF solves the reverse SDE with a predictor-corrector sampler that uses reverse diffusion sampling as the predictor and annealed Langevin Dynamics as the corrector
  - The corrector uses an SNR parameter of 0.5, with 30 reverse steps
  - In addition, the spectra in the diffusion process are modulated/demodulated as follows:
    (Eq. 8) $x'=\beta|x|^{\alpha}e^{i\angle(x)}$
    (Eq. 9) $x=\left(\beta^{-1}|x'|\right)^{\frac{1}{\alpha}}e^{i\angle(x')}$
    - $\angle(\cdot)$ : angle of a complex number
    - $\alpha=0.5$ : amplitude companding constant
    - $\beta=0.15$ : scaling constant that normalizes the amplitude to $[0,1]$
- The SPF score model adopts a U-Net style Noise Conditional Score Network (NCSN++) and treats the real/imaginary spectra as 2 channels
  1. Given the 4-channel $256\times 256$ input formed from the complex spectra $x_{0},\hat{x}$, NCSN++ first gradually projects the feature map into a space with lower resolution and more channels
  2. The feature map is then projected back to $256\times 256$ in a mirrored manner
    - Skip connections and a progressive conditional structure are also integrated, and the diffusion time step is incorporated into the network via Fourier embeddings
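
The SPF training target in (Eq. 5)-(Eq. 7) and the companding in (Eq. 8) can be sketched in NumPy as below. This is an illustration under stated assumptions, not the paper's code. The closed forms in `ouve_mean` and `ouve_std` are not given in this post; they follow from solving the mean and variance equations of the linear SDE in (Eq. 3). The values of `GAMMA`, `SIGMA_MIN`, and `SIGMA_MAX` are hypothetical placeholders, and `demodulate` is written as the exact inverse of (Eq. 8), dividing by $\beta$ before applying the exponent $1/\alpha$.

```python
import numpy as np

# Hypothetical values: the post only says these are constant hyperparameters.
GAMMA, SIGMA_MIN, SIGMA_MAX = 1.5, 0.05, 0.5
ALPHA, BETA = 0.5, 0.15  # companding constants quoted in the post


def modulate(x, alpha=ALPHA, beta=BETA):
    """Eq. 8: compand the amplitude and rescale it, keeping the phase."""
    return beta * np.abs(x) ** alpha * np.exp(1j * np.angle(x))


def demodulate(x_mod, alpha=ALPHA, beta=BETA):
    """Exact inverse of modulate: undo the scaling, then the companding."""
    return (np.abs(x_mod) / beta) ** (1.0 / alpha) * np.exp(1j * np.angle(x_mod))


def ouve_mean(x0, x_hat, t, gamma=GAMMA):
    """Mean of x_t: decays from the natural x0 towards the coded x_hat."""
    decay = np.exp(-gamma * t)
    return decay * x0 + (1.0 - decay) * x_hat


def ouve_std(t, gamma=GAMMA, s_min=SIGMA_MIN, s_max=SIGMA_MAX):
    """Standard deviation of x_t, from the variance equation of Eq. 3."""
    log_k = np.log(s_max / s_min)
    growth = (s_max / s_min) ** (2 * t) - np.exp(-2 * gamma * t)
    return np.sqrt(s_min**2 * growth * log_k / (gamma + log_k))


def perturb(x0, x_hat, t, rng):
    """Eq. 5: draw x_t and return it with the noise z that produced it."""
    # Unit-variance noise on the real and imaginary parts separately,
    # matching the 2-channel view of the spectra (illustrative choice).
    z = rng.standard_normal(x0.shape) + 1j * rng.standard_normal(x0.shape)
    return ouve_mean(x0, x_hat, t) + ouve_std(t) * z, z


def score_matching_loss(score_pred, z, t):
    """Eq. 7 for one sample: squared norm of s_theta + z / sigma(t)."""
    return np.sum(np.abs(score_pred + z / ouve_std(t)) ** 2)


rng = np.random.default_rng(0)
raw = rng.standard_normal((4, 4)) + 1j * rng.standard_normal((4, 4))
x0 = modulate(raw)  # natural spectrum after companding
x_hat = x0 + 0.01   # stand-in for the codec output
t = 0.5
x_t, z = perturb(x0, x_hat, t, rng)

# The conditional score of Eq. 6 is the ideal prediction, so the loss vanishes
# up to floating-point rounding.
ideal_score = -(x_t - ouve_mean(x0, x_hat, t)) / ouve_std(t) ** 2
loss = score_matching_loss(ideal_score, z, t)
round_trip_ok = np.allclose(demodulate(x0), raw)
print(loss < 1e-12, round_trip_ok)  # True True
```

In actual training, `ideal_score` would be replaced by the output of the score network $\mathbf{s}_{\theta}(x_{t},\hat{x},t)$, with $t$ sampled at random for every example.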

3. Experiments

- Settings

Dataset : VCTK, EARS
Comparisons : AudioDec, ScoreDec, EnCodec, DAC

- Results

Overall, ComplexDec achieves the best performance

ComplexDec is also the best in terms of MOS

Discussion
- Comparing magnitude spectrograms, AudioDec fails to reconstruct the harmonic structure and produces blurry spectrograms
- ScoreDec cannot recover the missing harmonics because of its diffusion nature
- In contrast, ComplexDec preserves the harmonic structure below 6kHz well
