[Paper Review] EuleroDec: A Complex-Valued RVQ-VAE for Efficient and Robust Audio Coding


feVeRin 2026. 5. 20. 12:57

EuleroDec: A Complex-Valued RVQ-VAE for Efficient and Robust Audio Coding

Spectrogram-domain codecs are limited in how well they model complex-valued phase
EuleroDec
- Preserves magnitude-phase coupling throughout the Analysis-Quantization-Synthesis pipeline
- In particular, removes the adversarial discriminator and the diffusion post-filter to support end-to-end processing
Paper (ICASSP 2026) : Paper Link

1. Introduction

Spectral-domain audio codecs decompose the signal into the time-frequency domain via the STFT
- The magnitude spectrum alone captures most of the perceptual content, but an improper phase spectrum produces audible artifacts in the decoded signal
- To address this, EnCodec, DAC, etc. use multi-scale adversarial discriminators, and ScoreDec uses a score-based diffusion post-filter
  - BUT, these approaches suffer from slow convergence and adversarial instability
- Complex-Valued Neural Networks (CVNNs), in turn, can improve speech modeling

-> Hence the paper proposes EuleroDec, which brings CVNNs into an end-to-end neural codec

EuleroDec
- Uses a complex-valued RVQ-VAE to capture the algebraic structure of the STFT and its amplitude-phase coupling
- Additionally removes adversarial training and the diffusion-based post-filter to support end-to-end modeling

< Overview of EuleroDec >

An end-to-end neural codec built on a complex-valued RVQ-VAE
As a result, it achieves better performance than existing codecs

2. Background

- Residual Vector Quantization

In an RVQ-VAE-based codec, the encoder projects each complex STFT frame to a latent vector $\mathbf{z}\in\mathbb{R}^{H}$
- A stack of $M$ residual codebooks $\{\mathcal{E}^{(m)}\}_{m=1}^{M}$ iteratively approximates $\mathbf{z}$:
  (Eq. 1) $ \mathbf{r}^{(m)}=\mathbf{z}-\sum_{j<m}\mathbf{e}_{k_{j}}^{(j)},\,\,\, k_{m}=\arg\min_{k}\left|\left| \mathbf{r}^{(m)}-\mathbf{e}_{k}^{(m)}\right|\right|_{2}$
  - $\mathbf{e}_{k_{m}}^{(m)}$ : centroid selected at stage $m$
  - Only the index sequence $(k_{1},...,k_{M})$ is transmitted, giving a bitrate of $R_{f}M\log_{2}K$, where $R_{f}$ is the latent frame rate and $K$ is the codebook size
- Conventional spectral codecs have to split the complex STFT into two real signals before quantization:
  1. One option is to split the spectrum into the modulus $|X|$ and the unwrapped phase $\angle X$ and train two independent RVQ pipelines
  2. Alternatively, the spectrum can be separated into its real/imaginary components $\mathfrak{R}\{X\}, \mathfrak{I}\{X\}$ and passed to an RVQ cascade optimized with the Euclidean Mean-Squared Error
    - The outputs are then recombined as $\hat{X}=\hat{R}+j\hat{I}$ and passed to the iSTFT
- BUT, both approaches neglect the intrinsic correlation between magnitude and phase
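
The greedy residual search in Eq. 1 and the resulting bitrate can be sketched in a few lines of NumPy. This is only an illustrative toy, not the paper's implementation: the names `rvq_encode` and `bitrate_bps`, the random codebooks, and the sizes `H`, `K`, `M` are all assumptions chosen for the example.

```python
import numpy as np

def rvq_encode(z, codebooks):
    """Greedy residual VQ of one real latent vector z (shape [H]).

    codebooks: list of M arrays, each of shape [K, H] (hypothetical toy data).
    Returns the selected indices (k_1, ..., k_M) and the reconstruction.
    """
    residual = z.copy()
    z_hat = np.zeros_like(z)
    indices = []
    for E in codebooks:
        # Eq. 1: choose the centroid closest to the current residual.
        dists = np.sum((residual[None, :] - E) ** 2, axis=1)
        k = int(np.argmin(dists))
        indices.append(k)
        z_hat = z_hat + E[k]          # accumulate the reconstruction
        residual = residual - E[k]    # pass what is left to the next stage
    return indices, z_hat

def bitrate_bps(frame_rate_hz, num_codebooks, codebook_size):
    """Fixed-length bitrate R_f * M * log2(K) in bits per second."""
    return frame_rate_hz * num_codebooks * np.log2(codebook_size)

# Toy setup: later stages use smaller-scale codebooks (an assumption).
rng = np.random.default_rng(0)
H, K, M = 8, 16, 4
codebooks = [rng.normal(scale=0.5 ** m, size=(K, H)) for m in range(M)]
z = rng.normal(size=H)
indices, z_hat = rvq_encode(z, codebooks)
print(indices, np.linalg.norm(z - z_hat))
```

Only `indices` would be sent to the decoder, which rebuilds `z_hat` by summing the same centroids from its own copy of the codebooks.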

- Complex-Valued Neural Networks

A complex-valued neural network represents its inputs, weights, and activations as $z=x+iy$ and computes with true complex algebra
- Complex convolution is linear over $\mathbb{C}$ and couples the real/imaginary parts:
  (Eq. 2) $ (w*z)[n]=\sum_{k}w_{k}z_{n-k},\,\,\, w_{k}=a_{k}+ib_{k}, \,\,\,z_{n-k}=x_{n-k}+iy_{n-k}$
  - This coupling lets the model learn amplitude-phase interactions instead of treating $x,y$ as independent channels
- Phase equivariance means that, for any $\phi\in\mathbb{R}$:
  (Eq. 3) $f\left(e^{i\phi}z\right)=e^{i\phi}f(z)$
  - This preserves the geometry induced by $U(1)$ rotations
- The $\text{modReLU}$ activation leaves the phase intact and applies a threshold to the modulus:
  (Eq. 4) $\text{modReLU}(z)=\text{ReLU}(|z|+b)\frac{z}{|z|}$
- Normalization whitens $(x,y)$ jointly with a $2\times 2$ covariance instead of normalizing each part separately, thereby modeling the cross-channel dependence
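
As a quick numerical check of Eq. 2 to Eq. 4, the sketch below writes a 1D complex convolution in real arithmetic, applies modReLU, and checks that the composed layer commutes with a global phase rotation. The helper names, the small `eps` guard against division by zero, the bias value, and the random test data are assumptions made for the illustration; the paper's layers are 2D and learned.

```python
import numpy as np

def complex_conv1d(w, z):
    """Eq. 2 in real arithmetic: (a + ib) * (x + iy), full 1D convolution."""
    a, b = w.real, w.imag
    x, y = z.real, z.imag
    real = np.convolve(a, x) - np.convolve(b, y)   # real part mixes x and y
    imag = np.convolve(a, y) + np.convolve(b, x)   # imaginary part mixes them too
    return real + 1j * imag

def mod_relu(z, b, eps=1e-8):
    """Eq. 4: threshold the modulus, keep the phase (eps is an added guard)."""
    mag = np.abs(z)
    return np.maximum(mag + b, 0.0) * z / (mag + eps)

def layer(u, w, b=-0.1):
    """Toy 'conv + activation' layer; the bias value is hypothetical."""
    return mod_relu(complex_conv1d(w, u), b)

rng = np.random.default_rng(0)
w = rng.normal(size=3) + 1j * rng.normal(size=3)
z = rng.normal(size=16) + 1j * rng.normal(size=16)
rot = np.exp(1j * 0.7)  # arbitrary global phase rotation e^{i*phi}

# Eq. 3: rotating the input should rotate the output by the same phase.
equivariant = np.allclose(layer(rot * z, w), rot * layer(z, w))
print(equivariant)
```

Replacing `mod_relu` with a ReLU applied separately to the real and imaginary parts would generally break this check, which is one way to see why the activation is defined on the modulus.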

3. Method

The paper works in the $\texttt{complex64}$ domain with $x\in\mathbb{C}^{B\times C\times F\times T}$
- First, at 24kHz, a complex spectrogram over 256 frames is computed with $N_{FFT}=512, \text{win}=512, \text{hop}=64$ and a $\texttt{Hann window}$, and an RVQ with 2048-entry codebooks is used
  1. At 6 kbps, a temporal stride of 8 reduces the 256 frames to 32 latent frames; with fixed-length coding and 12 codebooks this gives 6.2 kbps
  2. At 12 kbps, a temporal stride of 4 doubles the token rate while keeping the same number of codebooks, giving $\approx$ 12.4 kbps
- Architecturally, a complex-valued VQ-VAE is used
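
The two bitrates follow directly from the settings above. The short calculation below reproduces them, assuming fixed-length coding with no side information and 1 kbps = 1000 bit/s; the constant and function names are chosen for this example only.

```python
import math

SAMPLE_RATE = 24_000    # Hz
HOP = 64                # STFT hop size in samples
CODEBOOK_SIZE = 2048    # entries per codebook -> 11 bits per index
NUM_CODEBOOKS = 12

def bitrate_kbps(temporal_stride):
    """Fixed-length bitrate for a given encoder temporal stride."""
    stft_frame_rate = SAMPLE_RATE / HOP                  # 375 STFT frames per second
    latent_frame_rate = stft_frame_rate / temporal_stride
    bits_per_index = math.log2(CODEBOOK_SIZE)            # 11.0
    return latent_frame_rate * NUM_CODEBOOKS * bits_per_index / 1000.0

print(bitrate_kbps(8))   # 6.1875  (reported as 6.2 kbps)
print(bitrate_kbps(4))   # 12.375  (reported as ~12.4 kbps)
```

The same numbers also confirm the frame counts: a stride of 8 maps 256 STFT frames to 256 / 8 = 32 latent frames.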

- Encoder and Decoder

The 4 downsampling stages follow an anisotropic $\text{freq}\times \text{time}$ schedule, and the decoder mirrors them with transposed convolutions
- The encoder has 5 complex residual layers with dilations $((1,1), (3,3), (3,5), (3,7), (1,1))$, enlarging the receptive field while maintaining stable complex statistics
  1. A complex $3\times 7$ convolution and 4 downsampling stages are then applied for hierarchical compression
  2. At each stage, a gated skip branch computes an adaptive complex average pooling of the input and applies a $1\times 1$ complex projection
  3. This branch is summed with the main path, which consists of complex downsampling, normalization, a $3\times 3$ complex convolution, complex axial self-attention, and a $1\times 1$ complex projection
    - The strided branch is summed with drop-path probability $p=0.05$
- The paper keeps the 2D spectrogram structure in the encoder so that spatial relations are retained
  - The decoder mirrors this mechanism without the pooling branch and applies 4 upsampling stages with frequency-axis attention and complex feed-forward blocks to restore the full-resolution complex spectrogram

- Vector Quantizer

Quantization is also performed in the complex domain
- The encoder output $z_{e}\in\mathbb{C}^{B\times C\times F\times T}$ is reshaped by collapsing frequency into the channel axis, producing $z_{e}^{\flat}\in\mathbb{C}^{B\times (C\cdot F)\times T}$
  - A complex linear projection $W_{in}\in\mathbb{C}^{D\times (C\cdot F)}$ maps the merged representation to the code dimension, after which an $S$-stage Residual Vector Quantizer is applied
- The codebooks are initialized after 30 optimization warm-up steps by sampling centroid seeds from the current continuous encoder embeddings and adding small complex Gaussian noise
  1. At each stage and for every time index, vector quantization selects the nearest complex centroid under the Hermitian-induced Euclidean metric
  2. That is, for $\mathcal{E}=\{e_{k}\}_{k=1}^{K}\subset \mathbb{C}^{D}$ and $x\in\mathbb{C}^{D}$:
    (Eq. 5) $ d_{k}(x)=||x||_{2}^{2}+||e_{k}||_{2}^{2}-2\text{Re}\left(x^{H}e_{k}\right)$
    (Eq. 6) $k^{*}(x)=\arg\min_{k}d_{k}(x)$
- Each stage's output accumulates the quantized reconstruction and updates the residual for the next stage
  1. Encoder stability is promoted by a commitment loss that pulls $z_{e}$ toward the assigned centroid:
    (Eq. 7) $\mathcal{L}_{commit}=\beta\frac{1}{N}\sum_{n=1}^{N}\left|\left| z_{e,n}-\text{sg}\left( e_{k^{*}(n)}\right)\right|\right|_{2}^{2}$
    - $\text{sg}$ : stop-gradient operation
  2. The codebooks are updated with exponential moving averages of the assignment counts and feature sums
  3. After the last stage, a complex linear map $W_{out}\in\mathbb{C}^{(C\cdot F)\times D}$ projects back, and frequency is unmerged for decoding to recover $z_{q}\in\mathbb{C}^{B\times C\times F\times T}$
- Additionally, the paper tracks the per-code usage $u_{k}$ and flags codes with $u_{k}\leq \tau$ as dead
  1. Each dead code is re-seeded with probability $p_{refresh}=0.015$ from a feature $x_{i}$ randomly sampled from the current mini-batch, with small complex Gaussian noise $\epsilon\sim\mathcal{CN}(0,\sigma^{2}I)$ added
    - $\sigma=0.001$
  2. It then sets $e_{k}\leftarrow x_{i}+\epsilon$, synchronizes the EMA buffer as $\bar{e}_{k}\leftarrow e_{k}$, and sets $u_{k}\leftarrow \tau+1$ to prevent immediate re-pruning
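
The quantizer steps above (Eq. 5 to Eq. 7 and the dead-code refresh) can be sketched with NumPy complex arrays. This is a minimal illustration under assumptions, not the authors' code: the function names, the values `beta=0.25` and `tau=1.0`, and the in-place update style are hypothetical, and since NumPy has no autograd the stop-gradient in Eq. 7 is only noted in a comment.

```python
import numpy as np

def hermitian_sq_dist(x, codebook):
    """Eq. 5 for a batch: x is [N, D], codebook is [K, D], both complex."""
    x_norm = np.sum(np.abs(x) ** 2, axis=1, keepdims=True)    # ||x||^2, [N, 1]
    e_norm = np.sum(np.abs(codebook) ** 2, axis=1)[None, :]   # ||e_k||^2, [1, K]
    cross = np.real(np.conj(x) @ codebook.T)                  # Re(x^H e_k), [N, K]
    return x_norm + e_norm - 2.0 * cross

def quantize_stage(x, codebook):
    """One RVQ stage: nearest centroid (Eq. 6), quantized value, residual."""
    idx = np.argmin(hermitian_sq_dist(x, codebook), axis=1)
    q = codebook[idx]
    return idx, q, x - q

def commitment_loss(z_e, q, beta=0.25):
    """Eq. 7; q plays the role of sg(e_k*), so no gradient would flow into it."""
    return beta * np.mean(np.sum(np.abs(z_e - q) ** 2, axis=1))

def refresh_dead_codes(codebook, ema_codebook, usage, batch, rng,
                       tau=1.0, p_refresh=0.015, sigma=1e-3):
    """Re-seed dead codes (usage <= tau) in place; returns the dead indices."""
    dead = np.flatnonzero(usage <= tau)
    for k in dead:
        if rng.random() >= p_refresh:
            continue  # each dead code is refreshed only with probability p_refresh
        x_i = batch[rng.integers(len(batch))]
        # Circular complex Gaussian CN(0, sigma^2 I): variance sigma^2 / 2 per part.
        noise = sigma * (rng.normal(size=x_i.shape)
                         + 1j * rng.normal(size=x_i.shape)) / np.sqrt(2.0)
        codebook[k] = x_i + noise
        ema_codebook[k] = codebook[k]   # keep the EMA buffer in sync
        usage[k] = tau + 1              # avoid immediate re-pruning
    return dead
```

Eq. 5 is simply the expansion of $||x-e_{k}||_{2}^{2}$ for complex vectors, so `hermitian_sq_dist` can be checked against a direct difference-norm computation; the expanded form is convenient because the cross term is a single matrix product.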

4. Experiments

- Settings

Dataset : LibriTTS
Comparisons : AudioDec, EnCodec, APCodec

- Results

Overall, EuleroDec achieves the best performance

Ablation Study
- Each component contributes to the performance improvement

Using a complex-valued AE yields the best results
