[Paper Review] ComVo: Toward Complex-Valued Neural Networks for Waveform Generation

feVeRin 2026. 4. 7. 13:03

ComVo: Toward Complex-Valued Neural Networks for Waveform Generation

iSTFT-based vocoders struggle to capture the inherent structure of the complex spectrogram
ComVo
- Uses native complex arithmetic in both the generator and the discriminator to provide structured feedback on complex-valued representations
- Introduces phase quantization to discretize phase values and regularize the training process
- Additionally improves training efficiency through block-matrix computation
Paper (ICLR 2026) : Paper Link

1. Introduction

Existing Generative Adversarial Network (GAN)-based, flow-based, and diffusion-based neural vocoders are limited by model complexity and inference latency
- To address this, the inverse Short-Time Fourier Transform (iSTFT) can be used, as in iSTFTNet, Vocos, and RFWave
- BUT, most iSTFT-based vocoders rely on a Real-Valued Neural Network (RVNN) that processes the real/imaginary parts as separate channels
  - Because of this separation, the coupling between the real/imaginary components is not modeled effectively
- Meanwhile, a Complex-Valued Neural Network (CVNN) extends a standard neural network to the complex domain by making both the inputs and the parameters complex-valued
  - In particular, a CVNN processes complex coefficients jointly, so it can model the cross-component interactions that an RVNN misses

-> Hence the paper proposes ComVo, which applies a CVNN to an iSTFT-based vocoder

ComVo
- A generator built from CVNN layers jointly models the real/imaginary parts of the spectrogram, and a complex Multi-Resolution Discriminator (cMRD) is applied to the resulting complex spectrogram
- Phase Quantization is introduced as an inductive bias for training stability, and Block-Matrix Computation is adopted to reduce the redundant computation of complex-valued operations

< Overview of ComVo >

A complex-valued iSTFT-based neural vocoder built on a CVNN and phase quantization
As a result, it outperforms existing methods

2. Preliminary Analysis of Real- and Complex-Valued Networks

Operating a complex-valued network directly in the complex field can effectively capture the interaction between magnitude and phase
- To examine this, the paper compares the following two models:
  1. RVNN : Represents a complex number as 2 real channels
  2. CVNN : Processes each coefficient as a single complex entity
- As a result, both the RVNN and the CVNN recover the broad structure of the target distribution, but the CVNN generates samples that adhere more closely to the underlying trajectory and shows a lower Jensen-Shannon Divergence (JSD) for both magnitude and phase
  - That is, when the data has an inherent real-imaginary dependency, modeling directly in the complex domain gives a representational advantage
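The JSD comparison above can be sketched as follows. This is a minimal illustration, not the paper's evaluation code: the histogram binning, the `max_mag` range, and all function names are assumptions introduced here.

```python
import cmath
import math

def histogram(values, bins, lo, hi):
    """Normalized histogram of `values` over [lo, hi] with equal-width bins."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for v in values:
        if lo <= v <= hi:
            counts[min(int((v - lo) / width), bins - 1)] += 1
    total = sum(counts)
    if total == 0:
        raise ValueError("no samples fall inside [lo, hi]")
    return [c / total for c in counts]

def kl(p, q):
    """KL(p || q) in nats; bins with p_i = 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jsd(p, q):
    """Jensen-Shannon divergence between two discrete distributions."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def magnitude_phase_jsd(generated, target, bins=32, max_mag=4.0):
    """JSD of the magnitude and phase histograms of two lists of complex samples.

    `bins` and `max_mag` are hypothetical settings chosen for illustration.
    """
    mag = jsd(histogram([abs(z) for z in generated], bins, 0.0, max_mag),
              histogram([abs(z) for z in target], bins, 0.0, max_mag))
    pha = jsd(histogram([cmath.phase(z) for z in generated], bins, -math.pi, math.pi),
              histogram([cmath.phase(z) for z in target], bins, -math.pi, math.pi))
    return mag, pha
```

With the natural logarithm, the JSD lies between 0 (identical histograms) and $\ln 2$ (disjoint histograms), so a lower value means the generated samples match the target distribution more closely.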

3. Method

ComVo is an iSTFT-based GAN vocoder that operates the generator/discriminator in the complex domain, preserving the real-imaginary interaction end-to-end
- To this end, it builds on an iSTFT synthesis pipeline with an adversarial training objective, and adds a phase quantization layer as an inductive bias and a block-matrix formulation for complex-valued computation

Jensen-Shannon Divergence (JSD) between the generated and ground-truth magnitude

- Generator

The paper builds on the Vocos architecture, a frame-level iSTFT vocoder
- First, all convolutions and normalizations in the generator operate in the complex domain
  - Here, a split GELU activation is applied to maintain the ConvNeXt-style block layout in the complex setting
- After the initial complex convolution, the phase quantization layer discretizes the phase values
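As a rough sketch of the split GELU activation mentioned above, assuming it follows the usual "split-type" convention of applying the real activation to the real and imaginary parts independently (the source does not spell out the definition, and the function names here are illustrative):

```python
import math

def gelu(x):
    """Exact GELU for a real scalar: x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def split_gelu(z):
    """Split-type activation (assumed form): GELU on Re(z) and Im(z) separately."""
    return complex(gelu(z.real), gelu(z.imag))

# Apply the activation element-wise to a few complex features.
features = [complex(1.0, -1.0), complex(-0.5, 2.0), 0j]
activated = [split_gelu(z) for z in features]
```

Because the two parts are activated independently, this nonlinearity keeps the layer layout of a real-valued block, but on its own it does not couple the real and imaginary components.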

- Discriminator

The complex Multi-Resolution Discriminator (cMRD) takes the complex spectrogram as input and consists of multiple sub-discriminators operating at different STFT resolutions
- During training, the adversarial loss is applied to the real and imaginary parts separately, and reshaped waveform segments are processed by HiFi-GAN's Multi-Period Discriminator (MPD), which consists of multiple discriminators with different periods
  - Since the MPD operates at the waveform level, it uses a real-valued network
- As a result, the overall training objective combines the feature matching loss, the reconstruction loss, and the adversarial losses of the cMRD and the MPD
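The "reshaped waveform segment" that each MPD sub-discriminator sees can be illustrated with a small sketch. It folds a 1D waveform into rows of length `period`, so that each column holds samples spaced `period` apart. The zero padding and the function name are simplifying assumptions made here, not details from the paper.

```python
def reshape_by_period(waveform, period):
    """Fold a 1D waveform into rows of length `period`.

    The tail is zero-padded so the length becomes a multiple of `period`
    (a simplification chosen for this sketch).
    """
    pad = (-len(waveform)) % period
    padded = list(waveform) + [0.0] * pad
    return [padded[i:i + period] for i in range(0, len(padded), period)]

# Each period yields a different 2D view of the same waveform.
wave = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7]
views = {p: reshape_by_period(wave, p) for p in (2, 3, 5)}
```

Using several periods gives each sub-discriminator a different periodic view of the same signal, which is why the MPD stays a real-valued network working directly on the waveform.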

- Phase Quantization Layer

The nonlinearity of a complex-valued network must be able to handle the real/imaginary components jointly
- Therefore, the paper represents each mel-spectrogram as a complex value by initializing its imaginary part to 0, and introduces a phase quantization layer that discretizes the phase angle into a fixed set of levels
  - This provides a structured nonlinearity that preserves relative phase relationships and mitigates phase drift during training
- For a complex feature $z=re^{i\theta}$, the quantized phase is:
  (Eq. 1) $\theta_{q}=\frac{2\pi}{N_{q}}\cdot\text{round}\left(\frac{N_{q}}{2\pi}\theta\right)$
  - $r\geq 0$ : magnitude, $\theta\in(-\pi,\pi]$ : principal phase, $N_{q}$ : number of quantization levels
- The quantized complex value is then reconstructed as:
  (Eq. 2) $z_{q}=re^{i\theta_{q}}$
- Quantizing the phase by mapping a continuous angle onto a fixed set of levels introduces an inherent discontinuity that blocks gradient propagation
  1. Therefore, the paper adopts the Straight-Through Estimator (STE) for end-to-end differentiability, preserving gradient propagation through the phase quantization layer and improving optimization stability
  2. Additionally, restricting the phase values to a discrete set acts as a form of regularization
    - This limits unwarranted phase variability in intermediate representations and guides the network to learn coherent, structured phase patterns
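Eq. 1 and Eq. 2 can be checked with a small scalar sketch. The function names are illustrative, and the STE is shown only as a hand-written backward rule; in an autograd framework it is commonly written as `theta + stop_gradient(theta_q - theta)`, which is an assumption about implementation rather than something stated in the source.

```python
import cmath
import math

def quantize_phase(z, n_levels):
    """Snap the phase of z to the nearest multiple of 2*pi / n_levels (Eq. 1, 2)."""
    r, theta = cmath.polar(z)             # r >= 0, theta is the principal phase
    step = 2.0 * math.pi / n_levels
    theta_q = step * round(theta / step)  # Python's round() sends exact ties to even
    return cmath.rect(r, theta_q)         # z_q = r * exp(i * theta_q)

def ste_phase_grad(grad_wrt_theta_q):
    """Straight-through backward rule: treat d(theta_q)/d(theta) as 1."""
    return grad_wrt_theta_q

# With 8 levels the step is pi/4, so a phase of 1.0 rad snaps to pi/4.
z_q = quantize_phase(cmath.rect(1.5, 1.0), 8)
```

The magnitude $r$ passes through unchanged; only the angle is restricted to the $N_{q}$ allowed levels, which is what makes the layer act as a structured nonlinearity.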

- Optimizing Complex Computation with Block Matrices

To improve efficiency in both the forward and backward passes, CVNN operations are reformulated as real-valued block-matrix multiplications
- In existing auto-differentiation systems, a complex-valued layer explicitly tracks the real/imaginary components as separate real-valued tensors
  - This causes redundant operations and inefficient memory access in the forward/backward passes
- Therefore, the paper introduces a block-wise formulation that represents a complex value as a structured pair of real values and processes them jointly through a unified matrix operation
  1. First, the forward complex operation is:
    (Eq. 3) $\begin{bmatrix} \text{Re}(z') \\ \text{Im}(z') \end{bmatrix} =\begin{bmatrix} W_{r} & -W_{i} \\ W_{i} & W_{r} \\ \end{bmatrix}\begin{bmatrix} x \\ y \end{bmatrix}$
    - $z=x+iy$, where $x,y$ are the real/imaginary input vectors
    - $W=W_{r}+iW_{i}$ : complex weight matrix, where $W_{r}, W_{i}$ are its real/imaginary parts
    - $z'$ : resulting complex output
  2. The backward gradient computation is:
    (Eq. 4) $\begin{bmatrix} \frac{\partial L}{\partial x} \\ \frac{\partial L}{\partial y} \end{bmatrix} =\begin{bmatrix} W_{r} & -W_{i} \\ W_{i} & W_{r} \\ \end{bmatrix}^{\top}\begin{bmatrix} g_{r} \\ g_{i} \end{bmatrix}$
    - $g_{r},g_{i}$ : real/imaginary components of the gradient from the next layer
  3. This block-wise multiplication replaces 4 independent real-valued multiplies with a single block-matrix multiply, eliminating redundant computation and supporting efficient gradient evaluation
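The block-matrix formulation in Eq. 3 and Eq. 4 can be verified numerically with plain Python lists. This is a toy sketch for checking the algebra, not an efficient implementation; all helper names are introduced here for illustration.

```python
def build_block(W):
    """Real block matrix [[Wr, -Wi], [Wi, Wr]] for a complex matrix W (list of rows)."""
    top = [[w.real for w in row] + [-w.imag for w in row] for row in W]
    bottom = [[w.imag for w in row] + [w.real for w in row] for row in W]
    return top + bottom

def matvec(A, v):
    """Real matrix-vector product."""
    return [sum(a * b for a, b in zip(row, v)) for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def forward(W, z):
    """Eq. 3: one real matvec on the stacked [x; y] gives [Re(z'); Im(z')]."""
    m = len(W)
    out = matvec(build_block(W), [c.real for c in z] + [c.imag for c in z])
    return [complex(out[k], out[m + k]) for k in range(m)]

def backward(W, g):
    """Eq. 4: transposed block matrix times the stacked upstream gradient [g_r; g_i]."""
    n = len(W[0])
    out = matvec(transpose(build_block(W)), [c.real for c in g] + [c.imag for c in g])
    return out[:n], out[n:]  # dL/dx, dL/dy

W = [[complex(1, 2), complex(3, -1)], [complex(0, 0.5), complex(2, 0)]]
z = [complex(1, -1), complex(2, 0.5)]
z_out = forward(W, z)  # matches the native complex product W @ z
```

The forward result agrees with ordinary complex multiplication because $(W_{r}+iW_{i})(x+iy)=(W_{r}x-W_{i}y)+i(W_{i}x+W_{r}y)$, which is exactly the two block rows of Eq. 3.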

4. Experiments

- Settings

Dataset : LibriTTS
Comparisons : HiFi-GAN, iSTFTNet, BigVGAN, Vocos

- Results

Overall, ComVo achieves the best performance

It also performs strongly on the MUSDB18 dataset

ComVo is also the best in subjective evaluation

Complex-Valued Modeling
- With the cMRD, structured spectral patterns can be traced consistently across all sub-discriminators

The best performance is achieved when combining the complex-valued generator and the complex-valued discriminator, $G_{C}D_{C}$

Phase Quantization
- The quantization level $N_{q}=128$ gives the best trade-off

Block-Matrix Computation Scheme
- Compared with the naive implementation, using block matrices achieves better performance
