[Paper Review] DegVoC: Revisiting Neural Vocoder from a Degradation Perspective

feVeRin 2026. 3. 30. 13:05

DegVoC: Revisiting Neural Vocoder from a Degradation Perspective

Existing neural vocoders face a performance-cost trade-off
DegVoC
- Treats the Mel-spectrogram as the result of a signal degradation process applied to the target spectrum
- Uses the degradation prior to retrieve the initial spectral structure through a simple linear transformation, then introduces a deep prior solver that accounts for the heterogeneous distribution in the time-frequency domain
Paper (AAAI 2026) : Paper Link

1. Introduction

A vocoder reconstructs the target time-domain waveform from acoustic features
- In particular, Generative Adversarial Network (GAN)-based vocoders such as Multi-band MelGAN and HiFi-GAN achieve high quality
  - Diffusion-based vocoders such as RFWave and WaveFM are another option
- BUT, neural vocoders still face a performance-cost dilemma
  1. The vocoder task is formulated as conditional generation given the Mel-spectrum, so accurate acoustic modeling requires a large number of network parameters
  2. Diffusion models incur additional computational cost from their multiple reverse sampling steps

-> The paper therefore proposes DegVoC, which improves the performance-cost trade-off of existing vocoders

DegVoC
- Reformulates Mel modeling as a classical signal retrieval task from the signal degradation perspective
- Introduces an uneven sub-band division/merging strategy and a Large-Kernel Convolutional Attention Module (LKCAM) that models long-term temporal-spectral context

< Overview of DegVoC >

A lightweight, high-quality vocoder obtained by reformulating vocoding as a signal degradation task
As a result, it outperforms existing vocoders

2. Revisiting Audio Vocoder Task

Let $\{F, F_{m}, T\}$ denote the number of frequency bins, Mel-frequency bins, and frames. In the T-F domain, let $|\mathbf{S}|\in\mathbb{R}^{F\times T}$ be the magnitude spectrum of the target waveform and $\mathcal{A}\in\mathbb{R}^{F_{m}\times F}$ the Mel filter matrix, instantiated as a linear compression matrix
- The log-scale Mel-spectrum $\mathbf{X}^{mel}$ is defined by the following physical model:
  (Eq. 1) $\mathbf{X}^{mel}=\log(\mathcal{A}|\mathbf{S}|)$
- Absorbing the log operation into the left-hand side gives:
  (Eq. 2) $\mathbf{Y}=\exp(\mathbf{X}^{mel})=\mathcal{A}|\mathbf{S}|$
- Compared with the target complex spectrum $\mathbf{S}\in\mathbb{C}^{F\times T}$, the Mel-spectrum $\mathbf{Y}$ suffers from two degradations: Phase Information Loss and Linear Magnitude Compression
  1. The Mel-to-target process can therefore be viewed as the corresponding inverse processes, phase retrieval and magnitude recovery
  2. To bridge $\mathbf{Y}$ and $\mathbf{S}$, magnitude is first treated as a special case of the complex domain, generalizing the Mel-spectrum to the complex domain as $\underline{\mathbf{Y}}=\mathbf{Y}\exp(j0)$
    - $0$ : all-zero phase spectrum
  3. In addition, a residual term $\mathbf{E}=|\mathbf{S}|\exp(j0)-\mathbf{S}$ is introduced to reflect the phase gap between the target spectrum and the all-zero phase
- (Eq. 2) can then be rewritten as:
  (Eq. 3) $\underline{\mathbf{Y}}=\mathcal{A}(\mathbf{S}+|\mathbf{S}|\exp(j0)-\mathbf{S}) =\mathcal{A}\mathbf{S}+\mathbf{E}'$
  - $\mathbf{E}'=\mathcal{A}\mathbf{E}\in\mathbb{C}^{F_{m}\times T}$ : compressed residual in the Mel domain
- Comparing the Mel-spectrum with the compressed residual term $\mathbf{E}'$ (see the figure in the paper) shows that $\mathbf{E}'$ shares a similar structural pattern with the Mel-spectrum
  - This is because $\mathbf{E}'$ mainly reflects the phase difference and is weighted by the target spectral magnitude
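
The degradation model in (Eq. 2) and (Eq. 3) can be checked numerically. The following is a minimal NumPy sketch under stated assumptions: the sizes `F`, `F_m`, `T` are toy values, and `A` is a random non-negative matrix standing in for a real Mel filterbank (it is not the paper's filterbank).

```python
import numpy as np

rng = np.random.default_rng(0)
F, F_m, T = 16, 4, 5  # toy sizes (hypothetical), chosen so that F_m << F

# Random non-negative stand-in for the Mel filter matrix A (F_m x F).
A = rng.random((F_m, F))

# Random complex target spectrum S (F x T).
S = rng.standard_normal((F, T)) + 1j * rng.standard_normal((F, T))

Y = A @ np.abs(S)                      # (Eq. 2): linear-scale Mel-spectrum
Y_c = Y * np.exp(1j * 0.0)             # complex Mel-spectrum with all-zero phase
E = np.abs(S) * np.exp(1j * 0.0) - S   # residual E: gap between zero phase and true phase
E_prime = A @ E                        # compressed residual E' in the Mel domain

# (Eq. 3): the zero-phase Mel-spectrum equals A S plus the compressed residual.
print(np.allclose(Y_c, A @ S + E_prime))  # True
```

The identity holds by linearity of $\mathcal{A}$, so all of the information lost by the Mel-spectrum (phase and compression) is carried by $\mathcal{A}$ and $\mathbf{E}'$.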

(Eq. 3) can also be viewed as a classical optimization problem for signal recovery, where the goal is to restore $\mathbf{S}$ from the degraded observation $\underline{\mathbf{Y}}$
- The optimization over $\mathbf{S}$ is obtained via Maximum A Posteriori (MAP) estimation:
  (Eq. 4) $\log P(\mathbf{S}|\underline{\mathbf{Y}})\propto\log P(\underline{\mathbf{Y}}|\mathbf{S})+\log P(\mathbf{S})$
  - $\log P(\underline{\mathbf{Y}}|\mathbf{S})$ : log-likelihood, $\log P(\mathbf{S})$ : log-prior of the target spectrum
- Assuming the residual term $\mathbf{E}'$ follows a zero-mean time-varying complex Gaussian (TVCG) distribution, (Eq. 4) can be rewritten as:
  (Eq. 5) $\mathbf{S}^{*}=\arg\min_{\mathbf{S}}\frac{1}{\sigma^{2}_{t}}\left|\left| \underline{\mathbf{Y}}-\mathcal{A}\mathbf{S}\right|\right|^{2}_{2}+\alpha\mathcal{G}(\mathbf{S})$
  - $\sigma^{2}_{t}$ : time-varying variance, $\mathcal{G}(\cdot)$ : regularization function on $\mathbf{S}$, $\alpha$ : regularization weight
- This optimization problem is classically solved recursively with a 2-stage paradigm:
  1. Initialization Step
    - Recovers the basic structure from the degraded observation through a simple linear operation
  2. Alternating Update Step
    - Further restores signal detail under a pre-defined prior assumption on $\mathcal{G}(\cdot)$, such as the $L_{p}$ sparseness used in ISTA, and applies iterative estimation such as Proximal Gradient Descent (PGD) or ADMM
- Since DNNs can now model the prior term in a data-driven manner, the paper reformulates the solution for $\mathbf{S}$ into the following two sub-procedures:
  1. Initialization Solver
    - Transforms the Mel-spectrum into the linear T-F domain based on the linear degradation
  2. Deep Prior Solver
    - Uses a neural network to further recover the remaining spectral detail
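
For intuition, the classical 2-stage paradigm (linear initialization followed by iterative prior-driven updates) can be sketched with ISTA. This is an illustrative stand-in, not the paper's solver: it assumes a real-valued signal, a constant noise variance, and an $L_{1}$ prior, i.e. it minimizes $\frac{1}{2}\|y-As\|_{2}^{2}+\alpha\|s\|_{1}$. All names (`ista`, `soft_threshold`, `objective`) and sizes are hypothetical.

```python
import numpy as np

def soft_threshold(x, tau):
    """Proximal operator of tau * ||x||_1."""
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def objective(y, A, s, alpha):
    """0.5 * ||y - A s||^2 + alpha * ||s||_1."""
    return 0.5 * np.sum((y - A @ s) ** 2) + alpha * np.sum(np.abs(s))

def ista(y, A, s_init, alpha=0.05, n_iter=200):
    """Alternating update step: gradient step on the data term, then the prior's proximal step."""
    step = 1.0 / np.linalg.norm(A, 2) ** 2  # 1 / Lipschitz constant of the data-term gradient
    s = s_init.copy()
    for _ in range(n_iter):
        grad = A.T @ (A @ s - y)
        s = soft_threshold(s - step * grad, step * alpha)
    return s

rng = np.random.default_rng(0)
A = rng.standard_normal((8, 32))          # compressive measurement matrix (8 << 32)
s_true = np.zeros(32)
s_true[[3, 17, 25]] = [1.5, -2.0, 1.0]    # sparse ground truth
y = A @ s_true

s0 = A.T @ y                              # initialization step: a simple linear operation
s_hat = ista(y, A, s0)                    # alternating update step

print(objective(y, A, s0, 0.05), objective(y, A, s_hat, 0.05))
```

In DegVoC's reformulation, the hand-crafted proximal step is what the deep prior solver replaces, and the loop is reduced to a single pass.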

3. Method

DegVoC uses a 2-step optimization pipeline
- The first step recovers the basic spectral structure at the T-F scale
- The second step restores the remaining spectral detail through the deep prior solver

- Initialization Solver

The linear degradation prior can be used to retrieve the basic spectral structure
- For magnitude initialization, the following options are considered:
  1. Matrix Transpose
    - Treats $\mathcal{A}$ as a sampling matrix and initializes with its transpose, $\mathcal{A}^{\top}\mathbf{Y}$
    - $(\cdot)^{\top}$ : transpose operation
  2. Matrix Pseudo-Inverse
    - Since $F_{m}\ll F$, perfectly recovering the spectrum is impossible
    - The pseudo-inverse solution $\mathcal{A}^{\dagger}\mathbf{Y}$ is therefore used, where $\mathcal{A}^{\dagger}\in\mathbb{R}^{F\times F_{m}}$ is the pseudo-inverse of $\mathcal{A}$
  3. Learnable
    - Instead of the fixed transpose or pseudo-inverse matrix, a learnable matrix weight can be optimized together with the model
- The phase is initialized to all zeros
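
The difference between the transpose and pseudo-inverse initializations can be illustrated with a small NumPy sketch. The sizes and the random non-negative `A` are assumptions for illustration (a stand-in for a Mel filterbank with full row rank); the comparison checks how well each initialization reproduces the observed Mel-spectrum when projected back through `A`.

```python
import numpy as np

rng = np.random.default_rng(0)
F, F_m, T = 16, 4, 5                      # toy sizes (hypothetical), F_m << F

A = rng.random((F_m, F))                  # stand-in Mel filter matrix
S_mag = rng.random((F, T))                # ground-truth magnitude spectrum
Y = A @ S_mag                             # observed linear-scale Mel-spectrum

S_transpose = A.T @ Y                     # option 1: matrix transpose
S_pinv = np.linalg.pinv(A) @ Y            # option 2: Moore-Penrose pseudo-inverse

# Re-project both initializations to the Mel domain and compare with Y.
err_transpose = np.linalg.norm(A @ S_transpose - Y)
err_pinv = np.linalg.norm(A @ S_pinv - Y)
print(err_transpose, err_pinv)

# Zero-phase complex initialization, matching the all-zero phase setting.
S_init = S_pinv * np.exp(1j * 0.0)
```

With a full-row-rank `A`, the pseudo-inverse result is consistent with the observation (`A @ S_pinv` matches `Y` up to rounding), while the transpose result generally is not. Neither recovers `S_mag` exactly, and the pseudo-inverse output is not guaranteed to be non-negative, which is part of what the later refinement stage has to correct.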

- Deep Prior Solver

Existing neural vocoders use full-band modules such as ResNet and ConvNeXt to estimate the target spectrum
- BUT, these full-band modules neglect the hierarchical prior of the spectrum in the T-F domain
  - e.g., harmonic components are concentrated in the low-/mid-frequency region
- The paper therefore aims to model and encode the sub-band distribution, as in RFWave
- Structurally, the deep prior solver consists of a Hierarchical Band Division Module (HBDM), a Large Kernel Convolutional Attention Module (LKCAM), and a Hierarchical Band Merge Module (HBMM)
  - Unlike existing recursive models, DegVoC uses only a single iteration step
HBDM/HBMM
- Given the initialization solver output $\tilde{\mathbf{S}}^{(0)}\in\mathbb{C}^{F\times T}$, its real and imaginary parts are concatenated along the channel axis to convert it into a real-valued version:
  (Eq. 6) $\underline{\tilde{\mathbf{S}}}^{(0)}=\text{Cat}\left(\mathcal{R}\left(\tilde{\mathbf{S}}^{(0)}\right), \mathcal{I}\left(\tilde{\mathbf{S}}^{(0)}\right)\right) \in\mathbb{R}^{2\times F\times T}$
  - $\text{Cat}(\cdot)$ : concatenation operation, $\{\mathcal{R},\mathcal{I}\}$ : real/imaginary operation
- Since harmonic components lie in the low-/mid-frequency region, an uneven strategy is applied to the sub-band division, yielding $K$ regions
  1. For the $k$-th region, a separate Conv2d followed by Layer Normalization (LN) compresses the frequency size:
    (Eq. 7) $\mathbf{F}_{in,k}=\text{LN}\left(\text{Conv2d}\left(\underline{\tilde{\mathbf{S}}}_{k}^{(0)} \right)\right)\in\mathbb{R}^{C\times N_{k}\times T}$
    - $\{C,N_{k}\}$ : number of channels, number of compressed sub-bands
  2. All compressed representations are then concatenated:
    (Eq. 8) $\mathbf{F}_{in}=\text{Cat}\left(\mathbf{F}_{in,1},...,\mathbf{F}_{in,K}\right)\in \mathbb{R}^{C\times N\times T}$
    - $N=\sum_{k}N_{k}$
- Spectral decoding performs the opposite process
  1. The input $\mathbf{O}\in\mathbb{R}^{C\times N\times T}$ is split into $K$ regions, and each feature region $\mathbf{O}_{k}$ passes through a pointwise Conv2d, LN, and GELU
    - A Transposed Conv2d is then used for target estimation
  2. For spectral magnitude estimation an exponential function guarantees non-negativity, and $\text{Atan2}(\cdot)$ is used for phase estimation
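
The band division in (Eq. 6) to (Eq. 8) and the output heads can be sketched in NumPy as follows. This is a rough illustration under several assumptions that are not from the paper: the band layout in `bands`, the compression ratios, the random weights, the LN placement over the channel axis, and the use of two pseudo real/imaginary maps as inputs to `arctan2`. The per-band Conv2d is emulated by a strided linear projection, which has the same arithmetic as a bias-free Conv2d with kernel and stride `(ratio, 1)`.

```python
import numpy as np

rng = np.random.default_rng(0)
F, T, C = 32, 6, 4                                   # toy sizes (hypothetical)

S0 = rng.standard_normal((F, T)) + 1j * rng.standard_normal((F, T))
S0_real = np.stack([S0.real, S0.imag], axis=0)       # (Eq. 6): shape (2, F, T)

# Hypothetical uneven layout (start bin, end bin, compression ratio):
# low bands keep full resolution, high bands are compressed more.
bands = [(0, 8, 1), (8, 16, 2), (16, 32, 4)]

def compress_band(x, ratio, weight):
    """Strided projection along frequency. x: (2, F_k, T), weight: (C, 2, ratio)."""
    c_in, f_k, t = x.shape
    x = x.reshape(c_in, f_k // ratio, ratio, t)      # group `ratio` adjacent bins
    return np.einsum("oir,inrt->ont", weight, x)     # (C, N_k, T)

def layer_norm(x, eps=1e-5):
    """Normalize over the channel axis (assumed placement)."""
    mean = x.mean(axis=0, keepdims=True)
    var = x.var(axis=0, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

feats = []
for start, end, ratio in bands:
    w = rng.standard_normal((C, 2, ratio)) * 0.1     # separate weights per band
    feats.append(layer_norm(compress_band(S0_real[:, start:end], ratio, w)))  # (Eq. 7)
F_in = np.concatenate(feats, axis=1)                 # (Eq. 8): (C, N, T) with N = 8 + 4 + 4
print(F_in.shape)  # (4, 16, 6)

# Output heads: exp keeps the magnitude positive, arctan2 maps two real maps to a phase.
log_mag = rng.standard_normal((F, T))
p_re, p_im = rng.standard_normal((F, T)), rng.standard_normal((F, T))
mag = np.exp(log_mag)
phase = np.arctan2(p_im, p_re)                       # values in [-pi, pi]
S_hat = mag * np.exp(1j * phase)
```

The uneven layout spends more of the $N$ compressed sub-bands on the low-/mid-frequency region, where harmonic structure is dense, and fewer on the high-frequency region.
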
LKCAM
- High-frequency component generation can be guided by the low-/mid-frequency counterparts
  - The paper therefore introduces a convolution-style attention module with a large kernel size
- Structurally, LKCAM consists of $P=8$ blocks, each with two parts: a Convolutional Attention Unit (CAU) and a ConvFFN
  1. In the CAU, conventional self-attention is replaced with a convolutional modulation layer
    - The ConvFFN uses a DWConv2d with a residual connection for detail encoding
  2. Given the input $\mathbf{H}^{(p)}\in\mathbb{R}^{C\times N\times T}$, the features of the value $\mathbf{V}$ are modulated by $\mathbf{A}$, which is computed with a convolution operation instead of a similarity score matrix:
    (Eq. 9) $\mathbf{Z}^{(p)}=\mathbf{A}\otimes \mathbf{V}$
    (Eq. 10) $\mathbf{A}=\text{LKDWConv2d}\left(\text{GELU}\left(\text{PConv2d}\left( \text{LN}\left(\mathbf{H}^{(p)}\right)\right)\right)\right)$
    (Eq. 11) $\mathbf{V}=\text{PConv2d}\left(\mathbf{H}^{(p)}\right)$
    - $\text{LKDWConv2d}(\cdot)$ : depthwise Conv2d with a large kernel $(I_{f},I_{t})=(9,11)$ along the frequency/frame axes, $\otimes$ : elementwise multiplication
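
The convolutional modulation in (Eq. 9) to (Eq. 11) can be sketched in plain NumPy. This is a minimal illustration, not the paper's implementation: it assumes $\text{PConv2d}$ denotes a bias-free pointwise (1x1) convolution, that LN normalizes over the channel axis, and that the depthwise convolution uses zero "same" padding. The function names and the random weights are hypothetical.

```python
import numpy as np

def gelu(x):
    """GELU, tanh approximation."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

def layer_norm(x, eps=1e-5):
    """Normalize over the channel axis (assumed placement)."""
    mean = x.mean(axis=0, keepdims=True)
    var = x.var(axis=0, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def pconv2d(x, w):
    """Pointwise conv: mixes channels only. x: (C_in, N, T), w: (C_out, C_in)."""
    return np.einsum("oc,cnt->ont", w, x)

def dwconv2d(x, k):
    """Depthwise conv with zero 'same' padding. x: (C, N, T), k: (C, I_f, I_t), odd sizes."""
    _, i_f, i_t = k.shape
    n, t = x.shape[1:]
    xp = np.pad(x, ((0, 0), (i_f // 2, i_f // 2), (i_t // 2, i_t // 2)))
    out = np.zeros_like(x)
    for a in range(i_f):          # accumulate one kernel tap at a time
        for b in range(i_t):
            out += k[:, a, b][:, None, None] * xp[:, a:a + n, b:b + t]
    return out

def cau(h, w_a, k_a, w_v):
    a = dwconv2d(gelu(pconv2d(layer_norm(h), w_a)), k_a)   # (Eq. 10)
    v = pconv2d(h, w_v)                                    # (Eq. 11)
    return a * v                                           # (Eq. 9): elementwise modulation

rng = np.random.default_rng(0)
C, N, T = 4, 16, 12                                        # toy sizes (hypothetical)
h = rng.standard_normal((C, N, T))
z = cau(h,
        rng.standard_normal((C, C)) * 0.1,
        rng.standard_normal((C, 9, 11)) * 0.1,             # (I_f, I_t) = (9, 11)
        rng.standard_normal((C, C)) * 0.1)
print(z.shape)  # (4, 16, 12)
```

Unlike self-attention, whose score matrix grows quadratically with the number of T-F positions, the modulation weights here come from a fixed-size kernel, so the cost grows linearly with $N\times T$ while the 9x11 receptive field still lets low-/mid-frequency content influence neighboring bands.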

- Loss Function

Following RNDVoC, both reconstruction and adversarial losses are used
- The reconstruction loss consists of an amplitude loss $\mathcal{L}_{a}$, a real/imaginary loss $\mathcal{L}_{ri}$, a phase loss $\mathcal{L}_{p}$, a Mel loss $\mathcal{L}_{m}$, and a consistency loss $\mathcal{L}_{c}$:
  (Eq. 12) $\mathcal{L}_{rec}=\lambda_{a}\mathcal{L}_{a}+\lambda_{ri}\mathcal{L}_{ri}+\lambda_{p}\mathcal{L}_{p}+\lambda_{m}\mathcal{L}_{m}+\lambda_{c}\mathcal{L}_{c}$
  - $\{\lambda_{a},\lambda_{ri},\lambda_{p},\lambda_{m},\lambda_{c}\}=\{45, 45, 100, 45, 45\}$ : weights
- The adversarial loss is a hinge loss based on the Multi-Period Discriminator (MPD) and Multi-Resolution Discriminator (MRD):
  (Eq. 13) $\mathcal{L}_{D}=\frac{1}{M}\sum_{m=1}^{M}\left[\max(0,1-D_{m}(s))+\max(0,1+D_{m}(\tilde{s}))\right]$
  - $D_{m}$ : $m$-th sub-discriminator, $s$ / $\tilde{s}$ : target / generated waveform
- The generator's adversarial loss is:
  (Eq. 14) $\mathcal{L}_{g}=\frac{1}{M}\sum_{m=1}^{M}\max(0,1-D_{m}(\tilde{s}))$
- The final generator loss, which additionally includes a feature matching loss $\mathcal{L}_{fm}$, is:
  (Eq. 15) $\mathcal{L}_{G}=\mathcal{L}_{rec}+\lambda_{g}\mathcal{L}_{g}+\lambda_{fm}\mathcal{L}_{fm}$
  - $\{\lambda_{g},\lambda_{fm}\}=\{1,1\}$
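
The adversarial and total generator objectives can be sketched as below. This assumes the standard hinge formulation, in which the discriminator loss also includes a term on the real waveform $s$, and it averages each sub-discriminator's scores over its output elements (an assumption, since discriminator outputs are usually tensors). The function names and the toy scores are hypothetical; the weights are the ones listed above.

```python
import numpy as np

def d_hinge_loss(real_scores, fake_scores):
    """Discriminator hinge loss, averaged over the M sub-discriminators."""
    per_d = [np.mean(np.maximum(0.0, 1.0 - r)) + np.mean(np.maximum(0.0, 1.0 + f))
             for r, f in zip(real_scores, fake_scores)]
    return float(np.mean(per_d))

def g_hinge_loss(fake_scores):
    """Generator hinge loss (Eq. 14), averaged over the M sub-discriminators."""
    return float(np.mean([np.mean(np.maximum(0.0, 1.0 - f)) for f in fake_scores]))

REC_WEIGHTS = {"a": 45.0, "ri": 45.0, "p": 100.0, "m": 45.0, "c": 45.0}

def generator_total(rec_terms, l_g, l_fm, lam_g=1.0, lam_fm=1.0):
    """(Eq. 12) and (Eq. 15): weighted reconstruction terms plus adversarial and FM losses."""
    l_rec = sum(REC_WEIGHTS[name] * value for name, value in rec_terms.items())
    return l_rec + lam_g * l_g + lam_fm * l_fm

# Toy scores from M = 2 sub-discriminators (made-up numbers).
real = [np.array([2.0, 0.5]), np.array([1.0])]
fake = [np.array([-2.0, 0.0]), np.array([-1.0])]

l_d = d_hinge_loss(real, fake)
l_g = g_hinge_loss(fake)
l_total = generator_total({"a": 0.1, "ri": 0.1, "p": 0.1, "m": 0.1, "c": 0.1}, l_g, l_fm=0.5)
print(l_d, l_g, l_total)
```

The hinge saturates once a real score exceeds 1 or a fake score falls below -1, so confidently classified samples stop contributing gradient to the discriminator.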

4. Experiments

- Settings

Dataset : LibriTTS
Comparisons : HiFi-GAN, iSTFTNet, Avocodo, BigVGAN, APNet, Vocos, FreGrad, PriorGrad, RFWave, WaveFM

- Results

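Overall, DegVoC achieves the best performance among the compared vocoders

It also performs strongly on out-of-distribution samples

It also performs strongly in terms of MUSHRA score

DegVoC can recover harmonic detail

Ablation Study
- Using the pseudo-inverse operation yields better performance

The best performance is obtained with a kernel size of $\{9,11\}$

Both CAU and LKCAB are effective for improving performance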
