[Paper Review] RFWave: Multi-Band Rectified Flow for Audio Waveform Reconstruction

feVeRin 2025. 3. 9. 12:24

RFWave: Multi-Band Rectified Flow for Audio Waveform Reconstruction

Diffusion models are effective for waveform reconstruction, but they need many sampling steps, which causes latency problems
RFWave
- Generates complex spectrograms and processes all subbands simultaneously at the frame level
- Adopts Rectified Flow to obtain a straight transport trajectory
Paper (ICLR 2025) : Paper Link

1. Introduction

Audio waveform reconstruction aims to convert low-dimensional features derived from raw audio data into perceptible sound
- Autoregressive methods such as LPCNet generate slowly because they predict sample points sequentially
- Generative Adversarial Networks (GANs) such as MelGAN, Parallel WaveGAN, HiFi-GAN, and Vocos predict sample points in parallel, which makes generation faster
  - BUT, GAN-based waveform reconstruction models suffer from difficult discriminator design and mode collapse
- Diffusion models such as WaveGrad, DiffWave, PriorGrad, FreGrad, FastDiff, and SpecGrad also offer stable training and high-quality waveform reconstruction
- BUT, diffusion models generate more slowly than GANs:
  1. Many sampling steps are needed to obtain high-quality samples
  2. They operate at the waveform sample-point level
    - Moving from frame-rate resolution to sample-rate resolution requires multiple upsampling operations, so the sequence length grows and GPU memory and computational demand increase

-> The paper therefore proposes RFWave, a diffusion-type waveform reconstruction model that reaches the generation speed of GAN-based methods

RFWave
- Adopts Rectified Flow, which connects data and noise along a straight line, to address slow sampling
- Works at the STFT frame level to address the computational demand of sample-point-level modeling
- Additionally combines three enhanced loss functions (Energy-balanced loss, Overlap loss, STFT loss) with an optimized sampling strategy

< Overview of RFWave >

Combines a multi-band strategy built on a ConvNeXt-V2 backbone with Rectified Flow and the enhanced losses to reduce the number of sampling steps
As a result, it outperforms existing diffusion-based models

2. Background - Rectified Flow

Rectified Flow provides an Ordinary Differential Equation (ODE)-based framework for generative modeling and domain transfer
- Given empirical observations of two distributions $\pi_{0}, \pi_{1}$ on $\mathbb{R}^{d}$, consider a mapping that connects them:
  (Eq. 1) $\frac{\text{d}Z_{t}}{\text{d}t}=v(Z_{t},t),\,\,\,\text{initialized from}\,\, Z_{0}\sim\pi_{0},\,\, \text{such that}\,\,Z_{1}\sim\pi_{1}$
  - $v:\mathbb{R}^{d}\times[0,1]\rightarrow\mathbb{R}^{d}$ : velocity field
- To learn this field, a mean squared objective is minimized:
  (Eq. 2) $\min_{v}\mathbb{E}_{(X_{0},X_{1})\sim\gamma}\left[\int_{0}^{1}\left|\left| \frac{\text{d}}{\text{d}t}X_{t}-v(X_{t},t)\right|\right|^{2}\text{d}t\right],\,\,\,\text{with}\,\,X_{t}=\phi(X_{0},X_{1},t)$
  - $X_{t}=\phi(X_{0},X_{1},t)$ : any time-differentiable interpolation between $X_{0},X_{1}$, with $\frac{\text{d}}{\text{d}t}X_{t}=\partial_{t}\phi(X_{0},X_{1},t)$
  - $\gamma$ : any coupling of $(\pi_{0},\pi_{1})$ (e.g., the independent coupling $\gamma=\pi_{0}\times \pi_{1}$, which can be sampled empirically from separately observed data of $\pi_{0},\pi_{1}$)
- A simple choice of interpolation is:
  (Eq. 3) $X_{t}=(1-t)X_{0}+tX_{1}\Rightarrow \frac{\text{d}}{\text{d}t}X_{t}=X_{1}-X_{0}$
  - This simplification yields a linear trajectory, which accelerates inference
- In general, the velocity field $v$ is represented by a deep neural network, and the solution of (Eq. 2) is approximated with stochastic gradient methods
  1. A numerical solver such as the forward Euler method is used to approximate the ODE in (Eq. 1)
  2. That is, values are computed with (Eq. 4):
    (Eq. 4) $Z_{t+\frac{1}{n}}=Z_{t}+\frac{1}{n}v(Z_{t},t),\,\,\,\forall t\in\{0,...,n-1\}/n$
    - The simulation runs for $n$ steps with step interval $\epsilon=1/n$
  3. The velocity field can incorporate conditional information, so it can reconstruct waveforms from compressed acoustic representations
    - Thus, with $\mathcal{C}$ denoting the conditional information for $X_{1}$, the velocity field $v(\cdot,t)$ in (Eq. 1) and (Eq. 2) is modified to $v(\cdot,t|\mathcal{C})$
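
The interpolation of (Eq. 3) and the Euler update of (Eq. 4) can be sketched in a few lines of NumPy. This is only an illustrative sketch: `interpolate`, `euler_sample`, and the constant "oracle" velocity (which assumes a single known pair $X_{0},X_{1}$ instead of a trained network) are hypothetical names and simplifications, not code from the paper.

```python
import numpy as np

def interpolate(x0, x1, t):
    """Eq. 3: linear interpolation X_t and its time derivative X_1 - X_0."""
    x_t = (1.0 - t) * x0 + t * x1
    target_velocity = x1 - x0
    return x_t, target_velocity

def euler_sample(velocity_fn, z0, n_steps):
    """Eq. 4: forward Euler with equal step interval 1/n, t in {0, ..., n-1}/n."""
    z = z0.copy()
    for k in range(n_steps):
        t = k / n_steps
        z = z + velocity_fn(z, t) / n_steps
    return z

rng = np.random.default_rng(0)
x0 = rng.standard_normal(4)            # noise sample, plays the role of X_0
x1 = np.array([1.0, -2.0, 0.5, 3.0])   # toy "data" sample, plays the role of X_1

# Assumption: an oracle velocity for this single pair (a trained network
# v(Z_t, t | C) would be used in practice).
oracle_velocity = lambda z, t: x1 - x0

z1 = euler_sample(oracle_velocity, x0, n_steps=10)
```

With a perfectly straight trajectory the velocity is constant, so the Euler iterate lands on `x1` regardless of the number of steps; this is the motivation for making the learned trajectory as straight as possible.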

3. Method

RFWave uses a Multi-Band Rectified Flow with a ConvNeXt-V2 backbone, which supports modeling noisy samples in either the time or the frequency domain while the neural network itself operates at the STFT frame level
- This lets RFWave generate high-quality waveforms with only 10 sampling steps

- Multi-Band Rectified Flow

Model Structure
- In RFWave, all frequency bands, each distinguished by a unique subband index, share the same model
  1. The subbands of a given sample are grouped into a single batch, which supports simultaneous training/inference
  2. This reduces inference latency, and modeling the subbands independently reduces error accumulation
    - If higher bands were conditioned on lower bands, inaccuracies in the lower bands could adversely affect the higher bands during inference through error accumulation
- Structurally, the model maps a noisy sample $X_{t}$ to a velocity $v_{t}$
  1. For each subband, the subband noisy sample $X_{t}^{i_{sb}}$ is fed to the ConvNeXt-V2 backbone,
  2. which predicts the velocity $v_{t}^{i_{sb}}$ conditioned on time $t$, subband index $i_{sb}$, conditional input (mel-spectrogram or EnCodec tokens) $\mathcal{C}$, and an optional EnCodec bandwidth index $i_{bw}$
- The ConvNeXt-V2 backbone uses Fourier features
  1. $X_{t}^{i_{sb}},t,\mathcal{C}$, and the Fourier features are concatenated along the channel dimension and passed through a linear layer to form the input to the ConvNeXt-V2 blocks
  2. The sinusoidal $t$ embedding, together with the optional $i_{bw}$ embedding, is added element-wise to the input of each ConvNeXt-V2 block
  3. $i_{bw}$ is used when decoding EnCodec tokens so that a single model can support EnCodec tokens of various bandwidths
    - Additionally, $i_{sb}$ is incorporated through an Adaptive Layer Normalization module with learnable embeddings, as in Vocos
- RFWave offers two modeling options
  1. Mapping Gaussian noise directly to the waveform in the time domain
    - $X_{0},X_{1},X_{t},v_{t}$ all reside in the time domain
  2. Mapping Gaussian noise to the complex spectrogram, with $X_{0},X_{1},X_{t},v_{t}$ all residing in the frequency domain
    - The subband quantities $X_{t}^{i_{sb}}, v_{t}^{i_{sb}}$ are consistently represented in the frequency domain, so the neural network operates at the frame level
- As a result, because RFWave processes frame-level features, it is more memory-efficient than existing diffusion vocoders such as PriorGrad that operate at the waveform sample-point level
  - In practice, PriorGrad handles a 6s audio clip at 44.1kHz on a 30GB GPU, whereas RFWave can process a 117s clip with the same memory
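
The idea of grouping all subbands of a sample into one batch for a shared model can be illustrated as follows. This is a shape-level sketch only: `batch_subbands`, `unbatch_subbands`, and `toy_shared_model` are hypothetical helpers, and the toy model merely scales its input by the subband index in place of the real ConvNeXt-V2 backbone.

```python
import numpy as np

def batch_subbands(x_sb):
    """[B, n_sb, C, F] -> ([B * n_sb, C, F], subband index per batch row)."""
    batch, n_sb, channels, frames = x_sb.shape
    flat = x_sb.reshape(batch * n_sb, channels, frames)
    i_sb = np.tile(np.arange(n_sb), batch)   # 0..n_sb-1 repeated for each sample
    return flat, i_sb

def unbatch_subbands(v_flat, batch):
    """Inverse of batch_subbands for the predicted velocities."""
    n_sb = v_flat.shape[0] // batch
    return v_flat.reshape(batch, n_sb, *v_flat.shape[1:])

def toy_shared_model(x, i_sb):
    # Hypothetical stand-in for the shared backbone: one set of "weights",
    # with behavior modulated only by the subband index.
    return x * (1.0 + i_sb)[:, None, None]

x = np.ones((2, 4, 6, 5))                    # B=2 samples, 4 subbands, C=6, F=5
flat, i_sb = batch_subbands(x)
v = unbatch_subbands(toy_shared_model(flat, i_sb), batch=2)
```

All subbands go through a single forward pass, and no subband output depends on another subband's prediction.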

Operating with $X_{t}$ in the Time Domain and Waveform Equalization
- RFWave is designed to operate at the frame level, so STFT/iSTFT must be used when $X_{t},v_{t}$ are in the time domain
  - Here $X_{1}$ is the waveform, $X_{0}$ is noise of identical shape, and $X_{t},v_{t}$ are derived from (Eq. 3)
- Let $T$ be the waveform length in sample points; then $X_{t},v_{t}$ have dimension $[1,T]$
  1. After the STFT, the full-band complex spectrogram is divided equally and the subbands are extracted to obtain $X_{t}^{i_{sb}}$
  2. Each subband $X_{t}^{i_{sb}}$ is processed independently by the backbone to predict $v_{t}^{i_{sb}}$
    - These predictions are merged back before the iSTFT
  3. The $X_{t}^{i_{sb}}$ processed by the backbone therefore has dimension $[2d_{s},F]$
    - $d_{s}$ : number of frequency bins in the subband complex spectrum, $F$ : number of frames
    - The real/imaginary parts of each subband are interleaved to form the $2d_{s}$-dimensional feature
- White Gaussian noise has uniform energy across frequency bands, whereas the energy profile of a waveform varies across bands
  1. Energy equalization across bands can therefore be useful for diffusion model training as well
  2. The time-domain model thus uses a Pseudo-Quadrature Mirror Filter (PQMF) bank to decompose the input waveform into subbands
    - These subbands are equalized and then recombined to obtain the equalized waveform
    - PQMF is used only for waveform equalization and is unrelated to dividing the complex spectrogram into subbands
  3. Waveform equalization uses mean-variance normalization, with exponential moving averages of the per-subband waveform mean/variance computed during training
    - Using the same statistics guarantees that the transformation can be inverted effectively
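
The mean-variance equalization with exponential-moving-average statistics can be sketched as below. Several things are assumptions here: the PQMF analysis/synthesis is omitted (the input `bands` is taken to be the already decomposed subband signals), and the class name `SubbandEqualizer`, the `momentum` value, and `eps` are hypothetical choices that the post does not specify.

```python
import numpy as np

class SubbandEqualizer:
    """Per-subband mean-variance normalization with EMA statistics (sketch)."""

    def __init__(self, n_bands, momentum=0.9, eps=1e-8):
        self.mean = np.zeros(n_bands)
        self.var = np.ones(n_bands)
        self.momentum = momentum   # assumed EMA decay, not taken from the paper
        self.eps = eps

    def update(self, bands):
        """Update EMA statistics from one batch of subband signals [n_bands, T]."""
        m = self.momentum
        self.mean = m * self.mean + (1.0 - m) * bands.mean(axis=1)
        self.var = m * self.var + (1.0 - m) * bands.var(axis=1)

    def equalize(self, bands):
        return (bands - self.mean[:, None]) / np.sqrt(self.var[:, None] + self.eps)

    def invert(self, equalized):
        # The same stored statistics are reused, so the transform is invertible.
        return equalized * np.sqrt(self.var[:, None] + self.eps) + self.mean[:, None]

rng = np.random.default_rng(0)
# Toy subband signals with very different energy per band.
bands = rng.standard_normal((3, 1000)) * np.array([[0.01], [1.0], [10.0]])

eq = SubbandEqualizer(n_bands=3)
for _ in range(200):               # simulate repeated training updates
    eq.update(bands)

equalized = eq.equalize(bands)
restored = eq.invert(equalized)
```

After equalization every band has roughly unit variance, which matches the flat energy profile of white Gaussian noise, and `invert` recovers the original signals.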

Alternative Approach: Operating with $X_{t}$ in the Frequency Domain and STFT Normalization
- When $X_{t}, v_{t}$ are in the frequency domain, no STFT/iSTFT is needed at each sampling step
  - i.e.) $X_{1}$ is the complex spectrogram of the waveform and $X_{0}$ is noise of identical shape
- Let $d$ be the number of frequency bins of the complex spectrogram; then $X_{t}, v_{t}$ have dimension $[2d,F]$
  1. Equally partitioning the full-band complex spectrogram yields $X^{i_{sb}}_{t}$ of shape $[2d_{s},F]$, which is processed by the ConvNeXt-V2 backbone
  2. In the frequency-domain model, the waveform is transformed into the complex spectrogram without equalization
    - Preprocessing applies dimension-wise mean-variance normalization to the complex spectrogram
- At inference, using $X_{t},v_{t}$ in the time domain requires both STFT and iSTFT at every sampling step
  - In contrast, when $X_{t},v_{t}$ are in the frequency domain, only a single iSTFT is needed after the entire sampling process
- In terms of results, despite its computational overhead, the time-domain configuration outperforms the frequency-domain one and is better at preserving high-frequency detail
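
The $[2d,F]$ and $[2d_{s},F]$ shapes can be made concrete with a small sketch. The exact row layout (real and imaginary rows alternating per frequency bin) and the requirement that $d$ be divisible by the number of subbands are assumptions made for illustration; all function names are hypothetical.

```python
import numpy as np

def interleave_complex(spec):
    """[d, F] complex -> [2d, F] real, rows ordered re_0, im_0, re_1, im_1, ..."""
    d, frames = spec.shape
    feat = np.empty((2 * d, frames), dtype=spec.real.dtype)
    feat[0::2] = spec.real
    feat[1::2] = spec.imag
    return feat

def deinterleave(feat):
    """Inverse of interleave_complex."""
    return feat[0::2] + 1j * feat[1::2]

def split_equal(feat, n_sb):
    """[2d, F] -> [n_sb, 2d_s, F]; assumes d is divisible by n_sb."""
    rows, frames = feat.shape
    assert rows % (2 * n_sb) == 0, "d must be divisible by n_sb in this sketch"
    return feat.reshape(n_sb, rows // n_sb, frames)

def merge(subbands):
    """[n_sb, 2d_s, F] -> [2d, F]."""
    n_sb, rows, frames = subbands.shape
    return subbands.reshape(n_sb * rows, frames)

rng = np.random.default_rng(0)
spec = rng.standard_normal((16, 5)) + 1j * rng.standard_normal((16, 5))  # d=16, F=5

feat = interleave_complex(spec)          # [2d, F] = [32, 5]
subbands = split_equal(feat, n_sb=4)     # [4, 2d_s, F] with d_s = 4
recovered = deinterleave(merge(subbands))
```

Because each subband keeps the real and imaginary rows of its own bins together, merging the subbands and de-interleaving gives back the full-band complex spectrogram.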

- Loss Function

Energy-balanced Loss
- With the Mean Squared Error (MSE) of (Eq. 2), low-volume noise tends to appear in regions that should be silent
  1. This is because MSE measures the absolute distortion between the predicted value and the ground truth
  2. Small absolute errors in silent regions contribute minimally to the overall MSE loss, so the model does not prioritize them during training
- As a result, the model fails to suppress minor deviations in silent areas, which produces perceptible noise
  - Conversely, large errors in high-amplitude regions strongly affect the MSE loss, so training focuses on reducing errors there
- RFWave therefore introduces the Energy-balanced loss to address this
  1. The energy-balanced loss weights errors differently along the time axis according to the volume/energy of each region
    - For each frequency subband, the standard deviation of the ground-truth velocity is computed along the feature dimension, giving a weighting coefficient of size $[1,F]$
    - This vector reflects the frame-level energy of the subband
  2. The ground-truth and predicted velocities are then divided by this vector; for the frequency-domain model, the training objective of (Eq. 2) becomes:
    (Eq. 5) $\min_{v}\mathbb{E}_{X_{0}\sim\pi_{0},(X_{1},\mathcal{C})\sim D}\left[\int_{0}^{1}\left|\left| (X_{1}-X_{0})/\sigma-v(X_{t},t|\mathcal{C})/\sigma\right|\right|^{2}\text{d}t\right],\,\,\,\text{with}\,\, \sigma=\sqrt{\text{Var}_{1}(X_{1}-X_{0})}\,\,\text{and}\,\, X_{t}=tX_{1}+(1-t)X_{0}$
    - $D$ : dataset of paired $X_{1},\mathcal{C}$, $\text{Var}_{1}$ : variance along the feature dimension
    - For the time-domain model, the energy balancing operation precedes the iSTFT
- The energy-balanced loss minimizes the relative error in low-volume regions, which improves overall performance
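
A per-sample version of (Eq. 5) for one subband of shape $[C,F]$ might look like the sketch below. The function names and the small `eps` added for numerical safety are assumptions, and the expectation over $t$ and the dataset is left out.

```python
import numpy as np

def energy_balanced_loss(v_pred, x0, x1, eps=1e-8):
    """Eq. 5 for one subband: v_pred, x0, x1 all have shape [C, F]."""
    target = x1 - x0                                     # ground-truth velocity
    # Std along the feature dimension -> one weight per frame, shape [1, F].
    sigma = np.sqrt(target.var(axis=0, keepdims=True)) + eps
    return np.mean((target / sigma - v_pred / sigma) ** 2)

def plain_mse(v_pred, x0, x1):
    """Unweighted objective of Eq. 2 for comparison."""
    return np.mean(((x1 - x0) - v_pred) ** 2)

rng = np.random.default_rng(0)
target = rng.standard_normal((8, 6))                     # toy velocity, C=8, F=6
error = 0.1 * rng.standard_normal((8, 6))                # toy prediction error
frame_gain = np.array([[0.1, 1.0, 10.0, 0.5, 2.0, 5.0]]) # quiet and loud frames

zeros = np.zeros_like(target)
loss_ref = energy_balanced_loss(target + error, zeros, target)
loss_scaled = energy_balanced_loss((target + error) * frame_gain, zeros, target * frame_gain)
mse_ref = plain_mse(target + error, zeros, target)
mse_scaled = plain_mse((target + error) * frame_gain, zeros, target * frame_gain)
```

Scaling a frame's velocity and its error by the same gain leaves the energy-balanced loss essentially unchanged, while the plain MSE becomes dominated by the loud frames. In other words, the loss measures relative rather than absolute error per frame.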

< Dividing the Complex Spectrogram into Subbands >

Overlap Loss
- In the multi-band structure, each subband is predicted independently, so potential inconsistencies between subbands can arise
- RFWave introduces the Overlap loss to mitigate these inconsistencies
  1. Overlaps between subbands are maintained when dividing the full-band complex spectrogram
  2. The paper uses an 8-dimensional overlap and minimizes the MSE of the overlapped predictions during training
  3. At inference, the overlaps are removed and all subbands are merged to recreate the full-band complex spectrogram
- When predicting each subband, the model maintains consistency between the overlap section and the main section
  - The overlap thus acts as an anchor that maintains consistency between subbands
- Alternatively, instead of predicting all subbands in parallel, modeling higher bands based on lower bands can improve consistency between subbands
  1. BUT, if a lower band is predicted incorrectly at inference, the higher band conditioned on it also becomes inaccurate
  2. RFWave's overlap loss avoids this by maintaining consistency between subbands during training
    - At inference each subband is predicted independently, so an error in one subband does not negatively affect the others
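
The split-with-overlap and merge-without-overlap steps can be sketched as follows. Zero-padding the outer edges of the lowest and highest subbands is an assumption of this sketch (the post does not say how the edges are handled), and the function names are hypothetical.

```python
import numpy as np

def split_with_overlap(feat, n_sb, overlap):
    """[D, F] -> [n_sb, d_sb + 2 * overlap, F], each subband extended on both sides."""
    rows, _ = feat.shape
    d_sb = rows // n_sb
    # Assumption: zero-pad the outer edges so all subbands share one shape.
    padded = np.pad(feat, ((overlap, overlap), (0, 0)))
    return np.stack([
        padded[i * d_sb : i * d_sb + d_sb + 2 * overlap] for i in range(n_sb)
    ])

def merge_without_overlap(subbands, overlap):
    """Drop the overlap rows and concatenate the main sections back to [D, F]."""
    n_rows = subbands.shape[1]
    core = subbands[:, overlap : n_rows - overlap, :]
    return core.reshape(-1, core.shape[-1])

feat = np.arange(64 * 3, dtype=float).reshape(64, 3)   # toy [D, F] feature map
subbands = split_with_overlap(feat, n_sb=4, overlap=8)  # 8-dimensional overlap
merged = merge_without_overlap(subbands, overlap=8)

# Rows that two neighbouring subbands both have to predict during training.
shared_from_lower = subbands[0][-16:]
shared_from_upper = subbands[1][:16]
```

During training the targets of neighbouring subbands agree on the shared rows, so minimizing the MSE there pushes the independent predictions toward each other; at inference only the main sections are kept.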

STFT Loss
- Losses derived from the magnitude spectrogram are commonly used in GAN-based vocoders such as HiFi-GAN and Vocos, but not in diffusion-based vocoders
  - They are not compatible with the formulation of noise-prediction diffusion models
- RFWave therefore adopts an STFT loss
  1. Following (Eq. 5), the model output can be viewed as an approximation of the velocity:
    (Eq. 6) $v(X_{t},t|\mathcal{C})\approx\frac{\text{d}}{\text{d}t}X_{t}=X_{1}-X_{0}$
  2. An approximation of $X_{1}$ at time $t$ is then:
    (Eq. 7) $\tilde{X}_{1}\approx X_{0}+v(X_{t},t|\mathcal{C})$
- The STFT loss can be applied to the approximation $\tilde{X}_{1}$, and can incorporate a spectral convergence loss and a log-scale STFT-magnitude loss
  - As a result, the STFT loss effectively reduces artifacts when background noise is present
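
A single-resolution STFT loss on $\tilde{X}_{1}$ from (Eq. 7) can be sketched in NumPy as below. The FFT size, hop size, Hann window, `eps`, and the equal weighting of the two terms are all assumptions made for illustration, and the hand-rolled `stft_magnitude` stands in for whatever STFT implementation the paper uses.

```python
import numpy as np

def stft_magnitude(x, n_fft=256, hop=64):
    """Magnitude STFT of a 1-D signal, shape [n_frames, n_fft // 2 + 1]."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop : i * hop + n_fft] * window for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1))

def stft_loss(x1_hat, x1, eps=1e-7):
    """Spectral convergence + log-magnitude L1 (assumed equal weights)."""
    s_hat, s = stft_magnitude(x1_hat), stft_magnitude(x1)
    spectral_convergence = np.linalg.norm(s - s_hat) / (np.linalg.norm(s) + eps)
    log_magnitude = np.mean(np.abs(np.log(s + eps) - np.log(s_hat + eps)))
    return spectral_convergence + log_magnitude

def estimate_x1(x0, v_pred):
    """Eq. 7: approximate X_1 from the noise sample and the predicted velocity."""
    return x0 + v_pred

rng = np.random.default_rng(0)
time_axis = np.arange(2048) / 16000.0
x1 = np.sin(2 * np.pi * 440.0 * time_axis)      # toy waveform
x0 = rng.standard_normal(2048)                  # noise of identical shape

v_exact = x1 - x0                               # ideal velocity (Eq. 6)
v_noisy = v_exact + 0.05 * rng.standard_normal(2048)

loss_exact = stft_loss(estimate_x1(x0, v_exact), x1)
loss_noisy = stft_loss(estimate_x1(x0, v_noisy), x1)
```

Because the model predicts a velocity instead of noise, $\tilde{X}_{1}$ is available at every training step, which is what makes a magnitude-based loss applicable here.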

- Selecting Time Points for Euler Method

Normally the Euler method uses equal step intervals as in (Eq. 4), but RFWave adopts Equal Straightness, which selects the sampling time points based on the straightness of the transport trajectory
- The time points are chosen so that straightness increases by the same amount at every step
- The straightness of the learned velocity field $v$ is defined as $S(v)=\int_{0}^{1}\mathbb{E}||(X_{1}-X_{0})-v(X_{t},t|\mathcal{C})||^{2}\text{d}t$
  - It measures how far the velocity deviates from the straight trajectory, so a smaller $S(v)$ indicates a straighter trajectory
- This keeps the difficulty of each Euler step consistent, so the model takes more steps in challenging regions
  - As a result, it performs better than the equal interval approach for the same number of sampling steps
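
One way to turn this into code is to integrate an estimate of the integrand of $S(v)$ over a time grid and pick the time points at equal increments of the cumulative value. This is a sketch under assumptions: the per-time deviation values `s_vals` are a made-up profile (in practice they would be estimated from a trained model), they are assumed to be strictly positive, and the function names are hypothetical.

```python
import numpy as np

def equal_straightness_times(t_grid, s_vals, n_steps):
    """Pick n_steps + 1 time points so each step covers an equal share of S(v)."""
    # Cumulative integral of the deviation via the trapezoidal rule.
    increments = 0.5 * (s_vals[1:] + s_vals[:-1]) * np.diff(t_grid)
    cumulative = np.concatenate([[0.0], np.cumsum(increments)])
    targets = np.linspace(0.0, cumulative[-1], n_steps + 1)
    # Invert the cumulative curve (requires s_vals > 0 so it is increasing).
    return np.interp(targets, cumulative, t_grid)

def euler_sample(velocity_fn, z0, times):
    """Forward Euler over an arbitrary increasing sequence of time points."""
    z = z0.copy()
    for t_cur, t_next in zip(times[:-1], times[1:]):
        z = z + (t_next - t_cur) * velocity_fn(z, t_cur)
    return z

t_grid = np.linspace(0.0, 1.0, 101)
s_vals = 1.0 + 9.0 * t_grid                     # assumed: deviation grows toward t = 1
times = equal_straightness_times(t_grid, s_vals, n_steps=10)

uniform_times = equal_straightness_times(t_grid, np.ones_like(t_grid), n_steps=10)
```

With a constant deviation profile the result reduces to the equal-interval schedule of (Eq. 4); with the toy profile above the steps get shorter toward $t=1$, where the trajectory is assumed to be less straight.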

4. Experiments

- Settings

Dataset : LibriTTS, MTG-Jamendo, OpenCPop
Comparisons : PriorGrad, FreGrad, Vocos, BigVGAN

- Results

Comparison with Diffusion-based Method
- RFWave achieves the best performance compared with PriorGrad and FreGrad

Comparison with GAN-based Method
- RFWave also outperforms BigVGAN and Vocos

It also performs robustly on MUSDB18, an out-of-domain dataset

Discrete EnCodec Tokens Input
- RFWave also achieves the best reconstruction performance with EnCodec tokens as input

Ablation Study
- Regarding the loss functions, performance improves each time one of the proposed losses is added

Rectified Flow generates high-quality audio more efficiently than DDPM
- In particular, the ConvNeXt-V2 backbone improves both audio quality and efficiency
- Increasing the backbone size further improves quality but reduces efficiency

Inference Speed
- RFWave is more than 2x faster than BigVGAN and uses less GPU memory
