[Paper Review] Wave-Trainer-Fit: Neural Vocoder with Trainable Prior and Fixed-Point Iteration Towards High-Quality Speech Generation from SSL Features

feVeRin 2026. 3. 4. 13:15

Wave-Trainer-Fit: Neural Vocoder with Trainable Prior and Fixed-Point Iteration Towards High-Quality Speech Generation from SSL Features

Enables high-quality waveform generation from data-driven features such as Self-Supervised Learning (SSL) features
WaveTrainerFit
- Introduces a trainable prior so that inference starts from noise close to the target speech
- Imposes a constraint on the trainable prior through reference-aware gain adjustment
Paper (ICASSP 2026) : Paper Link

1. Introduction

Neural vocoders mainly generate speech waveforms from mel-spectrograms
- BUT, recent speech synthesis systems increasingly rely on data-driven features from Self-Supervised Learning (SSL) models such as WavLM, XLS-R, and HuBERT
- Notably, WaveFit combines a Generative Adversarial Network (GAN) with a diffusion model and generates good waveforms from SSL features as well as mel-spectrograms
  - BUT, WaveFit is designed around mel-spectrogram input, so there is room for improvement on SSL feature input

-> The paper therefore proposes WaveTrainerFit, which supports effective waveform generation from SSL features

WaveTrainerFit
- Introduces a Variational AutoEncoder (VAE)-based trainable prior so that sampling starts from noise close to the target waveform
- Imposes an energy constraint on the prior to support gain adjustment

< Overview of WaveTrainerFit >

A neural vocoder that synthesizes speech from SSL features using a trainable prior and fixed-point iteration
As a result, it outperforms existing vocoders

2. Preliminary

- WaveFit: Neural Vocoder with Fixed-Point Iteration

WaveFit is an iterative-style non-autoregressive neural vocoder that combines the iterative processing of diffusion models with a GAN-based loss
- Specifically, it generates a speech waveform $\mathbf{y}_{0}\in\mathbb{R}^{D}$ from Gaussian noise $\mathbf{y}_{T}\in\mathbb{R}^{D}\sim \mathcal{N}(0,I)$ and an SSL feature $\mathbf{c}$ through $T$ denoising mappings:
  (Eq. 1) $ \mathbf{y}_{t-1}=\hat{\mathcal{G}}(\mathbf{z}_{t}),\,\,\, \mathbf{z}_{t}=\mathbf{y}_{t}-\mathcal{F}_{\theta}(\mathbf{y}_{t},\mathbf{c},t)$
  (Eq. 2) $\hat{\mathcal{G}}(\mathbf{z}_{t})=\beta_{scale}\cdot\mathbf{z}_{t}/\max(\text{abs}(\mathbf{z}_{t}))$
  - $D$ : number of time-domain samples, $\mathcal{F}_{\theta}$ : DNN that estimates the noise component, $\beta_{scale}$ : scaling factor
  - $\text{abs}(\cdot)$ : element-wise absolute value of the input vector, $\hat{\mathcal{G}}(\mathbf{z}_{t})$ : self-gain adjustment operator
- The loss function $\mathcal{L}^{WF}$ is then:
  (Eq. 3) $\mathcal{L}^{WF}=\frac{1}{T}\sum_{t=0}^{T-1}\left[\mathcal{L}_{G}^{gan}(\mathbf{x}_{0},\mathbf{y}_{t}) +\mathcal{L}_{D}^{gan}(\mathbf{x}_{0},\mathbf{y}_{t}) +\lambda_{S}\mathcal{L}^{S}(\mathbf{x}_{0},\mathbf{y}_{t})\right]$
  - $\mathbf{x}_{0}$ : target waveform, $\mathcal{L}_{G}^{gan}, \mathcal{L}_{D}^{gan}$ : loss functions of the generator and discriminator, $\lambda_{S}$ : weight parameter, $\mathcal{L}^{S}$ : multi-resolution STFT loss
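The iteration in (Eq. 1) and (Eq. 2) can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: `toy_denoiser` is a hypothetical stand-in for the trained DNN $\mathcal{F}_{\theta}$ (it treats the conditioning `c` as a waveform-shaped target, which a real SSL feature is not), and the small `eps` guard in the gain adjustment is not part of (Eq. 2).

```python
import numpy as np

def self_gain_adjust(z, beta_scale=0.9, eps=1e-8):
    """Eq. 2: peak-normalize z so that its largest absolute value is beta_scale.
    eps is an added guard against division by zero (assumption, not in Eq. 2)."""
    return beta_scale * z / (np.max(np.abs(z)) + eps)

def toy_denoiser(y, c, t):
    """Hypothetical stand-in for F_theta: returns half of (y - c) as the 'noise' estimate."""
    return 0.5 * (y - c)

def wavefit_inference(y_T, c, T, denoiser, beta_scale=0.9):
    """Eq. 1: apply T denoising mappings, each followed by gain adjustment."""
    y = y_T
    for t in range(T, 0, -1):
        z = y - denoiser(y, c, t)            # z_t = y_t - F_theta(y_t, c, t)
        y = self_gain_adjust(z, beta_scale)  # y_{t-1} = G_hat(z_t)
    return y

# Usage: start from Gaussian noise and refine for T = 3 steps
rng = np.random.default_rng(0)
D = 1024
c = 0.5 * np.sin(2 * np.pi * 5 * np.arange(D) / D)  # toy waveform-shaped conditioning
y_0 = wavefit_inference(rng.standard_normal(D), c, T=3, denoiser=toy_denoiser)
```

Note that the peak of every intermediate output is pinned to `beta_scale` regardless of the input, which is why this operator needs no reference signal.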

- RestoreGrad: Diffusion Model with Trainable Prior

RestoreGrad combines a diffusion model with a trainable prior modeled by a VAE
- First, RestoreGrad minimizes the Kullback-Leibler (KL) divergence between the posterior distribution $\mathcal{N}(0,\Sigma_{post})$ and the prior distribution $\mathcal{N}(0,\Sigma_{prior})$:
  (Eq. 4) $\mathcal{L}^{PM}(\Sigma_{post},\Sigma_{prior}) =\log\frac{|\Sigma_{prior}|}{|\Sigma_{post}|} +\text{tr}(\Sigma^{-1}_{prior}\Sigma_{post})$
  - $\Sigma_{prior},\Sigma_{post}$ : covariance matrices produced by the prior encoder $\mathcal{V}_{prior}(\mathbf{c})$ and the posterior encoder $\mathcal{V}_{post}(\mathbf{c},\mathbf{x}_{0})$
  - $\mathbf{x}_{0}$ : target waveform, $\mathbf{c}$ : conditional feature
- In addition, an extra loss term $\mathcal{L}^{LR}(\mathbf{x}_{0}, \Sigma_{post})$ is introduced so that the posterior encoder learns an informative representation:
  (Eq. 5) $\mathcal{L}^{LR}(\mathbf{x}_{0},\Sigma_{post})=\log |\Sigma_{post}|+\bar{\alpha}_{T}\mathbf{x}_{0}^{\top}\Sigma_{post}^{-1}\mathbf{x}_{0}$
  - $\bar{\alpha}_{T}$ : weight based on the variance schedule
  - The first term is a regularization term that prevents training collapse, and the second term guides $\Sigma_{post}$ to have the same power as the target waveform $\mathbf{x}_{0}$
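To make (Eq. 4) and (Eq. 5) concrete, here is a small NumPy sketch. It assumes diagonal covariance matrices stored as per-sample variance vectors (an assumption made here for simplicity, the text above does not state the covariance structure), so the determinant becomes a sum of logs and the trace becomes a sum of ratios. The function names are hypothetical.

```python
import numpy as np

def prior_matching_loss(var_post, var_prior):
    """Eq. 4 with diagonal covariances: log|S_prior|/|S_post| + tr(S_prior^{-1} S_post)."""
    log_det_ratio = np.sum(np.log(var_prior) - np.log(var_post))
    trace_term = np.sum(var_post / var_prior)
    return log_det_ratio + trace_term

def latent_regularization_loss(x0, var_post, alpha_bar_T):
    """Eq. 5 with diagonal covariance: log|S_post| + alpha_bar_T * x0^T S_post^{-1} x0."""
    return np.sum(np.log(var_post)) + alpha_bar_T * np.sum(x0 ** 2 / var_post)

# Usage with toy values
x0 = np.linspace(0.1, 1.0, 8)        # toy target waveform
alpha_bar_T = 0.5                    # hypothetical schedule weight
var_post = alpha_bar_T * x0 ** 2     # per-sample minimizer of Eq. 5
loss_pm = prior_matching_loss(var_post, var_post)
loss_lr = latent_regularization_loss(x0, var_post, alpha_bar_T)
```

Under the diagonal assumption, each dimension of (Eq. 5) has the form $\log v + \bar{\alpha}_{T}x^{2}/v$, which is minimized at $v=\bar{\alpha}_{T}x^{2}$. This is one way to see why the second term ties the posterior variance to the power of the target waveform. Likewise, (Eq. 4) reaches its minimum value $D$ when the two variance vectors are equal.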

3. Method

- Motivation

The paper adopts the trainable prior of RestoreGrad to address the noise sampling problem of WaveFit, which draws its initial noise from a fixed Gaussian
- In particular, imposing a constraint on the trainable prior based on the energy of the target speech enables reference-aware gain adjustment
- This lets WaveTrainerFit reduce the difficulty of waveform modeling from data-driven features and generate high-quality waveforms with fewer inference steps

- Model Overview

The paper introduces a prior encoder and a posterior encoder for trainable initial noise sampling
- The conditional input $\mathbf{c}$ is the SSL feature upsampled by a transposed 2D convolution layer, and this $\mathbf{c}$ is fed to the posterior encoder, the prior encoder, and the WaveFit DNN
- During training, the initial noise is sampled from the posterior distribution $\mathcal{N}(0,\Sigma_{post})$ conditioned on the SSL feature and the target waveform
  - During inference, it is sampled from the prior distribution $\mathcal{N}(0,\Sigma_{prior})$ conditioned on the SSL feature only

- Noise Sampling in Time-Frequency Domain

The paper incorporates the trainable prior in the time-frequency domain to shorten the sequence length and reduce modeling complexity
- For frequency bin size $F$ and number of frames $K$, the variance feature $\Sigma$ obtained from the posterior and prior encoders is reshaped to $\mathbb{R}^{F\times K}$
- The time-domain initial noise $\mathbf{y}_{T}=\mathcal{S}(\Sigma)\in\mathbb{R}^{D}$ is then sampled as:
  (Eq. 6) $\mathcal{S}(\Sigma)=\text{iSTFT}(\mathcal{R}(\mathbf{N})\odot \Sigma+i\mathcal{I}(\mathbf{N})\odot \Sigma)$
  (Eq. 7) $\mathbf{N}=\text{STFT}(\epsilon)\in\mathbb{C}^{F\times K},\,\,\, \epsilon \in\mathbb{R}^{D}\sim \mathcal{N}(0,I)$
  - $\mathcal{R}(\cdot),\mathcal{I}(\cdot)$ : operators that extract the real/imaginary parts
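The sampling in (Eq. 6) and (Eq. 7) can be sketched as follows. To stay dependency-free, this sketch replaces a windowed, overlapping STFT with a deliberately simplified one: non-overlapping frames with a rectangular window, so that the inverse is exact. That simplification, the function names, and the `n_fft` value are assumptions made here and are not taken from the paper. $\Sigma$ multiplies the real and imaginary parts directly, as written in (Eq. 6).

```python
import numpy as np

def stft_frames(x, n_fft):
    """Simplified STFT: non-overlapping rectangular frames -> spectrogram of shape (F, K)."""
    frames = x.reshape(-1, n_fft)            # (K, n_fft)
    return np.fft.rfft(frames, axis=1).T     # (F, K) with F = n_fft // 2 + 1

def istft_frames(spec, n_fft):
    """Exact inverse of stft_frames."""
    frames = np.fft.irfft(spec.T, n=n_fft, axis=1)  # (K, n_fft)
    return frames.reshape(-1)

def sample_initial_noise(sigma, n_fft, rng):
    """Eq. 6-7: shape white Gaussian noise with sigma in the time-frequency domain."""
    F, K = sigma.shape
    if F != n_fft // 2 + 1:
        raise ValueError("sigma must have n_fft // 2 + 1 frequency bins")
    eps = rng.standard_normal(K * n_fft)              # epsilon ~ N(0, I)
    N = stft_frames(eps, n_fft)                       # Eq. 7
    shaped = N.real * sigma + 1j * (N.imag * sigma)   # argument of iSTFT in Eq. 6
    return istft_frames(shaped, n_fft), eps

# Usage: an all-ones sigma leaves the white noise unchanged
rng = np.random.default_rng(0)
y_T, eps = sample_initial_noise(np.ones((65, 10)), n_fft=128, rng=rng)
```

With this layout the encoders only need to predict $F\times K$ values instead of $D$ per-sample variances, which is the sequence-length reduction described above.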

- Loss Function and Gain Adjustment

The loss function of WaveTrainerFit is:
(Eq. 8) $\mathcal{L}^{TrainerFit}=\mathcal{L}^{WF}+\lambda_{PM}\mathcal{L}^{PM}+\mathcal{L}^{Guide}$
The first term of (Eq. 8) is the same loss function as (Eq. 3) and handles the diffusion and GAN training
- The second term extends the loss of (Eq. 4) to the time-frequency domain and minimizes the KL divergence between the prior and posterior encoder outputs
  - $\lambda_{PM}$ : weight parameter
- The third term provides guidance to the posterior encoder output $\Sigma_{post}$:
  (Eq. 9) $\mathcal{L}^{Guide}=\left|\mathcal{E}(\Sigma_{post})-\mathcal{E}(|\mathbf{X}_{0}|^{2})\right| + \frac{\lambda_{Guide}}{FK}\sum_{f=0}^{F-1}\sum_{k=0}^{K-1}\frac{\Sigma_{post}[f,k]}{|\mathbf{X}_{0}|^{2}[f,k]}$
  - $\lambda_{Guide}$ : weight parameter, $|\mathbf{X}_{0}|^{2}\in\mathbb{R}^{F\times K}$ : power spectrogram of the target waveform
  - $f,k$ : frequency and time indices, $\mathcal{E}(\cdot)$ : summation over all elements
- The first term of (Eq. 9) matches the energy of the posterior encoder output to that of the target speech, so that $\Sigma_{post}$ has energy close to the target waveform
  - The second term is obtained by extending the second term of (Eq. 5) to a 2D signal, and guides posterior learning by softly reflecting the power of the target spectrogram
  - As a result, the reference-aware gain adjustment operator is obtained as:
    (Eq. 10) $\mathcal{G}_{ssl}(\mathbf{z}_{t},\Sigma)=\sqrt{\left(\mathcal{E}(\Sigma)/\left(\mathcal{E}\left( |\mathbf{z}_{t}|^{2}\right)+s\right)\right)}\mathbf{z}_{t}$
    - $s$ : scalar that prevents division by zero
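A minimal NumPy sketch of the operator in (Eq. 10) is shown below. It assumes that $\mathbf{z}_{t}$ is a time-domain signal whose squared samples are summed by $\mathcal{E}(\cdot)$, that `sigma` is a non-negative array of shape $(F, K)$, and that `s=1e-8` is a reasonable guard value. None of these details are confirmed by the text above, and the function name is hypothetical.

```python
import numpy as np

def reference_aware_gain_adjust(z, sigma, s=1e-8):
    """Eq. 10: rescale z so that its summed power matches the summed energy of sigma."""
    target_energy = np.sum(sigma)             # E(Sigma)
    current_energy = np.sum(np.abs(z) ** 2)   # E(|z_t|^2)
    return np.sqrt(target_energy / (current_energy + s)) * z

# Usage: the output energy follows sigma, not the scale of the input
rng = np.random.default_rng(0)
z = rng.standard_normal(2048)
sigma = np.full((65, 16), 0.25)   # toy prior output with total energy 260
y = reference_aware_gain_adjust(z, sigma)
```

Unlike the peak normalization in (Eq. 2), which maps every signal to the same fixed amplitude, the output level here is set by the energy predicted by the prior, which is the sense in which the adjustment is reference-aware.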

4. Experiments

- Settings

Dataset : LibriTTS
Comparisons : HiFi-GAN, WaveFit

- Results

Overall, WaveTrainerFit achieves the best performance

Performance at Each Iteration and Processing Speed
- WaveTrainerFit achieves good reconstruction with just 1 iteration

Performance for SSL Features from Different Layers
- It also shows strong reconstruction performance on features from each layer of WavLM
