[Paper Review] ConSinger: Efficient High-Fidelity Singing Voice Generation with Minimal Steps

feVeRin 2025. 5. 2. 17:18

ConSinger: Efficient High-Fidelity Singing Voice Generation with Minimal Steps

Singing voice synthesis with diffusion models produces high-quality samples, but inference is slow
ConSinger
- Adopts a Consistency Model to perform singing voice synthesis in a minimal number of steps
- In particular, applies a consistency constraint during training
Paper (ICASSP 2025) : Paper Link

1. Introduction

Singing Voice Synthesis (SVS) aims to generate emotionally realistic human audio
- A two-stage approach can be used: an acoustic model that interprets the music score into acoustic features, and a vocoder that converts the generated features into an audio waveform
- In particular, adopting a Generative Adversarial Network (GAN) as in RefineGAN, or a Denoising Diffusion Probabilistic Model (DDPM) as in DiffSinger, yields high-quality singing voice
  - BUT, these approaches suffer from unstable training and slow inference
- Meanwhile, consistency models such as CoMoSpeech and CM-TTS can balance high-speed generation against sampling quality
  - BUT, they are mostly used only for text-to-speech (TTS) tasks and carry the burden of distillation

-> Therefore, the paper proposes ConSinger, a consistency model for the SVS task

ConSinger
- Optimizes the consistency loss using only a single training network, without a teacher model
- Further improves singing voice quality based on the Shallow Diffusion Mechanism

< Overview of ConSinger >

A consistency model that enables real-time SVS with minimal steps
As a result, it achieves better synthesis quality and speed than existing methods

2. Background

Consistency models can be considered to address the limitations of diffusion models
- First, a consistency model is obtained from the following conditions on the Probability Flow Ordinary Differential Equation (PF-ODE):
  (Eq. 1) $f(\mathbf{x}_{\epsilon},\epsilon )=\mathbf{x}_{\epsilon},\quad f(\mathbf{x}_{t},t)=f(\mathbf{x}_{t'},t')$
  - Following EDM, the starting point of the PF-ODE is set to a small positive number $\epsilon$
- A skip connection can then be applied to satisfy both conditions in (Eq. 1):
  (Eq. 2) $f_{\theta}(\mathbf{x}_{t},t)=c_{skip}(t)\mathbf{x}_{t}+c_{out}(t)F_{\theta}(\mathbf{x}_{t},t)$
  - $F_{\theta}(\cdot, \cdot)$ : neural network, $f_{\theta}(\cdot,\cdot)$ : final output of the model
  - $c_{skip}, c_{out}$ : differentiable functions with $c_{skip}(\epsilon)=1$, $c_{out}(\epsilon)=0$
- Here:
  (Eq. 3) $c_{skip}(t)=\frac{\sigma_{data}^{2}}{(t-\epsilon )^{2}+\sigma_{data}^{2}},\quad c_{out}(t)=\frac{\sigma_{data}(t-\epsilon )}{\sqrt{\sigma_{data}^{2}+t^{2}}}$
  - $\sigma_{data}$ : balance parameter
- Points on the PF-ODE trajectory are sampled as:
  (Eq. 4) $\mathbf{x}_{t_{n}}=\mathbf{x}_{\epsilon}+t_{n}\mathbf{z}$
  - $\mathbf{z}\sim\mathcal{N}(0,I)$, $t_{n}$ : time step
- Here $t_{n}$ can be viewed as the noise level and is obtained as:
  (Eq. 5) $t_{n}=\left[\epsilon ^{\frac{1}{\rho}}+\frac{n-1}{N-1}(T^{\frac{1}{\rho}}-\epsilon ^{\frac{1}{\rho}})\right]^{\rho}$
  - $n\in\{1,\dots,N\}$ : step index, $T$ : maximum noise level, $\rho$ : parameter controlling the spacing of the steps
- As a result, every point $p_{t}(\mathbf{x})$ along the PF-ODE sampling trajectory is directly associated with the original data distribution point $p_{0}(\mathbf{x})$, so one-step generation is possible
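
The coefficients in (Eq. 3) and the time steps in (Eq. 5) can be checked numerically. The following is a minimal sketch, not the paper's code: the values `EPS`, `T_MAX`, `RHO`, and `SIGMA_DATA` are assumptions (the post does not state them; $\sigma_{data}=0.5$ is chosen because it matches $c_{out}$ approaching $0.5$), and the function names are hypothetical.

```python
import math

EPS = 0.002        # assumed start of the PF-ODE (not stated in the post)
T_MAX = 80.0       # assumed maximum noise level T
RHO = 7.0          # assumed schedule exponent rho
SIGMA_DATA = 0.5   # assumed balance parameter sigma_data


def c_skip(t, eps=EPS, sigma_data=SIGMA_DATA):
    """Weight of the noisy input x_t in (Eq. 2), as defined in (Eq. 3)."""
    return sigma_data ** 2 / ((t - eps) ** 2 + sigma_data ** 2)


def c_out(t, eps=EPS, sigma_data=SIGMA_DATA):
    """Weight of the network output F_theta in (Eq. 2), as defined in (Eq. 3)."""
    return sigma_data * (t - eps) / math.sqrt(sigma_data ** 2 + t ** 2)


def time_steps(num_steps, eps=EPS, t_max=T_MAX, rho=RHO):
    """Time steps t_1..t_N from (Eq. 5); requires num_steps >= 2."""
    lo, hi = eps ** (1 / rho), t_max ** (1 / rho)
    return [(lo + (n - 1) / (num_steps - 1) * (hi - lo)) ** rho
            for n in range(1, num_steps + 1)]


def consistency_output(x_t, t, network_out):
    """Skip-connection parameterization of (Eq. 2), applied element-wise."""
    return [c_skip(t) * x + c_out(t) * f for x, f in zip(x_t, network_out)]


if __name__ == "__main__":
    for t in time_steps(5):
        print(f"t={t:8.4f}  c_skip={c_skip(t):.4f}  c_out={c_out(t):.4f}")
```

At $t=\epsilon$ the output equals the input regardless of the network, which is the first condition of (Eq. 1); as $t$ grows, the weight shifts from the input to the network output.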

3. Method

- Model Architecture

Encoder
- The paper converts the music score into a score condition sequence $C_{m}$ based on the encoder structure of DiffSinger
- The lyrics encoder and $N$ Feed-Forward Transformer blocks convert phoneme IDs into a linguistic sequence
  1. The pitch embedding sequence is generated from pitch IDs by the pitch encoder
  2. The duration predictor projects the linguistic sequence onto a sequence in the mel-spectrogram domain
  3. Finally, the encoder binds the linguistic and pitch sequences into the music score condition sequence $C_{m}$
Supplementary Decoder
- The paper uses the mel-spectrogram decoder of FastSpeech2 as the supplementary decoder
- Specifically, the decoder is built on the Feed-Forward Transformer
  1. Each layer has a self-attention sublayer block and a convolutional sublayer block
  2. Each sublayer includes a residual connection, layer normalization, and dropout
CM-Denoiser
- The CM-Denoiser restores the ground-truth mel-spectrogram from Gaussian noise
- It adopts the non-causal WaveNet used in DiffWave
Scorer
- To obtain the optimal denoising level $op$, the scorer compares a few reconstruction samples against reference ground-truth samples during training
- For this, the paper uses the Fréchet Audio Distance (FAD) as the reference score
Time Step Processing
- A sinusoidal position embedding transforms the time step $t$ into a continuous hidden condition $C_{t}$
Vocoder
- In the final stage, a vocoder converts the mel-spectrogram generated by the CM-Denoiser into a perceptible waveform
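
The Time Step Processing component above can be illustrated with a small sketch. The post only says that a sinusoidal position embedding is used; the embedding size `dim`, the `max_period` constant, and the sin/cos concatenation layout below are assumptions borrowed from the common Transformer-style formulation, and `sinusoidal_embedding` is a hypothetical name.

```python
import math


def sinusoidal_embedding(t, dim=8, max_period=10000.0):
    """Map a scalar time step t to a dim-sized vector of sines and cosines.

    Frequencies decay geometrically from 1 down towards 1/max_period
    (assumed layout: all sines first, then all cosines).
    """
    half = dim // 2
    freqs = [math.exp(-math.log(max_period) * i / half) for i in range(half)]
    return [math.sin(t * f) for f in freqs] + [math.cos(t * f) for f in freqs]


if __name__ == "__main__":
    print(sinusoidal_embedding(0.5))
```

In a full model this vector would typically pass through a small MLP before being used as $C_{t}$; that step is omitted here.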

- Initial Version

During training, ConSinger predicts the ground-truth $\mathbf{x}$ from the $t$-level noised mel-spectrogram $\mathbf{x}_{t}$, conditioned on $C_{t}$ and $C_{m}$
- The loss function is then:
  (Eq. 6) $\mathcal{L}(\theta)=||\mathbf{x}-f_{\theta}(\mathbf{x}_{n},m,t_{n})||^{2}$
- Importance Sampler
  1. The paper introduces a sampler to select the time step $t_{n}$ of (Eq. 5)
  2. Its formulation is $h_{n}=(1-\lambda)\frac{L(n)}{\sum_{i=2}^{N}L(i)}+\lambda$
    - Loss table $L(\cdot)$ : records the cumulative average loss at each point and guides the selection of sampling points
    - $\lambda$ : equilibrium parameter (adjusts the balance between random and importance sampling)
  3. As a result, the model first collects 10 samples at each sampling point and then reduces randomness through the importance sampler
- At inference, a sample is drawn from the $T$-level Gaussian noise distribution $\mathcal{N}(0,T^{2}I)$, and the ground-truth mel-spectrogram is then predicted from it
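
The importance sampler formula above can be sketched as follows. This is an illustration under assumptions, not the paper's implementation: the loss table is a plain dict keyed by the point index $n=2,\dots,N$, the default `lam` value is arbitrary, and the weights $h_{n}$ are treated as relative weights that `random.choices` normalizes (the formula as written does not sum to 1). The function names are hypothetical.

```python
import random


def sampling_weights(loss_table, lam=0.5):
    """h_n = (1 - lam) * L(n) / sum_i L(i) + lam for every recorded point n."""
    total = sum(loss_table.values())
    return {n: (1 - lam) * loss / total + lam for n, loss in loss_table.items()}


def pick_time_index(loss_table, lam=0.5, rng=random):
    """Draw one point index n, favouring points with a high average loss."""
    weights = sampling_weights(loss_table, lam)
    indices = list(weights)
    # random.choices accepts relative weights and normalizes them internally
    return rng.choices(indices, weights=[weights[n] for n in indices], k=1)[0]


if __name__ == "__main__":
    table = {2: 0.10, 3: 0.40, 4: 0.50}   # toy cumulative average losses
    print(sampling_weights(table, lam=0.2))
    print(pick_time_index(table, lam=0.2, rng=random.Random(0)))
```

With `lam=1` every point gets the same weight (pure random sampling), and with `lam=0` the weights are proportional to the recorded loss (pure importance sampling).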

- Use Supplementary Decoder

Using standard Gaussian noise scaled by $T$ as the starting point for one-step restoration is not suitable
- Therefore, following the Shallow Diffusion Mechanism, ConSinger is given more prior knowledge
- First, given a data sample $\mathbf{x}$ and $\tilde{\mathbf{x}}$ generated by the supplementary decoder, the conditional distributions of $\mathbf{x},\tilde{\mathbf{x}}$ are:
  (Eq. 7) $q(\mathbf{x}_{t}|\mathbf{x})=\mathcal{N}(\mathbf{x}_{t};\mathbf{x},t^{2}I),\quad q(\tilde{\mathbf{x}}_{t}|\tilde{\mathbf{x}})=\mathcal{N}(\tilde{\mathbf{x}}_{t};\tilde{\mathbf{x}},t^{2}I)$
- The KL-Divergence between the two Gaussian distributions is:
  (Eq. 8) $D_{KL}(\mathcal{N}(\mathbf{x}_{t}) || \mathcal{N}(\tilde{\mathbf{x}}_{t}))=\frac{1}{2}\left[ tr\left(\tilde{\Sigma}^{-1}\Sigma\right)+ (\tilde{\mu}-\mu)^{\top}\tilde{\Sigma}^{-1}(\tilde{\mu}-\mu)-d+\ln \left(\frac{\det\tilde{\Sigma}}{\det\Sigma}\right)\right]=\frac{||\mathbf{x}-\tilde{\mathbf{x}}||^{2}_{2}}{2t^{2}}$
  - Both covariances equal $t^{2}I$, so the trace term equals $d$, the log-determinant term vanishes, and only the mean term remains
- Here, with $\tilde{\mu},\mu$ the means, $\tilde{\Sigma},\Sigma$ the covariance matrices, and $d$ the dimension:
  (Eq. 10) $\mathbb{E}_{\mathbf{x}\in\mathcal{D}}\left[D_{KL}(\mathcal{N}(\mathbf{x}_{k})|| \mathcal{N}(\tilde{\mathbf{x}}_{k}))\right] =\mathbb{E}_{\mathbf{x}\in\mathcal{D}}\left[\frac{1}{2k^{2}}||\mathbf{x}-\tilde{\mathbf{x}}||^{2}_{2}\right]\leq \mathbb{E}_{\mathbf{x}\in\mathcal{D}}\left[ D_{KL}(\mathcal{N}(\mathbf{x}_{T})||\mathcal{N}(0,T^{2}I) )\right]$
  - Consequently, a mel-spectrogram noised to level $k$ is a better restore starting point than the Gaussian distribution $\mathcal{N}(0,T^{2}I)$
  - $k$ is continuously optimized during the training phase
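
The simplification in (Eq. 8) can be verified numerically. The sketch below evaluates the general Gaussian KL formula for isotropic covariances and compares it with the closed form $\frac{||\mathbf{x}-\tilde{\mathbf{x}}||^{2}_{2}}{2t^{2}}$. The function names and the toy vectors are illustrative assumptions, not values from the paper.

```python
import math


def kl_isotropic(mu, mu_tilde, var, var_tilde):
    """KL( N(mu, var*I) || N(mu_tilde, var_tilde*I) ) via the general formula."""
    d = len(mu)
    trace_term = d * var / var_tilde                      # tr(Sigma~^-1 Sigma)
    quad_term = sum((a - b) ** 2 for a, b in zip(mu_tilde, mu)) / var_tilde
    logdet_term = d * math.log(var_tilde / var)           # ln(det Sigma~ / det Sigma)
    return 0.5 * (trace_term + quad_term - d + logdet_term)


def kl_closed_form(x, x_tilde, t):
    """Right-hand side of (Eq. 8): ||x - x~||^2 / (2 t^2)."""
    return sum((a - b) ** 2 for a, b in zip(x, x_tilde)) / (2 * t ** 2)


if __name__ == "__main__":
    x = [1.0, 2.0, 3.0]          # toy "ground-truth" vector
    x_tilde = [1.5, 2.0, 2.0]    # toy supplementary-decoder output
    t = 2.0
    print(kl_isotropic(x, x_tilde, t ** 2, t ** 2), kl_closed_form(x, x_tilde, t))
```

Because the divergence scales with $1/t^{2}$, a larger noise level makes the two noised distributions harder to tell apart, which is the intuition behind starting restoration from a noised $\tilde{\mathbf{x}}$.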

- Use Scorer to Determine the Optimal Point

In the training phase, a scorer is adopted to find the optimal restore point $op$ on the trajectory, which replaces $k$
- Restore quality is not linear in the noise level, for the following reasons
  1. ConSinger outputs a weighted combination of the noisy input and the network output in (Eq. 2)
  2. In (Eq. 3), $c_{skip}$ goes from $1$ to $0$ while $c_{out}$ goes from $0$ to $0.5$
  3. The time step, which can be treated as the noise level, is quasi-exponential in (Eq. 5)
- In the figure below, from time step $2$ to $6$, ConSinger gradually reduces the proportion of the input mel-spectrogram $\tilde{\mathbf{x}}_{t}$
  - This means the impact of the noise output by the supplementary decoder is continuously reduced
- Meanwhile, generation quality can improve as the proportion of the network output increases
  1. When the model's denoising ability is weak (low mixing ratio), the result carries the unknown noise distribution of the supplementary decoder
  2. From $7$ to $15$, the Gaussian noise mixed into $\tilde{\mathbf{x}}$ increases sharply, but the model overly trusts $\tilde{\mathbf{x}}_{t}$, so its reconstruction ability degrades
  3. From $16$ to $37$, the restore result gradually stabilizes as the network output share of the mixed result grows, but the excessive noise in $\tilde{\mathbf{x}}_{t}$ prevents it from reaching the quality of steps $2$ to $6$

(a) Mel-spectrogram Quality and (b) Important Parameters per time step
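
The scorer's selection of $op$ can be sketched as an argmin over candidate time steps. This is a toy illustration only: real FAD compares multivariate Gaussians fitted to embeddings from a pretrained audio model, whereas the stand-in below uses the Fréchet distance between 1-D Gaussians, $(\mu_{a}-\mu_{b})^{2}+(\sigma_{a}-\sigma_{b})^{2}$, on raw sample values. The function names and the toy data are assumptions.

```python
import statistics


def frechet_distance_1d(a, b):
    """Squared Fréchet distance between 1-D Gaussians fitted to samples a and b."""
    mu_a, mu_b = statistics.fmean(a), statistics.fmean(b)
    sd_a, sd_b = statistics.pstdev(a), statistics.pstdev(b)
    return (mu_a - mu_b) ** 2 + (sd_a - sd_b) ** 2


def select_optimal_point(reconstructions, reference):
    """Return (op, scores): the time-step index with the lowest score wins."""
    scores = {n: frechet_distance_1d(x, reference)
              for n, x in reconstructions.items()}
    return min(scores, key=scores.get), scores


if __name__ == "__main__":
    reference = [0.0, 1.0, 2.0, 3.0]
    # toy reconstructions obtained when restoring from time steps 2, 5, and 20
    reconstructions = {
        2: [0.0, 1.0, 2.0, 3.5],
        5: [0.0, 1.0, 2.0, 3.0],
        20: [1.0, 2.0, 3.0, 4.0],
    }
    op, scores = select_optimal_point(reconstructions, reference)
    print(op, scores)
```

Since quality is not monotonic in the noise level, an explicit search of this kind over candidate steps is needed rather than simply picking the smallest or largest step.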

4. Experiments

- Settings

Dataset : PopCS
Comparisons : DiffSinger, FFT-Singer

- Results

Overall, ConSinger achieves the best performance

Ablation Study
- Removing the Importance Sampler (IS) or the consistency constraint (noise) degrades performance
