[Paper Review] SiFiSinger: A High-Fidelity End-to-End Singing Voice Synthesizer based on Source-Filter Model

feVeRin 2024. 5. 3. 10:19

SiFiSinger: A High-Fidelity End-to-End Singing Voice Synthesizer based on Source-Filter Model

The source-filter mechanism can be leveraged for high-fidelity, human-like singing voice synthesis
SiFiSinger
- Uses a training paradigm extended from VITS and integrates components such as a fundamental pitch ($F0$) predictor and a waveform decoder
- Uses mel-cepstrum features to decouple the intertwined mel-spectrogram and $F0$ characteristics
- Introduces a source excitation signal as the $F0$ representation so that pitch nuances are captured more accurately
- Additionally uses differentiable mel-cepstrum and $F0$ losses to improve the prediction accuracy of the speech envelope and pitch of the generated voice
Paper (ICASSP 2024) : Paper Link

1. Introduction

Singing voice synthesis (SVS) aims to synthesize a singing voice from lyrics and a musical score
- SVS systems are generally built as a 2-stage pipeline
  - An acoustic model that extracts lyrical and musical information from the musical score and predicts acoustic features such as a mel-spectrogram
  - A vocoder that converts those features into an audible waveform
- End-to-end approaches such as VISinger, on the other hand, synthesize the waveform directly from the given musical score
  1. VISinger builds on the VITS architecture and uses a variational autoencoder (VAE)-based posterior encoder, a prior encoder, and an adversarial decoder
  2. Its extension, VISinger2, uses a length regulator and an $F0$ predictor to model the frame-level mean and variance of both the posterior and prior distributions during encoding
    - It additionally uses Differentiable Digital Signal Processing (DDSP) to model harmonic and aperiodic components from the latent representation
    - The signal produced by the DDSP synthesizer serves as a conditional input to the adversarial decoder, which eases the text-to-phase problem and improves quality
- BUT, applying these end-to-end approaches to SVS raises several problems
  1. Pitch accuracy matters more than in ordinary text-to-speech (TTS)
    - This is because $F0$ is tied directly to the musical notes rather than to the linguistic lyrics
  2. In the mel-spectrogram, the most commonly used acoustic feature, $F0$ and the spectral envelope are entangled, which leads to error propagation
    - That is, $F0$ modeling becomes biased, which can hurt prediction accuracy
  3. The DDSP synthesizer helps with text-to-phase modeling, but it is hard to use $F0$ information consistently across prior hidden vector prediction and the subsequent audio generation
    - This results in degraded synthesis quality and pitch inaccuracy

-> The paper therefore proposes SiFiSinger, an end-to-end SVS model that uses the source-filter mechanism

SiFiSinger
- In the source-filter model,
  - The source corresponds to the vibration of the vocal cords, which produces the foundational sound or waveform
  - The filter corresponds to the shaping of that sound as it travels through the vocal tract
- Based on this source-filter model, in SiFiSinger,
  1. $F0$ is processed by the source module of the prior encoder to generate multiple harmonics controlled by $F0$
  2. The $F0$ excitation produced by the source module is used as the pitch embedding of the HiFi-GAN generator in the decoder, which ensures pitch control over waveform generation
  3. Additionally, mel-cepstrum ($\text{mcep}$) features are used to capture spectral envelope information decoupled from $F0$ and phase information
    - These can be treated as the filter component of the source-filter model
    - That is, concatenating the source excitation signal generated from $F0$ with the $\text{mcep}$ features turns the acoustic modeling process of the prior/posterior encoders into a neural source-filter model
  4. A differentiable method then re-extracts $\text{mcep}$ and $F0$ from the audio synthesized by the generator, and a loss is computed against the ground truth
    - This enables gradient backpropagation and provides effective, separate supervision of the whole model through the source ($F0$) and the filter ($\text{mcep}$)
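
For reference, the classical source-filter decomposition behind this design can be written as below. This is the textbook formulation, not an equation taken from the paper; the symbols $s$, $e$, and $v$ (speech signal, source excitation, and vocal tract impulse response) are introduced here only for illustration.

  $s(t) = (e * v)(t) \;\Rightarrow\; S(\omega) = E(\omega)V(\omega) \;\Rightarrow\; \log|S(\omega)| = \log|E(\omega)| + \log|V(\omega)|$

In the log-magnitude domain the harmonic fine structure set by $F0$ and the slowly varying envelope add rather than multiply, which is why low-order cepstral features such as $\text{mcep}$ can describe the envelope largely independently of pitch.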

< Overview of SiFiSinger >

Integrates components such as an $F0$ predictor and a waveform decoder on top of a framework extended from VITS
Uses mel-cepstrum features to decouple the intertwined mel-spectrogram and $F0$ characteristics, and applies differentiable losses
As a result, it outperforms prior methods (VISinger2 in the reported experiments)

2. Method

SiFiSinger is similar to VITS and VISinger2, which are based on the conditional VAE framework, and consists of a prior encoder, a posterior encoder, and a waveform decoder

- Source Module

The source module is designed to generate a sinusoidal excitation $e_{1:T} = \{e_{1}, ..., e_{T}\}$ from the $F0$ sequence $f_{1:T}$
- $e_{t}\in \mathbb{R}, t\in \{1,...,T\}$, where $t$ denotes the $t$-th time step
- The sinusoidal excitation $e_{1:T}^{<0>}$ is then generated as:
  (Eq. 1) $e_{t}^{<0>}=\left\{\begin{matrix}
  \alpha\sin\left(\sum_{k=1}^{t}2\pi\frac{f_{k}}{N_{s}}+\phi\right)+n_{t}, & \text{if}\,\, f_{t}>0 \\
  \frac{1}{3\sigma}n_{t}, & \text{if}\,\, f_{t}=0 \\
  \end{matrix}\right.$
  - $n_{t} \sim \mathcal{N}(0,\sigma^{2})$, $\phi$ : random initial phase, $N_{s}$ : sampling rate
  - $\alpha$ : hyperparameter that adjusts the amplitude of the source waveform, $\sigma$ : standard deviation of the Gaussian noise
- The source module also generates harmonic overtones, where the $h$-th overtone corresponds to the $(h+1)$-th harmonic frequency
  - The sinusoidal excitation $e_{1:T}^{<h>}$ is therefore obtained by replacing $f_{t}$ with $(h+1)f_{t}$ in (Eq. 1)
- In the final step, the source module merges $e_{1:T}^{<0>}$ and $e_{1:T}^{<h>}$ with a trainable feed-forward (FF) layer
  - With this $F0$-controlled harmonic generation mechanism, the module keeps the synthesized voice closely aligned with the desired pitch and improves overall SVS quality and naturalness
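
The excitation in (Eq. 1) and the harmonic merge can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation: the function names, the values of `alpha` and `sigma`, the fixed merge weights (standing in for the trainable FF layer), and the choice to draw a separate random phase per harmonic are all assumptions.

```python
import numpy as np

def sine_excitation(f0, sample_rate, harmonic=0, alpha=0.1, sigma=0.003, rng=None):
    """Eq. 1 evaluated at frequency (harmonic + 1) * f0, one value per sample."""
    rng = np.random.default_rng() if rng is None else rng
    f0 = np.asarray(f0, dtype=np.float64)
    # Running phase: sum over k <= t of 2*pi*f_k / N_s (unvoiced steps add 0).
    phase = np.cumsum(2.0 * np.pi * (harmonic + 1) * f0 / sample_rate)
    phi = rng.uniform(-np.pi, np.pi)               # random initial phase
    noise = rng.normal(0.0, sigma, size=f0.shape)  # n_t ~ N(0, sigma^2)
    voiced = alpha * np.sin(phase + phi) + noise
    unvoiced = noise / (3.0 * sigma)
    return np.where(f0 > 0, voiced, unvoiced)

def source_module(f0, sample_rate, weights, bias=0.0, rng=None):
    """Merge e^{<0>}, ..., e^{<H>} with one linear layer (H = len(weights) - 1)."""
    rng = np.random.default_rng() if rng is None else rng
    harmonics = [sine_excitation(f0, sample_rate, h, rng=rng) for h in range(len(weights))]
    return np.stack(harmonics, axis=-1) @ np.asarray(weights) + bias

# 50 ms voiced at 220 Hz followed by 12.5 ms unvoiced, at 16 kHz.
f0 = np.concatenate([np.full(800, 220.0), np.zeros(200)])
excitation = source_module(f0, 16000, weights=[0.6, 0.3, 0.1], rng=np.random.default_rng(0))
print(excitation.shape)  # (1000,)
```

Because the phase is a cumulative sum of the per-sample frequency, the sinusoid stays continuous when $F0$ changes from one step to the next.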

- Prior Encoder

The prior encoder is built from the feed-forward transformer (FFT) blocks and length regulator of FastSpeech
- The prior encoder contains a duration predictor and an acoustic decoder for $F0$ and $\text{mcep}$, both of which take the music score as input
- During training, the $\text{mcep}$ features and $F0$ for the ground-truth durations are produced by the acoustic decoder, which is trained with the following loss $L_{am}$:
  (Eq. 2) $L_{am}=\lambda_{1}MSE(\text{LF0}, \text{LF0}_{pred})+\lambda_{2} || \text{mcep}-\text{mcep}_{pred}||_{1}$
  - $\text{LF0}_{pred}, \text{mcep}_{pred}$ : the predicted $\log\text{-}F0$ and $\text{mcep}$, respectively
  - $\lambda_{1},\lambda_{2}$ : coefficients, $MSE(\cdot)$ : mean squared error loss
- SiFiSinger uses $\text{mcep}$ features to capture audio envelope information without entanglement with $F0$
  - $F0$ is processed by the source module above to provide rapidly oscillating periodic sinusoidal harmonics, and this excitation signal is concatenated with the $\text{mcep}$ features
  - Spectral envelope features and pitch information are thus modeled separately to imitate the pronunciation mechanism of human singing
- The duration predictor predicts phoneme and note durations and computes the duration loss $L_{dur}$
  - At inference, the length regulator uses the duration predictor's output as the length reference
- Finally, based on the $\text{mcep}$ features derived from the music score encoder output and the excitation signal from the acoustic model (AM) source module, the AM decoder predicts the mean and variance of the frame-level prior distribution
  - The prior hidden vector $z$ can then be sampled from this distribution
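
Two pieces of the prior encoder are simple enough to sketch directly: the length regulator, which expands phoneme-level vectors to frame level, and the acoustic model loss of (Eq. 2). The function names and the toy values are hypothetical, and averaging the L1 term over elements is an assumption, since (Eq. 2) only specifies the norm.

```python
import numpy as np

def length_regulate(phoneme_hidden, durations):
    """Repeat each phoneme-level vector for its number of frames."""
    return np.repeat(phoneme_hidden, durations, axis=0)

def acoustic_model_loss(lf0, lf0_pred, mcep, mcep_pred, lambda_1=1.0, lambda_2=1.0):
    """Eq. 2: weighted MSE on log-F0 plus weighted L1 on mel-cepstrum."""
    lf0_mse = np.mean((lf0 - lf0_pred) ** 2)
    mcep_l1 = np.mean(np.abs(mcep - mcep_pred))  # L1 norm, averaged per element
    return lambda_1 * lf0_mse + lambda_2 * mcep_l1

hidden = np.array([[1.0, 2.0], [3.0, 4.0]])  # 2 phonemes, 2 channels
frames = length_regulate(hidden, [2, 1])     # expanded to 3 frames
loss = acoustic_model_loss(
    lf0=np.array([5.0, 5.0, 5.0]), lf0_pred=np.array([5.0, 5.0, 6.5]),
    mcep=np.zeros((3, 4)), mcep_pred=np.full((3, 4), 0.5),
)
```

During training the durations passed to the length regulator would be the ground-truth ones, and at inference the predicted ones, as described above.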

- Posterior Encoder

The posterior encoder follows VISinger2 as its backbone and consists of $N$ 1D convolution layers with LayerNorm
- SiFiSinger feeds the $\text{mcep}$ features and $F0$ to the posterior encoder
  - As in the prior encoder, it predicts the mean and variance of the posterior distribution from the frame-level acoustic features
  - A re-sampling procedure is then applied to obtain the posterior latent vector $z$
- During training, the posterior $z$ and the prior are constrained by the KL divergence loss $L_{kl}$
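
The re-sampling step and the KL constraint can be illustrated with a short NumPy sketch. It assumes that both the posterior and the prior are frame-level diagonal Gaussians, so the KL divergence has a closed form; the review does not say how $L_{kl}$ is actually computed, and the function names are hypothetical.

```python
import numpy as np

def sample_latent(mean, log_var, rng):
    """Reparameterized sample z = mean + std * eps with eps ~ N(0, I)."""
    eps = rng.standard_normal(mean.shape)
    return mean + np.exp(0.5 * log_var) * eps

def kl_diag_gaussians(mean_q, log_var_q, mean_p, log_var_p):
    """KL(q || p) for diagonal Gaussians, summed over channels, averaged over frames."""
    kl = 0.5 * (
        log_var_p - log_var_q
        + (np.exp(log_var_q) + (mean_q - mean_p) ** 2) / np.exp(log_var_p)
        - 1.0
    )
    return kl.sum(axis=-1).mean()

rng = np.random.default_rng(0)
mean_q, log_var_q = np.ones((4, 3)), np.zeros((4, 3))   # posterior: 4 frames x 3 channels
mean_p, log_var_p = np.zeros((4, 3)), np.zeros((4, 3))  # prior
z = sample_latent(mean_q, log_var_q, rng)
l_kl = kl_diag_gaussians(mean_q, log_var_q, mean_p, log_var_p)
```

Writing the sample as mean plus scaled noise keeps the path from the encoder outputs to $z$ differentiable, which is what lets the decoder losses train the encoder.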

- Decoder

The decoder is a HiFi-GAN generator that takes the latent $z$ as input and generates the waveform $\hat{y}$
- A mel-spectrogram is extracted from the synthesized waveform $\hat{y}$ to compute the mel-spectrogram loss $L_{mel}$
- To provide strong pitch information during generation, the decoder uses the excitation signal generated by the source module
  1. The frame-level $F0$ is first upsampled to sample level and then passed through the source module to produce the excitation signal
  2. The excitation is then progressively downsampled and combined with $z$ as it is upsampled by the generator, providing multi-scale pitch information to the waveform generation process
- SiFiSinger adopts adversarial learning with a discriminator $D$ that distinguishes the waveform $\hat{y}$ produced by the generator $G$ from the ground truth $y$:
  (Eq. 3) $L_{adv}(D)=\mathbb{E}_{(y,z)}\left[(D(y)-1)^{2}+(D(G(z)))^{2}\right]$
  (Eq. 4) $L_{adv}(G)=\mathbb{E}_{z}\left[(D(G(z))-1)^{2}\right]$
  (Eq. 5) $L_{fm}(G)=\mathbb{E}_{(y,z)}\left[\sum_{l=1}^{T}\frac{1}{N_{l}}|| D^{l}(y)-D^{l}(G(z))||_{1}\right]$
  - That is, the adversarial objective consists of a least-squares loss and a feature-matching loss
  - $T$ : number of discriminator layers, $l$ : the $l$-th layer of the discriminator, $N_{l}$ : number of features in the $l$-th layer
- The generator loss $L_{G}$ is:
  (Eq. 6) $L_{G}=L_{adv}(G)+\lambda_{mel}L_{mel}+\lambda_{fm}L_{fm}(G)$
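
The losses in (Eq. 3) to (Eq. 6) reduce to a few array operations once the discriminator outputs and intermediate features are available. The sketch below assumes a single batch, takes the discriminator scores and per-layer features as plain arrays, and uses placeholder values of 1.0 for $\lambda_{mel}$ and $\lambda_{fm}$; the function names are hypothetical.

```python
import numpy as np

def discriminator_loss(d_real, d_fake):
    """Eq. 3: push D(y) toward 1 and D(G(z)) toward 0."""
    return np.mean((d_real - 1.0) ** 2 + d_fake ** 2)

def generator_adv_loss(d_fake):
    """Eq. 4: push D(G(z)) toward 1."""
    return np.mean((d_fake - 1.0) ** 2)

def feature_matching_loss(feats_real, feats_fake):
    """Eq. 5: per-layer L1 distance divided by the layer's feature count N_l."""
    return sum(np.abs(r - f).sum() / r.size for r, f in zip(feats_real, feats_fake))

def generator_loss(d_fake, l_mel, feats_real, feats_fake, lambda_mel=1.0, lambda_fm=1.0):
    """Eq. 6: adversarial + weighted mel + weighted feature-matching terms."""
    l_fm = feature_matching_loss(feats_real, feats_fake)
    return generator_adv_loss(d_fake) + lambda_mel * l_mel + lambda_fm * l_fm

d_real, d_fake = np.array([0.9, 1.1]), np.array([0.2, -0.2])
feats_real = [np.zeros((2, 2)), np.ones(3)]  # activations of two discriminator layers
feats_fake = [np.ones((2, 2)), np.ones(3)]
l_d = discriminator_loss(d_real, d_fake)
l_g = generator_loss(d_fake, l_mel=0.5, feats_real=feats_real, feats_fake=feats_fake)
```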

- Differentiable Reconstruction Loss

The paper additionally uses a pretrained CREPE and diffsptk to re-extract $F0$ and $\text{mcep}$ from the generated waveform $\hat{y}$ in a differentiable manner
- CREPE is an $F0$ estimation method that uses a convolutional network trained to predict pitch directly from the raw audio waveform, and it outputs a probability distribution over pitch
  - The original CREPE model obtains $F0$ from the predicted probabilities with a decoding step such as $\arg\max$ or Viterbi
  - BUT, these decoders contain non-differentiable operations that block gradient backpropagation
- The non-differentiable operations of the original CREPE are therefore reimplemented so that gradients can backpropagate to the input waveform
  1. The final $F0$ value is computed as a weighted sum over the frequency scale, weighted by the predicted pitch probability distribution
    - Because CREPE is trained at a 16kHz sampling rate, the generated waveform $\hat{y}$ and the ground truth $y$ are resampled to 16kHz to obtain $\hat{y}_{rs}, y_{rs}$
  2. $F0$ is then re-extracted from $\hat{y}_{rs}, y_{rs}$ with this weighted sum, and the corresponding loss is computed
- For the $\text{mcep}$ features, diffsptk provides differentiable implementations of framing, windowing, STFT, and related operations
  1. This makes it possible to differentiably extract $\text{mcep}$ features from the synthesized waveform $\hat{y}$ and compute a loss against the ground-truth $\text{mcep}$:
    (Eq. 7) $y_{rs}=Resampler (y), \,\, \hat{y}_{rs}=Resampler(\hat{y})$
    (Eq. 8) $L_{F0}=\lambda_{f0}MSE(\text{F0}(y_{rs}), \text{F0}(\hat{y}_{rs}))$
    (Eq. 9) $L_{mcep}=\lambda_{mcep}||\text{mcep}(y)-\text{mcep}(\hat{y})||_{1}$
    - $\text{F0}(\cdot), \text{mcep}(\cdot)$ : functions that re-extract $F0$ and $\text{mcep}$ features using CREPE and diffsptk, respectively
    - $\lambda_{f0}, \lambda_{mcep}$ : coefficients, $Resampler(\cdot)$ : resampling function
  2. Through these differentiable operations, the gradient of the reconstruction loss can be backpropagated through the HiFi-GAN generator to the other modules of SiFiSinger
    - The parameters of CREPE are kept fixed
- The total loss for the whole training procedure is then:
  (Eq. 10) $L=L_{G}+L_{kl}+L_{am}+L_{dur}+L_{mcep}+L_{F0}$
  (Eq. 11) $L(D)=L_{adv}(D)$
  - $L(D)$ : discriminator loss of (Eq. 3), $L_{G}$ : generator loss of (Eq. 6)
  - During training, $L$ and $L(D)$ are optimized alternately
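
The difference between $\arg\max$ decoding and the weighted-sum decoding can be seen in a small NumPy sketch. Everything here is illustrative: `bin_hz` is a made-up three-bin frequency grid rather than CREPE's actual output grid, weighting in Hz over all bins is an assumption, and NumPy only shows the arithmetic (in training the same expression would be written in an autodiff framework so that gradients reach the waveform).

```python
import numpy as np

def hard_f0(probs, bin_hz):
    """argmax decoding: picks one bin, so it passes no gradient to probs."""
    return bin_hz[np.argmax(probs, axis=-1)]

def soft_f0(probs, bin_hz):
    """Weighted-sum decoding: expected frequency under the per-frame distribution."""
    probs = probs / probs.sum(axis=-1, keepdims=True)
    return probs @ bin_hz

def f0_loss(probs_real, probs_fake, bin_hz, lambda_f0=1.0):
    """Eq. 8 with F0(.) replaced by the weighted-sum decoder."""
    diff = soft_f0(probs_real, bin_hz) - soft_f0(probs_fake, bin_hz)
    return lambda_f0 * np.mean(diff ** 2)

bin_hz = np.array([100.0, 200.0, 300.0])                   # hypothetical pitch bins
probs_real = np.array([[0.0, 1.0, 0.0], [0.5, 0.5, 0.0]])  # 2 frames
probs_fake = np.array([[0.0, 0.0, 1.0], [0.5, 0.5, 0.0]])
f0_soft = soft_f0(probs_real, bin_hz)
f0_hard = hard_f0(probs_real, bin_hz)
l_f0 = f0_loss(probs_real, probs_fake, bin_hz)
```

In the second frame the two decoders disagree: the weighted sum lands between the two equally likely bins, while $\arg\max$ snaps to one of them.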

3. Experiments

- Settings

Dataset : OpenCPop
Comparisons : VISinger2

- Results

Objective Evaluation
- SiFiSinger achieves the best results on the quantitative metrics
- In particular, SiFiSinger shows lower spectral distortion, indicating that it predicts spectral information more accurately

In the pitch contours, VISinger2 produces an incorrect descending tone, whereas SiFiSinger produces a contour close to that of the ground truth

Pitch contours of (left) Ground-truth (middle) VISinger2 (right) SiFiSinger

Subjective Evaluation
- SiFiSinger also scores best in terms of MOS
- Removing the AM source module lowers listener preference in terms of pitch accuracy and harmonic modeling
- Removing the differentiable reconstruction loss also lowers preference for SiFiSinger
