[Paper Review] VISinger: Variational Inference with Adversarial Learning for End-to-End Singing Voice Synthesis


feVeRin 2023. 10. 9. 13:56

VISinger: Variational Inference with Adversarial Learning for End-to-End Singing Voice Synthesis

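An End-to-End Singing Voice Synthesis (SVS) model that generates singing voice directly from lyrics and a musical score
Builds on VITS, an End-to-End Text-to-Speech (TTS) model that adopts a VAE with normalizing flows
VISinger
- Uses a length regulator and a frame prior network instead of phoneme-level mean and variance to model the acoustic variation of singing
- Stable singing voice synthesis through an F0 predictor
- A modified duration predictor to improve the sense of rhythm
Paper (ICASSP 2022) : Paper Link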

1. Introduction

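The SVS task synthesizes singing voice from lyrics and a musical score
- The synthesized singing must pronounce the lyrics correctly and also match the labels of the score
- Singing is hard to synthesize because its acoustic features vary widely and it uses distinct vocal techniques such as vibrato
A typical two-stage SVS model consists of an acoustic model and a vocoder
- The acoustic model generates intermediate acoustic features from the lyrics and score
- The vocoder converts the acoustic features into a waveform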

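BUT, because the acoustic model and the vocoder are trained independently, the two-stage approach suffers from a mismatch between training and inference
- The vocoder uses ground-truth intermediate acoustic features during training but predicted intermediate acoustic features during inference
  - This causes a distribution mismatch between the ground truth and the predictions
- The most straightforward fix is to unify the model in an End-to-End manner
  - Merge the acoustic model and vocoder into one, or replace the mel-spectrogram with a new latent representation so that the ground truth and the predictions are constrained to the same distribution
  - Representative TTS examples include FastSpeech 2s, VITS, and Glow-WaveGAN
- In particular, VITS connects the acoustic model and the vocoder with a VAE and introduces normalizing flows and adversarial training to generate natural speech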

-> VISinger, an End-to-End SVS model based on VITS, is therefore proposed to mitigate the mismatch problem of the two-stage approach

VISinger
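- An End-to-End SVS model that resolves the mismatch problem of two-stage models
- Introduces VITS to singing voice synthesis
  - Uses frame-level mean and variance to model the rich acoustic variation and naturalness of singing
  - Guides the frame prior network for natural intonation
  - Predicts the phoneme-to-note duration ratio to support singing note normalization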

< Overview of VISinger >

Uses a length regulator and a frame prior network in place of the phoneme-level features of VITS
Introduces an F0 predictor for intonation rendering
Modifies the duration predictor to convey the rhythm of singing

2. Method

Based on VITS, the model is formulated as a conditional Variational AutoEncoder (cVAE)
- The cVAE consists of three parts: a posterior encoder, a prior encoder, and a decoder
  1. The posterior encoder extracts the latent representation $z$ from the waveform $y$
    - $z = Enc(y) \sim q(z|y)$
  2. The decoder reconstructs the waveform $\hat{y}$ from $z$
    - $\hat{y} = Dec(z) \sim p(y|z)$
  3. The prior encoder provides the prior distribution $p(z|c)$ of the latent $z$ given the score condition $c$
- The cVAE adopts a reconstruction objective $L_{recon}$ and a prior regularization term
  : $L_{cvae} = L_{recon} + D_{KL}(q(z|y) \,||\, p(z|c)) + L_{ctc}$
  - $D_{KL}$ : Kullback-Leibler divergence, $L_{ctc}$ : connectionist temporal classification loss
  - The reconstruction loss is the L1 distance between the ground-truth and generated waveforms (measured on their mel-spectrograms, as in VITS)
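
The two non-CTC terms of $L_{cvae}$ can be sketched in plain Python. This is only an illustration under simplifying assumptions: both $q(z|y)$ and $p(z|c)$ are treated as diagonal Gaussians so that the KL term has a closed form (with a normalizing flow in the prior, the KL term would be estimated from samples instead), and the function names `kl_diag_gaussians` and `l1_distance` are hypothetical, not taken from the paper.

```python
import math


def kl_diag_gaussians(mu_q, log_sigma_q, mu_p, log_sigma_p):
    """Closed-form KL(q || p) for diagonal Gaussians, summed over dimensions.

    Assumption: no flow is applied, so both distributions are plain Gaussians.
    """
    total = 0.0
    for mq, lq, mp, lp in zip(mu_q, log_sigma_q, mu_p, log_sigma_p):
        var_q = math.exp(2.0 * lq)
        var_p = math.exp(2.0 * lp)
        total += lp - lq + (var_q + (mq - mp) ** 2) / (2.0 * var_p) - 0.5
    return total


def l1_distance(a, b):
    """Mean absolute difference between two equally long feature sequences."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


# Toy usage: identical distributions give zero KL.
kl_same = kl_diag_gaussians([0.0], [0.0], [0.0], [0.0])
recon = l1_distance([1.0, 2.0], [2.0, 4.0])
```

The KL term is zero only when the posterior and the prior coincide, which is what pulls the score-conditioned prior toward the posterior extracted from real audio.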

- Posterior Encoder

The posterior encoder encodes the waveform $y$ into the latent $z$
- The linear spectrum extractor is treated as a fixed signal processing layer of the encoder
  - This layer converts the raw waveform into a linear spectrum
- As in VITS,
  1. The linear spectrum is fed to WaveNet residual blocks to extract a hidden vector,
  2. A linear projection produces the mean and variance of the posterior distribution $q(z|y)$, and
  3. The latent $z$ is sampled from $q(z|y)$ with the reparameterization trick
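
As a minimal sketch of the last step, the reparameterization trick draws $z = \mu + \sigma \cdot \epsilon$ with $\epsilon \sim N(0, 1)$, so that sampling stays differentiable with respect to the predicted mean and variance. The function name `reparameterize` and the choice to parameterize the standard deviation by its logarithm are assumptions made for illustration.

```python
import math
import random


def reparameterize(mu, log_sigma, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, 1), element-wise.

    mu, log_sigma: per-dimension mean and log standard deviation.
    rng: a random.Random instance, passed in so results are reproducible.
    """
    return [m + math.exp(ls) * rng.gauss(0.0, 1.0) for m, ls in zip(mu, log_sigma)]


# Toy usage: one 3-dimensional latent frame.
rng = random.Random(0)
z = reparameterize([0.0, 1.0, -1.0], [0.0, -1.0, -2.0], rng)
```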

- Decoder

The decoder generates the audio waveform from the extracted intermediate representation $z$
- Only a slice of $z$, not the full-length sequence, is fed to the decoder to generate an audio segment
- GAN-based training is used to improve the quality of the reconstructed audio
  - The discriminator $D$ follows the Multi-Period Discriminator (MPD) and Multi-Scale Discriminator (MSD) of HiFi-GAN
- GAN losses for the generator $G$ and the discriminator $D$
  - $L_{adv}(G) = E_{z}[(D(G(z))-1)^{2}]$
  - $L_{adv}(D) = E_{(y,z)}[(D(y)-1)^{2}+(D(G(z)))^{2}]$
- A feature matching loss $L_{fm}$ is additionally introduced for stable training
  - $L_{fm}$ minimizes the L1 distance between the feature maps of real and generated audio extracted from the intermediate layers of each discriminator
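
The least-squares GAN losses and the feature matching loss above can be written out on plain lists of numbers. This is a rough sketch, not the training code: discriminator outputs and feature maps are assumed to be already computed and flattened, and the function names are hypothetical.

```python
def generator_adv_loss(d_fake):
    """L_adv(G): push discriminator scores on generated audio toward 1."""
    return sum((d - 1.0) ** 2 for d in d_fake) / len(d_fake)


def discriminator_adv_loss(d_real, d_fake):
    """L_adv(D): push scores on real audio toward 1 and on generated audio toward 0."""
    real_term = sum((d - 1.0) ** 2 for d in d_real) / len(d_real)
    fake_term = sum(d ** 2 for d in d_fake) / len(d_fake)
    return real_term + fake_term


def feature_matching_loss(feats_real, feats_fake):
    """L_fm: sum over layers of the mean L1 distance between feature maps.

    feats_real / feats_fake: one flattened feature list per discriminator layer.
    """
    total = 0.0
    for layer_real, layer_fake in zip(feats_real, feats_fake):
        total += sum(abs(a - b) for a, b in zip(layer_real, layer_fake)) / len(layer_real)
    return total


# Toy usage with made-up discriminator outputs.
g_loss = generator_adv_loss([0.8, 0.6])
d_loss = discriminator_adv_loss([0.9, 1.0], [0.2, 0.1])
fm_loss = feature_matching_loss([[1.0, 2.0]], [[1.0, 4.0]])
```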

- Prior Encoder

Given the score condition $c$, the prior encoder provides the prior distribution $p(z|c)$ used in the prior regularization term of the cVAE
- The text encoder takes the score as input and produces a phoneme-level representation
- To match the frame rate of $z$, the length regulator of FastSpeech 2 (FS2) is used
  - It expands the phoneme-level representation into the frame-level representation $h_{text}$
- Because singing has large acoustic variation and individual frames may follow different distributions, a frame prior network is added
  - It produces a fine-grained prior normal distribution with mean $\mu_{\theta}$ and variance $\sigma_{\theta}$
- A normalizing flow $f_{\theta}$ and a phoneme predictor are added to increase the expressiveness of the prior distribution and to give the latent $z$ more supervisory information
  - The phoneme predictor consists of two FFT layers, and a CTC loss is computed between its output and the phoneme sequence
  - $p(z|c) = N(f_{\theta}(z); \mu_{\theta}(c), \sigma_{\theta}(c)) \left| \det \frac{\partial f_{\theta}(z)}{\partial z} \right|$
Text Encoder
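- The score contains the lyrics, note durations, and note pitches
  - The lyrics are converted into a phoneme sequence
  - The note duration is the number of frames corresponding to each note
  - The note pitch is a pitch ID following the MIDI standard
- The note duration sequence and the note pitch sequence are expanded to the length of the phoneme sequence
- These three sequences are fed to several FFT blocks to produce the phoneme-level representation of the score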
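
The expansion of note-level information to the phoneme sequence can be sketched as follows. The input layout (a list of dicts with the keys `phonemes`, `pitch`, and `frames`) and the function name are assumptions made for illustration; the paper does not prescribe a data format.

```python
def expand_notes_to_phonemes(notes):
    """Repeat each note's pitch and duration once per phoneme sung on that note.

    notes: list of dicts with hypothetical keys
        "phonemes": list of phoneme symbols sung on the note
        "pitch":    MIDI pitch ID of the note
        "frames":   note duration in frames
    Returns three equally long sequences: phonemes, pitches, durations.
    """
    phonemes, pitches, durations = [], [], []
    for note in notes:
        for phoneme in note["phonemes"]:
            phonemes.append(phoneme)
            pitches.append(note["pitch"])
            durations.append(note["frames"])
    return phonemes, pitches, durations


# Toy usage: two notes carrying two and three phonemes.
score = [
    {"phonemes": ["n", "i"], "pitch": 60, "frames": 40},
    {"phonemes": ["h", "a", "o"], "pitch": 62, "frames": 80},
]
phonemes, pitches, durations = expand_notes_to_phonemes(score)
```

After this step all three sequences have one entry per phoneme, so they can be embedded and combined before the FFT blocks.
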
Length Regulator
- The Length Regulator (LR) module expands the phoneme-level representation into the frame-level representation $h_{text}$
- The ground-truth duration $d$ of each phoneme is used during training, and the predicted duration $\hat{d}$ is used during inference
- The duration predictor consists of several 1D convolution layers
  - Unlike VITS, it does not use a stochastic duration predictor; instead, duration is predicted from the note duration $d_{N}$ given in the score
  - Note Normalization (Note Norm.) : the duration predictor predicts the ratio $r$ of each phoneme duration to the duration of its note
- Duration Loss
  : $L_{dur} = || r \times d_{N} - d ||_{2}$
  - $\hat{d} = r \times d_{N}$ : the product of $r$ and $d_{N}$ is the predicted number of frames $\hat{d}$, which guides the Length Regulator at synthesis time
  - As a result, the predicted phoneme durations $\hat{d}$ stay consistent with the note durations labeled in the score
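
Note normalization and the length regulator can be illustrated together. This is a sketch under assumptions: the ratios $r$ are given directly rather than predicted by a network, rounding to whole frames is my own choice, and all function names are hypothetical.

```python
import math


def predicted_durations(ratios, note_durations):
    """d_hat = r * d_N per phoneme, rounded to a whole number of frames."""
    return [max(0, round(r * d_n)) for r, d_n in zip(ratios, note_durations)]


def duration_loss(ratios, note_durations, gt_durations):
    """L_dur = || r * d_N - d ||_2 over the phoneme sequence."""
    return math.sqrt(
        sum((r * d_n - d) ** 2 for r, d_n, d in zip(ratios, note_durations, gt_durations))
    )


def length_regulate(phoneme_feats, durations):
    """Repeat each phoneme-level feature for its number of frames."""
    frames = []
    for feat, n_frames in zip(phoneme_feats, durations):
        frames.extend([feat] * n_frames)
    return frames


# Toy usage: two phonemes share one 40-frame note with ratios 0.25 and 0.75.
d_hat = predicted_durations([0.25, 0.75], [40, 40])
h_text = length_regulate(["n", "i"], d_hat)
```

When the ratios of the phonemes on one note sum to 1, the predicted durations add up to the note duration, which is how the prediction stays tied to the score. The rounding used here can break that sum by a frame in general, so a real system would need to handle the remainder.
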
Frame Prior Network with F0 Predictor
- In VITS, the text encoder extracts phoneme-level text information, which serves as prior knowledge for the latent $z$
  - In SVS the acoustics vary widely, so a phoneme-level mean and variance alone are not sufficient
  - As a result, audio synthesized with VITS contains many mispronounced words
- A frame prior network is used to provide richer prior information
  - It post-processes the frame-level sequence to obtain the frame-level mean $\mu_{\theta}$ and variance $\sigma_{\theta}$
  - It consists of several 1D convolution layers
- Intonation rendering in particular is essential for natural-sounding singing
  - F0 information is therefore introduced to guide the frame prior network
  - During inference, F0 is obtained from an F0 predictor composed of several FFT blocks
  - LF0 loss : $L_{LF0} = || L\hat{F}0 - LF0 ||_{2}$
  - $L\hat{F}0$ : predicted LF0 (log-F0)
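
A small sketch of the LF0 loss, assuming every frame is voiced so that the logarithm is defined. How unvoiced frames are handled is not covered in this review, and the function names are hypothetical.

```python
import math


def hz_to_lf0(f0_hz):
    """Convert F0 values in Hz to log-F0. Assumes every value is > 0 (voiced)."""
    return [math.log(f) for f in f0_hz]


def lf0_loss(pred_lf0, target_lf0):
    """L_LF0 = || predicted LF0 - ground-truth LF0 ||_2 over frames."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred_lf0, target_lf0)))


# Toy usage: a prediction one octave above the target on a single frame.
loss = lf0_loss(hz_to_lf0([440.0]), hz_to_lf0([220.0]))
```

Working in the log domain means an octave error costs the same at any register, which suits singing where the pitch range is wide.
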
Flow
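- VISinger follows the flow decoder structure of VITS
- The flow consists of multiple affine coupling layers to increase the flexibility of the prior distribution
  - It transforms a normal distribution into a more general distribution and is invertible
- During training, the latent $z$ is transformed into $f_{\theta}(z)$ by the flow
- During inference, the inverse transform converts the output of the frame prior network into the latent $\hat{z}$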
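
A generic affine coupling layer can be sketched in a few lines to show why the flow is invertible and why its Jacobian determinant is cheap to compute. This is a textbook-style illustration, not the exact layer used in VITS or VISinger: `toy_shift` and `toy_log_scale` are hypothetical stand-ins for the neural networks that would normally predict the shift and log-scale.

```python
import math


def toy_shift(z_a):
    """Stand-in for the network that predicts the shift from the first half."""
    return [0.1 * v for v in z_a]


def toy_log_scale(z_a):
    """Stand-in for the network that predicts the log-scale from the first half."""
    return [0.2 * v for v in z_a]


def coupling_forward(z, shift_fn, log_scale_fn):
    """Keep the first half; affinely transform the second half conditioned on it."""
    half = len(z) // 2
    z_a, z_b = z[:half], z[half:]
    shift, log_scale = shift_fn(z_a), log_scale_fn(z_a)
    out_b = [b * math.exp(s) + t for b, s, t in zip(z_b, log_scale, shift)]
    log_det = sum(log_scale)  # log |det Jacobian| of the transform
    return z_a + out_b, log_det


def coupling_inverse(x, shift_fn, log_scale_fn):
    """Exact inverse: the untouched first half reproduces shift and log-scale."""
    half = len(x) // 2
    x_a, x_b = x[:half], x[half:]
    shift, log_scale = shift_fn(x_a), log_scale_fn(x_a)
    z_b = [(b - t) * math.exp(-s) for b, s, t in zip(x_b, log_scale, shift)]
    return x_a + z_b


# Toy usage: forward (training direction) then inverse (inference direction).
z = [0.5, -1.0, 2.0, 3.0]
x, log_det = coupling_forward(z, toy_shift, toy_log_scale)
z_back = coupling_inverse(x, toy_shift, toy_log_scale)
```

Because half of the vector passes through unchanged, the inverse never has to invert the networks themselves, and stacking several such layers (with the halves swapped between layers) gives a flexible yet invertible transform.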

- Final Loss

VISinger is optimized with the cVAE objective and adversarial training
- $L(G) = L_{adv}(G) + L_{fm}(G) + L_{cvae} + \lambda L_{dur} + \beta L_{LF0}$
- $L(D) = L_{adv}(D)$
$L_{adv}(G)$, $L_{adv}(D)$ : GAN losses of $G$ and $D$, respectively
$L_{fm}$ : feature matching loss
$L_{cvae}$ : consists of the reconstruction loss, KL loss, and CTC loss
$\lambda$, $\beta$ : weights of the duration loss and the LF0 loss

3. Experiments

- Settings

Datasets : an in-house recorded singing dataset of 100 songs, 4.7 hours in total
Comparisons : FastSpeech, VITS

- Experiments Results

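VISinger gives the best synthesis quality among the compared models, including on the F0 MAE and Dur MAE metrics
- In addition, because VISinger is End-to-End, it has a smaller model size than the two-stage FastSpeech system

In the visualized spectrograms, the output of VISinger has a clearer high-frequency structure and is the closest to the ground-truth spectrogram
- VISinger can synthesize singing voice with more distinct pronunciation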
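
For reference, both F0 MAE and Dur MAE are mean absolute errors between predicted and ground-truth values. A minimal sketch follows; the function name is hypothetical, and details such as units or the exclusion of unvoiced frames are assumptions not covered in this review.

```python
def mean_absolute_error(pred, target):
    """Average absolute difference between two equally long sequences."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


# Toy usage: phoneme durations in frames (made-up numbers).
dur_mae = mean_absolute_error([10, 30], [12, 27])
```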

- Ablation Study

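Comparison of audio quality when the phoneme predictor, F0 predictor, and frame prior network are each removed
- MOS drops across the board when any of these modules is removed
- Adding more layers to the frame prior network does not lead to better performance

To examine the effect of note normalization, the difference in phoneme-level duration between the ground truth and the predictions is measured
- With note normalization, durations are predicted more accurately and with smaller deviation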
