[Paper Review] Everyone-Can-Sing: Zero-Shot Singing Voice Synthesis and Conversion with Speech Reference


feVeRin 2025. 5. 16. 17:57

Everyone-Can-Sing: Zero-Shot Singing Voice Synthesis and Conversion with Speech Reference

A unified framework that supports cross-domain singing voice synthesis is needed
Everyone-Can-Sing
- Supports control over multiple aspects: language content based on the lyrics, performance attributes based on the musical score, singing style, and vocal technique
- Uses pre-trained content embeddings and a diffusion-based generator
Paper (ICASSP 2025) : Paper Link

1. Introduction

Singing Voice Synthesis (SVS) aims to generate a singing voice signal from a music score
- BUT, existing SVS models are limited in synthesizing unseen voices in the zero-shot setting
  1. In particular, synthesis quality degrades when only a brief voice reference of a few seconds is given
  2. To address this, zero-shot speech synthesis based on voice content disentanglement, such as NANSY and NANSY++, can be considered
- Singing Voice Conversion (SVC), unlike SVS, does not use a music score input; it aims to change the singer's voice while preserving the content of an existing singing sample
  - BUT, like SVS, SVC also fails to reflect expressiveness effectively in the zero-shot setting

-> The paper therefore proposes Everyone-Can-Sing, which supports expressive zero-shot cross-domain SVS and SVC from short speech audio

Everyone-Can-Sing
- Centers on voice timbre transfer and controls prosody and style through additional input conditions
- Uses pre-trained disentangled representations and conditions on expressive performance attributes, such as the pitch curve and pronunciation, as granular representations
- Incorporates pre-training, fine-tuning, and mixed-training strategies based on singing and speech datasets

< Overview of Everyone-Can-Sing >

A zero-shot cross-domain SVS and SVC model based on fine-grained disentanglement of linguistic content, performance attributes, and singing style
As a result, it achieves better performance than existing methods

2. Method

- Unified Framework

Everyone-Can-Sing uses one zero-shot SVS model and two zero-shot SVC models
- The paper breaks singing down into three components:
  1. Musical perspective attributes (expressive performance control attributes)
    - Capture how a singer interprets the music score with personal style and emotion
    - For this, the fundamental frequency ($F0$) curve is used for the pitch contour and the amplitude envelope for dynamics
    - Performance timing is embedded in each time frame
  2. Pronunciation of the lyrics
    - Represented as a time sequence of embedding vectors that includes phoneme timing
  3. Voice timbre
    - A time-independent embedding that captures timbre information in speech and singing
    - Because performance control can vary with lyrics pronunciation and timbre, a style token is added as a conditional input
- Each component is modeled separately, and disentanglement between the components is learned during training
- At inference, the main synthesizer uses all three components to generate the output singing
  - Structurally, it consists of an acoustic model that generates a mel-spectrogram from the component outputs and a vocoder for mel-to-waveform conversion
- In SVS, the performance control attributes are generated from the musical score and the style token, whereas in SVC they are extracted by applying signal processing to the input singing sample
  - In addition, SVC obtains the lyrics pronunciation either by disentangling it from the singing sample or by using aligned lyrics
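
To make the three-component split concrete, here is a minimal sketch of how frame-level conditioning for the acoustic model could be assembled. The function names, the embedding sizes, and the simple concatenation are illustrative assumptions, not the paper's actual interface: phoneme embeddings are repeated for their durations in frames, the time-independent timbre embedding is broadcast over all frames, and both are joined with the per-frame $F0$ and amplitude values.

```python
import numpy as np

def expand_phonemes(phoneme_emb, durations):
    """Repeat each phoneme embedding for its duration in frames."""
    return np.repeat(phoneme_emb, durations, axis=0)

def build_conditioning(f0, amplitude, phoneme_emb, durations, timbre_emb):
    """Assemble a (frames, features) conditioning matrix.

    Hypothetical layout (an assumption for illustration only):
    [F0, amplitude | frame-level pronunciation | broadcast timbre].
    """
    frames = expand_phonemes(phoneme_emb, durations)           # (T, D_p)
    n_frames = frames.shape[0]
    if len(f0) != n_frames or len(amplitude) != n_frames:
        raise ValueError("F0 and amplitude need one value per frame")
    control = np.stack([f0, amplitude], axis=1)                # (T, 2)
    # The timbre embedding is time-independent, so copy it to every frame.
    timbre = np.broadcast_to(timbre_emb, (n_frames, timbre_emb.shape[0]))
    return np.concatenate([control, frames, timbre], axis=1)

rng = np.random.default_rng(0)
phoneme_emb = rng.normal(size=(3, 4))     # 3 phonemes, 4-dim embeddings
durations = np.array([2, 3, 1])           # frames per phoneme (6 in total)
f0 = np.full(6, 220.0)                    # flat pitch curve in Hz
amplitude = np.linspace(0.1, 0.6, 6)      # rising amplitude envelope
timbre_emb = rng.normal(size=8)           # one vector for the whole utterance

cond = build_conditioning(f0, amplitude, phoneme_emb, durations, timbre_emb)
print(cond.shape)
```

In this sketch only the source of `f0` and `amplitude` would differ between SVS (predicted from the score and style token) and SVC (measured on the input singing sample); the rest of the conditioning keeps the same shape.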

- Zero-Shot Singing Voice Synthesis

The paper's zero-shot SVS model takes as input a symbolic score containing the pitch and duration of each note, a style token indicating genre/technique, lyrics aligned with the score, and a 5-second speech reference
- From these, it outputs a singing audio waveform that matches the timbre of the speech reference while adhering to the score and style controls
- Structurally, it builds on ExpressiveSinger with the following modifications:
  1. Adopts Resemblyzer, a pre-trained voice encoder, in place of a symbolic singer ID
  2. Applies Leaky ReLU in the pronunciation content encoder after the Transformer and fully connected layers
- In particular, the paper directly adapts the trained modules for performance timing, $F0$, and amplitude control
  1. The generated performance timing is passed to the $F0$ and amplitude modules and is aligned with the content encoder phonemes
    - Here the paper uses Resemblyzer as the pre-trained speaker embedding model and BigVGAN as the vocoder of the main synthesizer
  2. As a result, only the pronunciation content encoder and the main acoustic model need to be trained
    - This is done with a diffusion-based procedure and a reconstruction loss
- In addition, mixed training is performed with singing and speech data at a $1:1$ ratio
  1. During training, the $F0$, amplitude, and voice target embedding are extracted from the ground-truth examples
  2. At inference, the voice target is replaced with an unseen speech reference
    - To handle a potential pitch range mismatch between singing and speech, a pitch adjustment is applied that shifts the target music score to within 1 octave of the speech reference
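
The pitch adjustment at inference can be illustrated with a small sketch. The review does not say how the shift is computed, so everything here is an assumption: the score is moved by whole octaves (multiples of 12 semitones) so that its median pitch lands near the median voiced $F0$ of the speech reference. The helper names `hz_to_midi`, `octave_shift_for_score`, and `adjust_score` are hypothetical.

```python
import math
from statistics import median

def hz_to_midi(freq_hz):
    """Convert a frequency in Hz to a (fractional) MIDI note number."""
    return 69.0 + 12.0 * math.log2(freq_hz / 440.0)

def octave_shift_for_score(score_midi, reference_f0_hz):
    """Semitone shift (a multiple of 12) that moves the score toward the reference.

    Assumption: both pitch ranges are summarized by their medians, and
    unvoiced reference frames are marked with 0 Hz and ignored.
    """
    voiced = [f for f in reference_f0_hz if f > 0]
    if not voiced or not score_midi:
        raise ValueError("need at least one voiced frame and one note")
    ref_center = median(hz_to_midi(f) for f in voiced)
    score_center = median(score_midi)
    # Rounding to the nearest octave leaves a gap of at most 6 semitones.
    return 12 * round((ref_center - score_center) / 12.0)

def adjust_score(score_midi, reference_f0_hz):
    """Return the score transposed by the computed octave shift."""
    shift = octave_shift_for_score(score_midi, reference_f0_hz)
    return [note + shift for note in score_midi]

score = [72, 74, 76, 77, 79]                     # melody around C5
reference_f0 = [0.0, 110.0, 110.0, 0.0, 110.0]   # low speaking voice near A2
adjusted = adjust_score(score, reference_f0)
print(adjusted)
```

Shifting by whole octaves keeps the key of the song unchanged, which is why it is used in this sketch; a finer semitone-level transposition would be another reasonable reading of the text.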

- Zero-Shot SVC Given Lyrics Alignment

The paper's SVC module generates converted singing from a singing sample, an unseen speech reference, and phoneme-level lyrics fed to the pronunciation content encoder
- The aligned lyrics could be extracted with a recognition & alignment model, but because the accuracy of that approach is low, an annotated dataset is used instead
- This SVC module is a modified version of the SVS model that extracts the $F0$ curve and amplitude envelope from the singing sample
  - Because this modification does not affect training, the trained SVS module can be reused without further adjustment
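
As a rough picture of the signal-processing step, the sketch below extracts a frame-wise amplitude envelope (RMS) and an $F0$ curve from a waveform. The review does not name the extractor used in the paper, so the plain autocorrelation pitch tracker, the frame sizes, and the search range are stand-in assumptions chosen to keep the example dependency-free.

```python
import numpy as np

def frame_signal(x, frame_len, hop):
    """Slice a 1-D signal into overlapping frames of shape (n_frames, frame_len)."""
    n_frames = 1 + (len(x) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx]

def amplitude_envelope(x, frame_len=1024, hop=256):
    """Frame-wise RMS, used here as a simple amplitude envelope."""
    frames = frame_signal(x, frame_len, hop)
    return np.sqrt(np.mean(frames ** 2, axis=1))

def f0_autocorr(x, sr, frame_len=1024, hop=256, fmin=80.0, fmax=1000.0):
    """Frame-wise F0 in Hz from the autocorrelation peak; 0 marks silent frames."""
    frames = frame_signal(x, frame_len, hop)
    min_lag, max_lag = int(sr / fmax), int(sr / fmin)
    f0 = np.zeros(len(frames))
    for i, frame in enumerate(frames):
        frame = frame - frame.mean()
        ac = np.correlate(frame, frame, mode="full")[frame_len - 1:]
        if ac[0] <= 1e-8:          # no energy: leave the frame as unvoiced
            continue
        lag = min_lag + int(np.argmax(ac[min_lag:max_lag + 1]))
        f0[i] = sr / lag
    return f0

sr = 16000
t = np.arange(sr // 2) / sr
singing = 0.5 * np.sin(2 * np.pi * 220.0 * t)   # synthetic stand-in for a sung note

f0_curve = f0_autocorr(singing, sr)
env = amplitude_envelope(singing)
print(f0_curve[:3], env[:3])
```

Because both curves are computed directly from the input audio, they can replace the predicted controls of the SVS model at inference without retraining, which matches the reuse described above.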

- Zero-Shot SVC with Local Content Embedding

When content embeddings are used, only the singing sample and an unseen reference are needed
- The pronunciation embedding is extracted from the singing sample by the GR0 content encoder, so aligned lyrics are not required
  - This encoder adopts a pre-trained Wav2Vec 2.0 trained with a CTC loss, so it does not contain timbre information
- Accordingly, the paper replaces the SVS lyrics content encoder with the GR0 embedding and trains the acoustic model for synthesis
  - Because the GR0 content encoder is pre-trained on an extensive speech dataset, it can effectively disentangle the pronunciation component

3. Experiments

- Settings

Dataset : LibriTTS
Comparisons : ExpressiveSinger

- Results

Overall, Everyone-Can-Sing achieves the best performance

Ablation Study
- Everyone-Can-Sing also achieves strong synthesis performance across a variety of timbre and style variations

