[Paper Review] TCSinger2: Customizable Multilingual Zero-Shot Singing Voice Synthesis

feVeRin 2025. 6. 27. 12:47

TCSinger2: Customizable Multilingual Zero-Shot Singing Voice Synthesis

Existing Singing Voice Synthesis models lack multi-level style control through diverse prompts
TCSinger2
- Uses a Blurred Boundary Content Encoder to predict durations and extend content embeddings, supporting smooth transitions
- Uses a Custom Audio Encoder to extract aligned representations from singing, speech, and textual prompts
- Additionally uses a Flow-based Custom Transformer to improve style modeling
Paper (ACL 2025) : Paper Link

1. Introduction

Zero-shot Singing Voice Synthesis (SVS) aims to generate high-quality singing voices from audio or textual prompts
- For multi-level style control, a textual prompt can shape the broader singing style, including emotional expression and segment-/word-level techniques
  - An audio prompt additionally carries accent and pronunciation
- BUT, existing SVS models struggle to generate singing voices with high controllability
- Customizable multilingual zero-shot SVS therefore has to solve the following problems:
  1. Existing SVS models rely heavily on phoneme and note boundary annotations
    - This causes poor transitions between phonemes and notes in zero-shot scenarios
  2. Existing SVS models are limited in multi-level style control

-> The paper therefore proposes TCSinger2, which supports multi-style control and high-quality zero-shot SVS

TCSinger2
- Uses textual, speech, and singing prompts for style control
- Introduces the Blurred Boundary Content (BBC) Encoder for robust modeling of phoneme and note boundaries
- Uses a contrastive-learning-based Custom Audio Encoder to extract aligned representations from the different prompts, and adopts a Flow-based Custom Transformer for high-quality synthesis

< Overview of TCSinger2 >

A multilingual zero-shot SVS model that supports singing style control through diverse prompts
According to the paper's experiments, it outperforms the existing baselines

2. Method

- Overview

For a target length $T$, let $y_{gt}$ be the ground-truth singing voice and $m_{gt}\in\mathbb{R}^{80\times T}$ its mel-spectrogram
- The Custom Audio Encoder first compresses $m_{gt}$ into $\hat{m}_{gt}$, and the generation process is given by $G(\epsilon |C,P)\rightarrow \hat{m}_{pr}\rightarrow m_{gt}$
  - $\epsilon$ : Gaussian noise, $C$ : condition
  - $C$ contains the lyrics $l$ and the music notation $n$ extracted from the music score, and $P$ is one of the singing prompt $p_{si}$, speech prompt $p_{sp}$, or textual prompt $p_{te}$
- The lyrics $l$ and notation $n$ are then fed into the BBC Encoder, which predicts durations, extends the content embeddings, and masks the boundaries to support smooth transitions, producing $z_{c}$
- The Custom Audio Encoder uses contrastive learning to extract consistent representations from singing, speech, and textual prompts
  1. For style transfer from an audio prompt $p_{a}$ ($p_{si}$ or $p_{sp}$), it extracts a style-rich representation $z_{pa}$
  2. A textual prompt $p_{te}$ is encoded into a multi-style controlling representation $z_{pt}$
- Finally, the Flow-based Custom Transformer uses $z_{c}, z_{t}, z_{pt}, z_{pa}$, and related inputs to generate the predicted singing voice $y_{pr}$

- BBC Encoder

Existing SVS models depend on precise phoneme and note boundary annotations, but such annotated datasets are scarce, which leads to poor transitions between phonemes and notes and to mislearned phonemes and pitch
- The paper therefore introduces the Blurred Boundary Content (BBC) Encoder to improve naturalness in the zero-shot setting
  1. The lyrics $l$ and notes $n$ are first encoded separately; the encoder then predicts durations and extends the content embeddings into a frame-level sequence with precise boundaries, $[z_{c1},z_{c1},z_{c2},z_{c2},...,z_{cn}]$
  2. It then randomly masks $m$ tokens in each phoneme, producing $[z_{c1},\varnothing,z_{c2},z_{c2},\varnothing,...,z_{cn}]$
    - Adjusting $m$ balances supervision against robustness (the paper sets $m=8$)
- The BBC Encoder thus yields blurred boundaries, which are later refined in the Flow-based Custom Transformer, where the self-attention mechanism establishes a fine-grained implicit alignment path
- As a result, the BBC Encoder lets the training data be expanded with roughly aligned datasets and improves transition naturalness, which raises zero-shot SVS quality
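
The expand-then-mask step described above can be sketched in plain Python. This is only an illustration under stated assumptions: the function name `expand_and_blur`, the use of `None` as the empty token, uniform sampling of masked positions within a phoneme, and the rule of keeping at least one unmasked frame per phoneme are all hypothetical choices, not details taken from the paper.

```python
import random

MASK = None  # stands in for the empty token (the ∅ in the sequence above)

def expand_and_blur(embeddings, durations, m, rng):
    """Repeat each phoneme embedding for its duration, then mask up to m frames of it."""
    frames = []
    for emb, dur in zip(embeddings, durations):
        segment = [emb] * dur
        # Assumption: keep at least one real frame per phoneme, so a short
        # phoneme is never erased completely.
        k = max(0, min(m, dur - 1))
        # Assumption: masked positions are sampled uniformly inside the phoneme.
        for idx in rng.sample(range(dur), k):
            segment[idx] = MASK
        frames.extend(segment)
    return frames

# Toy usage: three phoneme embeddings (strings as placeholders) and frame durations.
frames = expand_and_blur(["z_c1", "z_c2", "z_c3"], [4, 3, 5], m=2, rng=random.Random(0))
print(frames)
```

A larger `m` removes more of the frame-level supervision and forces the downstream model to infer boundaries itself, which is the supervision/robustness trade-off mentioned above.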

- Custom Audio Encoder

Singing style involves complex factors such as timbre, singing method, emotion, and technique, so it is hard to extract rich multi-level style representations while compressing the singing voice's mel-spectrogram
- To address this, the paper extracts a triplet $(z_{p_{si}},z_{p_{sp}},z_{p_{te}})$ based on the singing prompt $p_{si}$, speech prompt $p_{sp}$, textual prompt $p_{te}$, and content $C$
  1. Reconstruction is additionally performed so that $z_{p_{si}}$ does not compromise the integrity of the singing voice
  2. The singing/speech encoders and the audio decoder are VAE-based, while the textual encoder applies cross-attention to combine the music score with the textual prompt and extract content and multi-level style
- Contrastive learning aligns the triplet so that it carries a unified style:
  1. The contrasts come in three types: same content-different style, similar style-different content, and different style-different content
  2. Training uses the following contrastive objective:
    (Eq. 1) $\mathcal{L}_{p_{si}^{i},p_{sp}^{i}}=\log \frac{\exp\left(\text{sim}(z_{p_{si}}^{i},z_{p_{sp}}^{i})/\tau\right)}{\sum_{j=1}^{N}\exp\left(\text{sim}(z_{p_{si}}^{i},z_{p_{sp}}^{j})/\tau\right)} + \log\frac{\exp\left(\text{sim}(z_{p_{sp}}^{i},z_{p_{si}}^{i})/\tau\right)}{\sum_{j=1}^{N}\exp\left(\text{sim}(z_{p_{sp}}^{i},z_{p_{si}}^{j})/\tau\right)}$
    - $\text{sim}(\cdot)$ : cosine similarity, $\tau$ : temperature
  3. The resulting total loss is $\mathcal{L}_{contras}=-\frac{1}{6N}\sum_{i=1}^{N}\left(\mathcal{L}_{p_{si}^{i},p_{sp}^{i}}+\mathcal{L}_{p_{sp}^{i},p_{te}^{i}}+\mathcal{L}_{p_{si}^{i},p_{te}^{i}}\right)$
    - This aligns the three embeddings in the same space
- The Audio Decoder is trained with an $L2$ loss $\mathcal{L}_{rec}$ and an LSGAN-style adversarial loss $\mathcal{L}_{adv}$ for better reconstruction
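
As a rough illustration of the contrastive objective (Eq. 1 and the total loss), the following pure-Python sketch computes the symmetric pairwise term and averages it over the three prompt pairs. The function names and the default `tau=0.1` are assumptions made for this example; the review does not state the temperature used in the paper, and embeddings are plain lists of floats rather than tensors.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def pair_term(za, zb, i, tau):
    """One direction of Eq. 1: log-softmax of sim(za[i], zb[i]) over all zb[j]."""
    logits = [cosine(za[i], zb[j]) / tau for j in range(len(zb))]
    log_denominator = math.log(sum(math.exp(x) for x in logits))
    return logits[i] - log_denominator

def pair_loss(za, zb, i, tau):
    """Both directions of Eq. 1 for sample i."""
    return pair_term(za, zb, i, tau) + pair_term(zb, za, i, tau)

def contrastive_loss(z_si, z_sp, z_te, tau=0.1):
    """Total loss: -1/(6N) times the sum over samples of the three pairwise terms."""
    n = len(z_si)
    total = 0.0
    for i in range(n):
        total += pair_loss(z_si, z_sp, i, tau)
        total += pair_loss(z_sp, z_te, i, tau)
        total += pair_loss(z_si, z_te, i, tau)
    return -total / (6 * n)

# Toy usage: matching embeddings across the three prompt types give a small loss.
aligned = [[1.0, 0.0], [0.0, 1.0]]
print(contrastive_loss(aligned, aligned, aligned))
```

Each pairwise term is a log-probability, so it is never positive; the leading minus sign in the total loss turns the objective into a quantity to minimize.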

- Flow-based Custom Transformer

Flow-based Transformer
- The paper introduces the Flow-based Custom Transformer for singing style modeling
  1. During training, Gaussian noise $\epsilon$ is first added to the encoder output $\hat{m}_{gt}$ to obtain $x_{t}$ at timestep $t$
  2. $x_{t}$ is then concatenated with the content embedding $z_{c}$ from the BBC Encoder and the optional prompt embedding $z_{pa}$ ($z_{p_{si}}$ or $z_{p_{sp}}$) from the Custom Audio Encoder
    - This lets the model learn content and style transfer through self-attention
  3. When a textual prompt is used, it is encoded into $z_{pt}$ and concatenated in place of $z_{pa}$
  4. RoPE lets the model capture dependencies between sequential frames
- The output vector field at each $t$ is trained with the following flow-matching objective:
  (Eq. 2) $\mathcal{L}_{flow}=\mathbb{E}_{t,p_{t}(x_{t})}\left|\left| v_{t}(x_{t},t|C;\theta)-(\hat{m}_{gt}-\epsilon)\right|\right|^{2}$
  - $p_{t}(x_{t})$ : distribution of $x_{t}$ at timestep $t$
- The paper additionally uses the first block's output to predict $F0$, which provides supervision and serves as input to the subsequent blocks
- At inference, $\epsilon$ is combined with the conditions to generate the target $\hat{m}_{pr}$ in fewer timesteps
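
A small sketch of how a training pair for Eq. 2 could be built. It assumes a straight-line path $x_{t}=t\,\hat{m}_{gt}+(1-t)\,\epsilon$, which is the interpolation whose velocity equals the regression target $\hat{m}_{gt}-\epsilon$ in Eq. 2; the review only says that noise is added, so the exact path is an assumption here. The function names are hypothetical, and vectors are flat lists of floats.

```python
import random

def flow_matching_pair(m_hat, eps, t):
    """Return x_t on a straight path from noise (t=0) to data (t=1), plus the target."""
    # Assumed linear interpolation between the noise sample and the latent.
    x_t = [t * m + (1.0 - t) * e for m, e in zip(m_hat, eps)]
    # Regression target of Eq. 2: the constant velocity of that straight path.
    target = [m - e for m, e in zip(m_hat, eps)]
    return x_t, target

def flow_loss(v_pred, target):
    """Squared L2 distance between the predicted and target vector fields."""
    return sum((v - u) ** 2 for v, u in zip(v_pred, target))

# Toy usage with a 4-dimensional "latent" and a random timestep.
rng = random.Random(0)
m_hat = [0.5, -1.0, 0.25, 2.0]
eps = [rng.gauss(0.0, 1.0) for _ in m_hat]
x_t, target = flow_matching_pair(m_hat, eps, t=rng.random())
print(flow_loss([0.0] * len(target), target))
```

In a real model `v_pred` would come from the transformer conditioned on $x_{t}$, $t$, and $C$; here a zero vector is used only to show the call.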
Cus-MOE
- For high-quality multilingual generation and better style modeling, TCSinger2 additionally introduces Cus-MOE (Mixture of Experts)
- Structurally, Cus-MOE consists of two expert groups that focus on linguistic and stylistic conditions, respectively
  1. Lingual-MOE selects experts based on the lyric language, and each expert specializes in a particular language family
    - Domain-specific experts are used to improve generation quality for each language family
  2. Stylistic-MOE is conditioned on the audio or textual prompt to match fine-grained styles
- The routing strategy is a dense-to-sparse Gumbel-Softmax, which reparameterizes the categorical variable to support differentiable sampling and dynamic routing
  1. Let $h$ be the hidden representation and $g(h)_{i}$ the routing score for expert $i$
  2. To prevent overloading, the paper uses a load-balancing loss:
    (Eq. 3) $ \mathcal{L}_{balance}=\alpha N\sum_{i=1}^{N}\left(\frac{1}{B}\sum_{h\in B}g(h)_{i}\right)$
    - $B$ : batch size, $N$ : number of experts, $\alpha$ : regularization strength
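
The routing step and Eq. 3 can be illustrated with a minimal pure-Python sketch. The temperature values, the function names, and the interpretation of "dense-to-sparse" as lowering the temperature over time are assumptions for this example; `balance_loss` follows Eq. 3 literally as written above, and how $g(h)$ is normalized in the paper is not specified in this review.

```python
import math
import random

def gumbel_softmax(logits, temperature, rng):
    """Soft, differentiable-style sample over experts using Gumbel(0, 1) noise."""
    noisy = []
    for x in logits:
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)  # keep log() finite
        gumbel = -math.log(-math.log(u))
        noisy.append((x + gumbel) / temperature)
    peak = max(noisy)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]

def balance_loss(scores, alpha):
    """Eq. 3 as written: alpha * N * sum_i (batch mean of g(h)_i)."""
    batch = len(scores)
    n_experts = len(scores[0])
    mean_per_expert = [sum(row[i] for row in scores) / batch for i in range(n_experts)]
    return alpha * n_experts * sum(mean_per_expert)

# Toy usage: a high temperature spreads weight over experts (dense),
# a low temperature concentrates it on one expert (sparse).
logits = [1.0, 2.0, 0.5]
print(gumbel_softmax(logits, temperature=5.0, rng=random.Random(0)))
print(gumbel_softmax(logits, temperature=0.1, rng=random.Random(0)))
```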

- Training and Inference Procedures

Training Procedures
- The final loss for the pre-trained Custom Audio Encoder and Decoder includes:
  1. $\mathcal{L}_{contras}$ : contrastive objective for contrastive learning
  2. $\mathcal{L}_{rec}$ : $L2$ reconstruction loss
  3. $\mathcal{L}_{adv}$ : LSGAN-style adversarial loss from the GAN discriminator
- The final loss terms of TCSinger2 are:
  1. $\mathcal{L}_{dur}$ : log-scale MSE phoneme-level duration loss of the BBC Encoder
  2. $\mathcal{L}_{pitch}$ : log-scale MSE pitch loss
  3. $\mathcal{L}_{balance}$ : load-balancing loss for each expert group of Cus-MOE
  4. $\mathcal{L}_{flow}$ : flow-matching loss of the Flow-based Custom Transformer
Inference Procedures
- TCSinger2 supports multiple inference tasks depending on the input prompt
  1. For an unseen singing prompt it performs zero-shot style transfer, and when the lyrics and the singing prompt are in different languages it performs cross-lingual style transfer
  2. Given a natural-language textual prompt, TCSinger2 performs multi-level style control
  3. Given a speech prompt, it performs Speech-to-Singing style transfer
- The paper additionally incorporates Classifier-Free Guidance (CFG): during training the input prompt is randomly dropped with probability $0.2$, and at inference the output vector field is modified as follows:
  (Eq. 4) $ v_{cfg}(x,t|C,P;\theta)=\gamma v_{t}(x,t|C,P;\theta)+(1-\gamma)v_{t}(x,t|C,\varnothing;\theta)$
  - $\gamma=3$ : CFG scale
- Finally, TCSinger2 generates the singing voice through the accelerated inference of flow matching
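
Eq. 4 and a few-step sampler can be sketched as follows. The fixed-step Euler integrator and the function names are assumptions for illustration: the review only says that flow matching allows accelerated inference, not which ODE solver or step count the paper uses. The velocity function below is a toy stand-in for the trained model.

```python
def cfg_velocity(v_cond, v_uncond, gamma):
    """Eq. 4: gamma * v(x,t|C,P) + (1 - gamma) * v(x,t|C,∅)."""
    return [gamma * c + (1.0 - gamma) * u for c, u in zip(v_cond, v_uncond)]

def euler_sample(x0, velocity_fn, num_steps):
    """Integrate dx/dt = v(x, t) from t=0 (noise) to t=1 with fixed-step Euler."""
    x = list(x0)
    dt = 1.0 / num_steps
    for step in range(num_steps):
        t = step * dt
        v = velocity_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

# Toy usage: hypothetical conditional and unconditional predictions that are
# constant in x and t, combined with the CFG scale gamma = 3 from the text.
def toy_velocity(x, t):
    v_cond = [1.0, -2.0]    # stand-in for v_t(x, t | C, P)
    v_uncond = [0.5, -1.0]  # stand-in for v_t(x, t | C, ∅)
    return cfg_velocity(v_cond, v_uncond, gamma=3.0)

print(euler_sample([0.0, 0.0], toy_velocity, num_steps=4))
```

With $\gamma>1$ the guided field extrapolates away from the unconditional prediction, which strengthens the influence of the prompt $P$; with $\gamma=1$ it reduces to the conditional prediction.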

3. Experiments

- Settings

Dataset : OpenCPop, M4Singer, OpenSinger, PopBuTFy, GTSinger
Comparisons : StyleTTS2, CosyVoice, VISinger2, TCSinger

- Results

Overall, TCSinger2 achieves the best performance among the compared models

Style Control
- TCSinger2 also shows the best performance in multi-level style control

The mel-spectrograms show that TCSinger2 controls the various styles effectively

Speech-to-Singing
- TCSinger2 also performs well on Speech-to-Singing style transfer

Ablation Study
- Removing any of the components degrades performance
