[Paper Review] X-Singer: Code-Mixed Singing Voice Synthesis via Cross-Lingual Learning

feVeRin 2024. 11. 2. 10:10

X-Singer: Code-Mixed Singing Voice Synthesis via Cross-Lingual Learning

Singing Voice Synthesis still depends on annotated musical scores and struggles to generate code-mixed singing voices
X-Singer
- Introduces a music score encoder that handles music scores made up of code-mixed lyrics without phoneme annotation
  - The music score encoder adopts language code-switching to encode code-mixed lyrics and uses mixture alignment to reduce the dependence on phoneme annotation
- Additionally uses a conditional flow matching-based decoder to improve synthesis quality
Paper (INTERSPEECH 2024) : Paper Link

1. Introduction

Singing Voice Synthesis (SVS) predicts acoustic features from a phoneme-level annotated musical score that contains lyrics and note information
- SVS generally performs well on mono-lingual tasks, but it depends on phoneme-level annotation of the Musical Score (MS) and is limited when it comes to code-mixed singing voices
- Meanwhile, a realistic MS consists of code-mixed lyrics without phoneme-level annotation, as in the figure below, which exposes the following limitations:
  1. Most methods are trained with phoneme-level annotation like (b) in the figure below
    - So to infer an unseen singing voice, the realistic MS has to be converted into phoneme-level annotation through forced alignment or heuristics
  2. Existing SVS datasets lack multi-lingual songs because annotation is difficult
    - So a realistic MS containing code-mixed lyrics is hard to handle
  3. Converting lyrics from different grapheme-/phoneme-based systems into the International Phonetic Alphabet (IPA) gives imprecise results, so it is hard to represent lyrics as unified tokens

-> The paper therefore proposes X-Singer, a code-mixed SVS model that synthesizes stably even from a realistic MS

X-Singer
- Introduces an MS encoder to handle realistic MS
  - The MS encoder applies code-switching to code-mixed lyrics and uses mixture alignment to relax the dependency on phoneme-level annotation
- Additionally adopts a Conditional Flow Matching (CFM)-based decoder to improve synthesis quality

< Overview of X-Singer >

A code-mixed SVS model that leverages cross-lingual learning
As a result, it achieves better synthesis performance than existing methods

2. Method

- Musical Score Encoder

The paper uses the MS encoder to extract a musical score representation from a realistic MS
- Structurally, the MS encoder has a lyrics encoder, a melody encoder, and phoneme-to-phoneme cross-attention
- First, the lyrics encoder consists of feed-forward transformer blocks and Mix-Layer Normalization (Mix-LN) transformer blocks
  1. When training on a mixture of mono-lingual SVS datasets that generally do not share the same lyrics, the lyrics representation can become associated with singer identity
    - So the language information has to be disentangled from singer identity to extract a code-mixed lyrics representation
  2. To this end, the paper adopts the Mix-LN transformer, which reduces bias toward the singer and improves generalization to unseen scenarios
  3. Mix-LN mixes the feature statistics of speaker embeddings to produce mismatched speaker information that confuses the model:
    (Eq. 1) $\text{Mix-LN}(h, e_{spk}) = \gamma_{mix}(e_{spk}) \frac{h - \mu}{\sigma} + \beta_{mix}(e_{spk})$
    (Eq. 2) $\gamma_{mix}(e_{spk}) = \lambda \gamma(e_{spk}) + (1 - \lambda) \gamma(\tilde{e}_{spk})$
    (Eq. 3) $\beta_{mix}(e_{spk}) = \lambda \beta(e_{spk}) + (1 - \lambda) \beta(\tilde{e}_{spk})$
    - $\mu, \sigma$ : mean and variance of the hidden representation $h$
    - $\gamma, \beta$ : simple linear layers that take the speaker embedding $e_{spk}$
    - $\tilde{e}_{spk}$ : obtained by shuffling $e_{spk}$ along the batch axis
    - $\lambda$ : sampled from the Beta distribution $\lambda \sim \text{Beta}(\alpha, \alpha)$ ( $\alpha = 0.2$ )
- The lyrics encoder concatenates the lyrics and language embedding sequences together with a positional embedding
  1. Before concatenation, the paper converts the lyrics sequence into International Phonetic Alphabet (IPA) symbols
  2. Additionally, a phoneme-level language embedding is used for language code-switching
- The phoneme-to-note encoder compresses the lyrics encoder output with a note-level average pooling operation
  - It then extracts a note-level lyrics representation from the compressed lyrics representation
- The melody encoder uses the note-related features of the musical score (pitch, duration, tempo)
  1. The melody encoder concatenates the note-level pitch, duration, and tempo embeddings with a positional embedding and then extracts a note-level melody representation
  2. The sum of the note-level lyrics representation and the melody representation is then expanded according to note duration
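The Mix-LN operation in Eq. 1-3 can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`mix_ln`, `linear`, `layer_stats`), the `(weight, bias)` parameter layout, and the `1e-5` stabilizer are hypothetical choices for the example. The normalization divides by the standard deviation, as layer normalization is usually implemented, and a single $\lambda$ is drawn per batch, which the source does not specify.

```python
import math
import random


def layer_stats(h):
    """Mean and standard deviation of one hidden vector h."""
    mu = sum(h) / len(h)
    var = sum((x - mu) ** 2 for x in h) / len(h)
    return mu, math.sqrt(var + 1e-5)  # 1e-5 is an assumed stabilizer


def linear(e, weight, bias):
    """Simple linear layer: weight is (out_dim x in_dim), bias is (out_dim)."""
    return [sum(w * x for w, x in zip(row, e)) + b for row, b in zip(weight, bias)]


def mix_ln(h_batch, e_batch, gamma_params, beta_params, alpha=0.2, rng=random):
    """Mix-LN over a batch of hidden vectors (Eq. 1-3).

    h_batch: list of hidden vectors; e_batch: list of speaker embeddings.
    gamma_params / beta_params: (weight, bias) of the hypothetical linear layers.
    """
    # Shuffle speaker embeddings along the batch axis to get e~_spk.
    perm = list(range(len(e_batch)))
    rng.shuffle(perm)
    lam = rng.betavariate(alpha, alpha)  # lambda ~ Beta(alpha, alpha)

    out = []
    for h, e, j in zip(h_batch, e_batch, perm):
        e_tilde = e_batch[j]
        g, g_t = linear(e, *gamma_params), linear(e_tilde, *gamma_params)
        b, b_t = linear(e, *beta_params), linear(e_tilde, *beta_params)
        # Eq. 2 and Eq. 3: convex mix of true and shuffled speaker statistics.
        gamma_mix = [lam * x + (1 - lam) * y for x, y in zip(g, g_t)]
        beta_mix = [lam * x + (1 - lam) * y for x, y in zip(b, b_t)]
        # Eq. 1: normalize h, then apply the mixed gain and bias.
        mu, sigma = layer_stats(h)
        out.append([gm * (x - mu) / sigma + bm
                    for gm, x, bm in zip(gamma_mix, h, beta_mix)])
    return out
```

When every item in the batch carries the same speaker embedding, the shuffle has no effect and the output reduces to ordinary conditional layer normalization, which is a convenient sanity check.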
X-Singer adopts the mixture alignment of PortaSpeech to obtain the actual phoneme boundaries within each note
- A musical score usually maps a syllable or word to a note, so a phoneme-level soft alignment is used inside a note while the note-level hard alignment is kept
- To obtain the mixture alignment, the paper feeds the lyrics representation as key $K$ and value $V$ and the frame-level melody representation as query $Q$ to the phoneme-to-phoneme cross-attention
  1. A relative positional embedding is added before the attention module to push the attention alignment close to diagonal
  2. Additionally, the phoneme-level representation of each note is allowed to attend to both the preceding and succeeding notes, which softens the hardness of the note-level alignment
- A guided attention loss $\mathcal{L}_{ga}$ can be adopted to encourage a close-to-diagonal alignment:
  (Eq. 4) $\mathcal{L}_{ga} = \mathbb{E}_{nt}[A_{nt} W_{nt}]$
  (Eq. 5) $W_{nt} = 1 - e^{-(n/N - t/T)^{2} / 2g^{2}}$
  - $A \in \mathbb{R}^{N \times T}$ : attention matrix
  - $N, T$ : number of lyrics tokens and number of mel-frames, respectively
  - $g = 0.3$
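To make the guided attention loss in Eq. 4-5 concrete, here is a small plain-Python sketch. The function names and the 0-based indexing of $n$ and $t$ are assumptions for illustration; the attention matrix `A` is a nested list of shape $N \times T$.

```python
import math


def guided_attention_weights(N, T, g=0.3):
    """W[n][t] = 1 - exp(-(n/N - t/T)^2 / (2 g^2)), as in Eq. 5."""
    return [[1.0 - math.exp(-((n / N - t / T) ** 2) / (2 * g * g))
             for t in range(T)] for n in range(N)]


def guided_attention_loss(A, g=0.3):
    """Mean of A[n][t] * W[n][t] over all (n, t) pairs, as in Eq. 4."""
    N, T = len(A), len(A[0])
    W = guided_attention_weights(N, T, g)
    total = sum(A[n][t] * W[n][t] for n in range(N) for t in range(T))
    return total / (N * T)


# A perfectly diagonal alignment is not penalized, an anti-diagonal one is.
diagonal = [[1.0 if n == t else 0.0 for t in range(4)] for n in range(4)]
anti_diagonal = [[1.0 if n == 3 - t else 0.0 for t in range(4)] for n in range(4)]
loss_diag = guided_attention_loss(diagonal)
loss_anti = guided_attention_loss(anti_diagonal)
```

The weight $W_{nt}$ is zero on the diagonal and grows toward 1 away from it, so attention mass far from the diagonal is what drives the loss up.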

- CFM-based Decoder

Prior Encoder
- The prior encoder encodes the musical score representation $h_{ms}$ into an aligned hidden representation $h_{align}$
- Here the paper uses Conditional Layer Normalization (CLN) to adapt speaker information:
  (Eq. 6) $\text{CLN}(h, e_{spk}) = \gamma(e_{spk}) \frac{h - \mu}{\sigma} + \beta(e_{spk})$
  - $\gamma, \beta$ : gain and bias for the speaker embedding $e_{spk}$
  - $h$ : hidden representation of the prior encoder
- The aligned hidden representation $h_{align}$ conditions the Conditional Flow Matching (CFM)-based decoder as an averaged acoustic feature similar to a mel-spectrogram
- $h_{align}$ is then regularized with the target mel-spectrogram $x$ as follows:
  (Eq. 7) $\mathcal{L}_{p} = \text{MSE}(h_{align}, x)$
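Eq. 6 and Eq. 7 can be sketched as below. This is only an illustration under assumptions: `gamma` and `beta` are passed in as already-computed vectors standing in for $\gamma(e_{spk})$ and $\beta(e_{spk})$, `eps` is an assumed stabilizer, and $h_{align}$ is taken to have the same frames-by-bins shape as the mel-spectrogram, which the MSE in Eq. 7 implies.

```python
import math


def cln(h, gamma, beta, eps=1e-5):
    """Conditional LayerNorm (Eq. 6) for one hidden vector.

    gamma and beta stand in for the outputs of gamma(e_spk) and beta(e_spk).
    """
    mu = sum(h) / len(h)
    sigma = math.sqrt(sum((x - mu) ** 2 for x in h) / len(h) + eps)
    return [g * (x - mu) / sigma + b for g, x, b in zip(gamma, h, beta)]


def prior_loss(h_align, x):
    """MSE between aligned hidden frames and target mel frames (Eq. 7)."""
    diffs = [(a - b) ** 2
             for h_frame, x_frame in zip(h_align, x)
             for a, b in zip(h_frame, x_frame)]
    return sum(diffs) / len(diffs)
```

Because the normalized vector has zero mean, a constant `beta` shifts the output mean to exactly that constant, which is an easy property to check.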
Decoder
- Following the flow matching of Matcha-TTS, P-Flow, and VoiceBox, the paper adopts a CFM-based decoder that models the conditional vector field $u_{t}$ and generates a flow through an Ordinary Differential Equation (ODE)
  1. First, define the conditional flow $\phi_{t, x_{1}}$ as a simple linear trajectory between the target data $x_{1} \sim q(x)$ and a sample from the prior distribution $x_{0} \sim \mathcal{N}(0, I)$:
    (Eq. 8) $\phi_{t, x_{1}}(x_{0}) = (1 - (1 - \sigma_{min}) t) x_{0} + t x_{1}$
    - $t \in [0, 1]$ : time step of the flow
    - $\sigma_{min}$ : hyperparameter for perturbation with small white noise
  2. The CFM-based decoder $v_{\theta}$ can then be trained with the following objective:
    (Eq. 9) $\mathcal{L}_{cfm} = \mathbb{E}_{t, q(x), p_{0}(x_{0})} \| u_{t}(\phi_{t, x_{1}}(x_{0})) - v_{\theta}(\phi_{t, x_{1}}(x_{0}), h_{align}, e_{spk}, t) \|^{2}$
    - $h_{align}$ : hidden representation obtained from the prior encoder
    - $e_{spk}$ : speaker embedding
  3. The target vector field $u_{t}$ is then obtained as the time derivative of the conditional flow:
    (Eq. 10) $u_{t}(\phi_{t, x_{1}}(x_{0})) = \frac{d}{dt} \phi_{t, x_{1}}(x_{0}) = x_{1} - (1 - \sigma_{min}) x_{0}$
- At inference, an ODE solver integrates the predicted vector field $\frac{d}{dt} \phi_{t, x_{1}}(x_{0}) = v_{\theta}(\phi_{t, x_{1}}(x_{0}), h_{align}, e_{spk}, t)$ from the initial condition $x_{0}$ to obtain $\phi_{1, x_{1}}(x_{0})$
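The pieces in Eq. 8-10 and the inference step can be put together in a short sketch. Several things here are assumptions rather than details from the source: the default `sigma_min=1e-4`, the `v_theta(x, t)` callable signature (the conditioning on $h_{align}$ and $e_{spk}$ is assumed to be closed over inside the callable), and the fixed-step Euler method as the ODE solver, since the source does not name one.

```python
def flow(t, x0, x1, sigma_min=1e-4):
    """phi_{t,x1}(x0) = (1 - (1 - sigma_min) t) x0 + t x1  (Eq. 8)."""
    return [(1 - (1 - sigma_min) * t) * a + t * b for a, b in zip(x0, x1)]


def target_field(x0, x1, sigma_min=1e-4):
    """u_t = x1 - (1 - sigma_min) x0  (Eq. 10); constant in t."""
    return [b - (1 - sigma_min) * a for a, b in zip(x0, x1)]


def cfm_loss(v_theta, x0, x1, t, sigma_min=1e-4):
    """Single-sample squared error between u_t and v_theta (Eq. 9)."""
    x_t = flow(t, x0, x1, sigma_min)
    u = target_field(x0, x1, sigma_min)
    v = v_theta(x_t, t)
    return sum((a - b) ** 2 for a, b in zip(u, v))


def euler_solve(v_theta, x0, n_steps=10):
    """Integrate dx/dt = v_theta(x, t) from t=0 to t=1 with Euler steps."""
    x, dt = list(x0), 1.0 / n_steps
    for i in range(n_steps):
        v = v_theta(x, i * dt)
        x = [a + dt * b for a, b in zip(x, v)]
    return x


# Toy check with an "oracle" field that returns the exact target u_t.
x0 = [0.3, -1.2, 0.7]   # stands in for a sample from N(0, I)
x1 = [0.5, -1.0, 2.0]   # stands in for a target mel frame
oracle = lambda x, t: target_field(x0, x1)
end_point = euler_solve(oracle, x0)
```

With the oracle field the loss is zero and Euler integration lands on $\phi_{1, x_{1}}(x_{0}) = \sigma_{min} x_{0} + x_{1}$, that is, the target plus a small residual of noise. A trained $v_{\theta}$ only approximates this field, so more solver steps or a higher-order solver may be needed in practice.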

- Total Loss

The total loss of X-Singer is therefore:
(Eq. 11) $\mathcal{L} = \mathcal{L}_{p} + \mathcal{L}_{cfm} + \lambda_{ga} \mathcal{L}_{ga}$
- $\lambda_{ga} = 10.0$
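As a quick arithmetic illustration of Eq. 11, the weighted sum can be written as a one-line helper. The function name and the example loss values are hypothetical; only the weight of 10.0 comes from the text.

```python
def total_loss(l_p, l_cfm, l_ga, lambda_ga=10.0):
    """L = L_p + L_cfm + lambda_ga * L_ga (Eq. 11)."""
    return l_p + l_cfm + lambda_ga * l_ga


# Made-up values: the guided attention term is scaled up tenfold.
example = total_loss(l_p=0.5, l_cfm=0.25, l_ga=0.125)
```

With these values the guided attention term contributes 1.25 of the total 2.0, which shows how strongly the weight pushes the alignment toward the diagonal relative to the other two terms.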

3. Experiments

- Settings

Dataset : Multi-Speaker Singing Dataset, M4Singer, Ofuton-P Database
Comparisons : FFTSinger, DiffSinger

- Results

Overall, X-Singer shows the best performance

Ablation Study
- In the ablation study, removing any of the components degrades performance
