[Paper Review] MIDI-Voice: Expressive Zero-Shot Singing Voice Synthesis via MIDI-Driven Priors

feVeRin 2024. 5. 13. 10:29

MIDI-Voice: Expressive Zero-Shot Singing Voice Synthesis via MIDI-Driven Priors

Existing Singing Voice Synthesis models show low synthesis quality because they handle unseen speakers poorly and predict the fundamental frequency inaccurately
MIDI-Voice
- Applies a MIDI-based prior to a score-based diffusion model for better singing voice style adaptation
- In particular, generates a MIDI-driven prior that reflects note information and supports high-quality style adaptation
- Additionally builds a DDSP-based MIDI-style prior for expressive synthesis
Paper (ICASSP 2024) : Paper Link

1. Introduction

Singing Voice Synthesis (SVS) aims to synthesize an expressive, natural singing voice from a musical score
- A typical two-stage SVS model consists of an acoustic model that generates a mel-spectrogram from notes, lyrics, and a speaker ID, and a vocoder that converts the synthesized mel-spectrogram into a waveform
- To improve synthesis quality, Generative Adversarial Network (GAN) or diffusion-based models can be considered
  - High-quality synthesis then requires an accurate prior distribution
  - e.g.) diffusion-based SVS models use a data-driven prior instead of a Gaussian distribution
- Modeling the fundamental frequency $F_0$ is also very important in SVS
  1. The expressiveness of a singing voice is closely tied to the primary components of $F_0$: baseline, microprosody, and vibrato
    - $F_0$ prediction methods from Text-to-Speech (TTS) can be used for this, but $F_0$ prediction in SVS still has limitations
  2. In particular, when $F_0$ is not predicted explicitly, the model generates an inaccurate singing melody for unseen speakers
    - Zero-shot SVS therefore uses the ground-truth pitch at inference time, for singing voice conversion
    - BUT, even with the ground-truth $F_0$, the data-driven prior can still lead to an inaccurate $F_0$

-> The paper therefore proposes MIDI-Voice, a score-based diffusion SVS model for high-quality zero-shot SVS

MIDI-Voice
- Generates the singing voice mel-spectrogram using a Musical Instrument Digital Interface (MIDI)-based prior instead of the conventional data-driven prior
  - This prevents the SVS quality degradation caused by an inaccurate $F_0$
- Additionally adopts Differentiable Digital Signal Processing (DDSP) to obtain a MIDI-style prior that reflects additional information supporting singing style adaptation
- As a result, the MIDI-based prior reflects only note information, not speaker information, which enables robust zero-shot SVS

< Overview of MIDI-Voice >

Introduces a MIDI-based prior into the diffusion model for zero-shot SVS
Uses DDSP to give the MIDI-style prior a better singing voice style transfer capability
As a result, achieves better zero-shot SVS performance than existing methods

2. Method

Zero-shot SVS aims to generate a high-quality singing voice by adapting to a target speaker and a musical score
- MIDI-Voice consists of a style encoder, a condition encoder, a prior generator, and a diffusion-based Mel decoder

- Style Encoder

For zero-shot SVS, the style encoder of Meta-StyleSpeech is used to extract a style vector $\omega$
- The style encoder consists of spectral/temporal processors and a transformer layer with multi-head attention and temporal average pooling
  - The style encoder thus takes a reference mel-spectrogram as input and outputs a style vector
- Since a single speaker does not keep the same singing style across a whole song, the reference mel-spectrogram is randomly sampled during training from a different singing voice of the same speaker
  - This ensures that the style encoder does not change the singing style based merely on the reference mel-spectrogram
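
The reference-sampling rule and the final pooling step can be sketched in plain Python. This is only an illustration under assumptions, not the Meta-StyleSpeech implementation: `sample_reference`, `temporal_average_pool`, and the `clips_by_speaker` dictionary are hypothetical names, and the real encoder applies spectral/temporal processing and attention before pooling.

```python
import random

def sample_reference(clips_by_speaker, speaker, target_clip, rng=random):
    """Pick a reference clip of the same speaker, excluding the target clip."""
    candidates = [c for c in clips_by_speaker[speaker] if c != target_clip]
    if not candidates:
        # Hypothetical fallback: a speaker with one clip reuses the target.
        return target_clip
    return rng.choice(candidates)

def temporal_average_pool(frames):
    """Average frame-level feature vectors over time into one style vector."""
    dim = len(frames[0])
    return [sum(frame[d] for frame in frames) / len(frames) for d in range(dim)]

# Toy data: clip ids stand in for mel-spectrograms.
clips_by_speaker = {"spk1": ["song_a", "song_b", "song_c"]}
reference = sample_reference(clips_by_speaker, "spk1", "song_a", random.Random(0))
style_vector = temporal_average_pool([[1.0, 2.0], [3.0, 4.0]])
```

Drawing the reference from a different clip than the target keeps the style vector from simply copying the content of the utterance being reconstructed.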

- Condition Encoder

The condition encoder consists of three encoders: a text encoder, a note encoder, and an auxiliary encoder
- The text encoder first extracts a linguistic representation from the phonemes of the lyrics
- The note encoder extracts a pitch representation from the phoneme-level note pitch sequence, and the two phoneme-level representations are then added together before the length regulating operation
  - Since the durations are already determined by the musical score, the representation can be expanded to the duration of the target singing voice
- The auxiliary encoder encodes the condition representation $h_{cond}$ from the expanded representation and $\omega$
  1. This condition representation is used as the condition of the diffusion-based Mel decoder
  2. To obtain a condition representation with more accurate pronunciation and pitch information, the following condition loss $\mathcal{L}_c$ is added:
    (Eq. 1) $\mathcal{L}_c = \sum_{i=0}^{T} (h_{cond} - Y)^2$
    - $Y$ : target mel-spectrogram
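
The data flow of the condition encoder (add the phoneme-level representations, expand them by the score durations, compare with the target as in Eq. 1) can be sketched as below. This is a toy illustration under assumptions: the function names are hypothetical, the representations are tiny hand-written vectors, and the auxiliary encoder is replaced by the identity, so `h_cond` is simply the expanded representation.

```python
def add_representations(text_repr, pitch_repr):
    """Element-wise sum of the two phoneme-level representations."""
    return [[a + b for a, b in zip(t, p)] for t, p in zip(text_repr, pitch_repr)]

def length_regulate(phoneme_repr, durations):
    """Repeat each phoneme vector for the number of frames given by the score."""
    frames = []
    for vec, n_frames in zip(phoneme_repr, durations):
        frames.extend(list(vec) for _ in range(n_frames))
    return frames

def condition_loss(h_cond, target_mel):
    """Eq. 1: sum of squared differences between h_cond and the target."""
    return sum(
        (h - y) ** 2
        for h_frame, y_frame in zip(h_cond, target_mel)
        for h, y in zip(h_frame, y_frame)
    )

text_repr = [[1.0, 0.0], [0.0, 1.0]]    # two phonemes, 2-dim features
pitch_repr = [[0.5, 0.5], [0.5, 0.5]]
durations = [2, 3]                      # frames per phoneme from the score

h_cond = length_regulate(add_representations(text_repr, pitch_repr), durations)
target_mel = [[0.0, 0.0] for _ in h_cond]
loss_c = condition_loss(h_cond, target_mel)
```

Because the durations come directly from the score, no duration predictor is needed here. The expansion is a plain repeat.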

- Diffusion Modelling

A diffusion model uses a Markov chain to gradually denoise a prior noise distribution generated from the Gaussian distribution $\mathcal{N}(0, I)$
- Score-based diffusion in particular formulates this denoising process with a Stochastic Differential Equation (SDE)
  - A score-based model can then generate samples from a data-driven prior instead of a prior noise distribution based on Gaussian noise
- MIDI-Voice, in turn, uses a score-based diffusion model with a MIDI-based prior instead of a data-driven prior
MIDI-driven Prior
- Since the choice of the prior distribution is important in zero-shot SVS, the MIDI-driven prior is generated from accurate pitch information that contains no speaker information
  - Conditioning the singing voice representation on this MIDI-driven prior improves the performance of the diffusion decoder that adapts the style
- In particular, using a mel-spectrogram as the data-driven prior in the diffusion model can degrade the adaptation performance of the diffusion-based Mel decoder
  - Therefore the MIDI-driven prior is produced by converting the MIDI file into a waveform with FluidSynth and then applying the STFT
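
The "MIDI to waveform to STFT" pipeline can be illustrated with the standard library alone. Note the substitutions: a plain sine oscillator stands in for FluidSynth, the DFT is written out naively, and the output is a linear magnitude spectrogram with no mel filterbank. All function names, the 8 kHz sample rate, and the frame sizes are assumptions made for this sketch.

```python
import cmath
import math

def midi_to_hz(note):
    """Equal-tempered pitch: MIDI note 69 is A4 = 440 Hz."""
    return 440.0 * 2.0 ** ((note - 69) / 12.0)

def render_notes(notes, sample_rate=8000):
    """Render (midi_pitch, seconds) pairs with a sine oscillator (FluidSynth stand-in)."""
    wave = []
    for pitch, seconds in notes:
        freq = midi_to_hz(pitch)
        n_samples = round(seconds * sample_rate)
        wave.extend(math.sin(2.0 * math.pi * freq * i / sample_rate)
                    for i in range(n_samples))
    return wave

def stft_magnitude(wave, frame_len=256, hop=128):
    """Hann-windowed magnitude STFT computed with a naive DFT."""
    window = [0.5 - 0.5 * math.cos(2.0 * math.pi * i / frame_len)
              for i in range(frame_len)]
    frames = []
    for start in range(0, len(wave) - frame_len + 1, hop):
        seg = [wave[start + i] * window[i] for i in range(frame_len)]
        frames.append([
            abs(sum(seg[i] * cmath.exp(-2j * math.pi * k * i / frame_len)
                    for i in range(frame_len)))
            for k in range(frame_len // 2 + 1)
        ])
    return frames

# One A4 note of 0.1 s: the prior contains pitch but no speaker identity.
prior = stft_magnitude(render_notes([(69, 0.1)]))
peak_bin = max(range(len(prior[0])), key=prior[0].__getitem__)
```

With these settings one frequency bin spans 8000 / 256 = 31.25 Hz, so the 440 Hz note peaks near bin 14. The prior encodes the score's pitch content and nothing about who is singing.
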
MIDI-style Prior
- The MIDI-style prior is generated from the $F_0$ and loudness of the desired singing style for expressive SVS
- It is obtained by feeding the $F_0$ and loudness extracted from a desired singing voice sample into a pre-trained DDSP model
  - Room reverberation is removed from DDSP because the training samples contain no reverb
- As a result, the MIDI-style prior, which is generated with an instrumental sound, reflects a more expressive style than the MIDI-driven prior
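
A rough feel for how frame-level $F_0$ and loudness can drive a synthesizer is given by the additive harmonic oscillator below. It is a heavily simplified stand-in, not the pre-trained DDSP model: real DDSP predicts the harmonic amplitudes with a neural network, whereas this sketch assumes fixed 1/k harmonic weights and treats loudness as a linear gain. The name `harmonic_synth` and all parameter values are hypothetical.

```python
import math

def harmonic_synth(f0_hz, loudness, sample_rate=8000, hop=80, n_harmonics=4):
    """Turn per-frame F0 (Hz) and loudness (linear gain) into waveform samples."""
    wave = []
    phase = 0.0  # accumulated phase keeps the tone continuous when F0 changes
    for f0, gain in zip(f0_hz, loudness):
        for _ in range(hop):
            phase += 2.0 * math.pi * f0 / sample_rate
            sample = 0.0
            for k in range(1, n_harmonics + 1):
                if k * f0 < sample_rate / 2:  # skip harmonics above Nyquist
                    sample += math.sin(k * phase) / k
            wave.append(gain * sample)
    return wave

# Three frames at 220 Hz with rising loudness (silent, half, full).
f0_track = [220.0, 220.0, 220.0]
loudness_track = [0.0, 0.5, 1.0]
style_wave = harmonic_synth(f0_track, loudness_track)
```

Because the controls are taken from a real singing sample, effects such as vibrato in the $F_0$ track carry over into the rendered sound. A fixed MIDI rendering cannot express them.
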
Forward Diffusion
- The forward diffusion process gradually injects noise drawn from the Gaussian distribution $\mathcal{N}(0, I)$ into the data over an infinite time horizon $T$
  - The paper therefore aims to denoise noisy samples drawn from the MIDI-driven prior noise distribution $\mathcal{N}(M_{midi}, I)$, and defines the forward SDE as:
    (Eq. 2) $dY_t = \frac{1}{2}(M_{midi} - Y_t)\beta_t dt + \sqrt{\beta_t}\, dW_t, \quad t \in [0, T]$
    - $M_{midi}$ : MIDI-driven prior / MIDI-style prior, $t$ : continuous time step
    - $\beta_t$ : noise scheduling function, $W_t$ : standard Brownian motion
  - The solution of (Eq. 2) is then:
    (Eq. 3) $Y_t = \left(I - e^{-\frac{1}{2}\int_0^t \beta_s ds}\right) M_{midi} + e^{-\frac{1}{2}\int_0^t \beta_s ds}\, Y_0 + \int_0^t \sqrt{\beta_s}\, e^{-\frac{1}{2}\int_s^t \beta_u du}\, dW_s$
  - By the properties of the Itô integral, the transition density $p(Y_t | Y_0)$ is Gaussian with the following mean and covariance $\lambda(I, t)$:
    (Eq. 4) $p(Y_t | Y_0) = \mathcal{N}\left(\left(I - e^{-\frac{1}{2}\int_0^t \beta_s ds}\right) M_{midi} + e^{-\frac{1}{2}\int_0^t \beta_s ds}\, Y_0,\ \lambda(I, t)\right), \quad \lambda(I, t) = I - e^{-\int_0^t \beta_s ds}$
- Consequently, over an infinite time horizon $Y_t$ converges to $\mathcal{N}(M_{midi}, I)$ regardless of $Y_0$, i.e. the SDE transforms the data distribution into $\mathcal{N}(M_{midi}, I)$
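
Since the transition density is Gaussian, $Y_t$ can be sampled in one shot without simulating the SDE. The sketch below does this element-wise in plain Python. It assumes a linear schedule $\beta_t = \beta_0 + (\beta_1 - \beta_0)t$ on $t \in [0, 1]$ with $\beta_0 = 0.05$ and $\beta_1 = 20$ (values borrowed from common Grad-TTS-style setups, not stated in this post), and uses the variance $1 - e^{-\int_0^t \beta_s ds}$ that follows from applying the Itô isometry to the SDE in (Eq. 2). `forward_sample` and `beta_integral` are hypothetical names.

```python
import math
import random

def beta_integral(t, beta0=0.05, beta1=20.0):
    """Closed-form integral of the assumed linear schedule from 0 to t."""
    return beta0 * t + 0.5 * (beta1 - beta0) * t * t

def forward_sample(y0, m_midi, t, rng):
    """Draw Y_t given Y_0; returns (y_t, mean, variance)."""
    decay = math.exp(-0.5 * beta_integral(t))
    variance = 1.0 - math.exp(-beta_integral(t))
    mean = [(1.0 - decay) * m + decay * y for y, m in zip(y0, m_midi)]
    y_t = [mu + rng.gauss(0.0, math.sqrt(variance)) for mu in mean]
    return y_t, mean, variance

rng = random.Random(0)
y0 = [1.0, -2.0]        # toy "mel" values
m_midi = [0.5, 0.5]     # toy prior values

y_start, mean_start, var_start = forward_sample(y0, m_midi, 0.0, rng)
y_end, mean_end, var_end = forward_sample(y0, m_midi, 1.0, rng)
```

At $t = 0$ the sample equals the data. At $t = 1$ the mean has moved almost entirely to `m_midi` and the variance is close to 1, which is the convergence to $\mathcal{N}(M_{midi}, I)$ stated above.
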
Reverse Diffusion
- The reverse diffusion process gradually denoises from noise to a data sample
  1. The SDE for the reverse diffusion is:
    (Eq. 5) $dY_t = \left(\frac{1}{2}(M_{midi} - Y_t) - \nabla \log p_t(Y_t)\right)\beta_t dt + \sqrt{\beta_t}\, d\tilde{W}_t$
    - $p_t$ : probability density function of the random variable $Y_t$
    - $\tilde{W}_t$ : reverse-time Brownian motion
  2. Alternatively, the following ordinary differential equation can be considered:
    (Eq. 6) $dY_t = \frac{1}{2}\left((M_{midi} - Y_t) - \nabla \log p_t(Y_t)\right)\beta_t dt, \quad t \in [0, T]$
- As a result, $Y_0$ can be generated from $Y_t$ by solving the reverse-time equation
  - That is, MIDI-Voice generates $Y_0$ from $Y_t$ sampled from $\mathcal{N}(M_{midi}, I)$
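
To see the ODE of (Eq. 6) in action, the sketch below integrates it backward in time with explicit Euler steps on a 1-D toy problem. Everything beyond the ODE itself is an assumption made for illustration: the data distribution is a Gaussian $\mathcal{N}(2, 0.5^2)$ so that the score $\nabla \log p_t$ is known in closed form (standing in for the learned network), the schedule is linear with $\beta_0 = 0.05$, $\beta_1 = 20$ on $t \in [0, 1]$, and `toy_score` and `reverse_ode` are hypothetical names.

```python
import math

BETA0, BETA1 = 0.05, 20.0  # assumed linear noise schedule

def beta(t):
    return BETA0 + (BETA1 - BETA0) * t

def beta_integral(t):
    return BETA0 * t + 0.5 * (BETA1 - BETA0) * t * t

def toy_score(y, t, m_midi, data_mean, data_std):
    """Exact score of p_t when the data distribution is a 1-D Gaussian."""
    decay = math.exp(-0.5 * beta_integral(t))
    mean_t = (1.0 - decay) * m_midi + decay * data_mean
    var_t = decay ** 2 * data_std ** 2 + (1.0 - decay ** 2)
    return -(y - mean_t) / var_t

def reverse_ode(y_T, m_midi, score_fn, n_steps=1000):
    """Explicit Euler integration of Eq. 6 from t = 1 down to t = 0."""
    h = 1.0 / n_steps
    y = y_T
    for i in range(n_steps):
        t = 1.0 - i * h
        drift = 0.5 * ((m_midi - y) - score_fn(y, t)) * beta(t)
        y -= h * drift  # stepping backward in time
    return y

m_midi, data_mean, data_std = 0.0, 2.0, 0.5

def score(y, t):
    return toy_score(y, t, m_midi, data_mean, data_std)

# A prior sample one standard deviation above M_midi ...
y0_hat = reverse_ode(m_midi + 1.0, m_midi, score)
# ... and a prior sample exactly at M_midi.
y0_center = reverse_ode(m_midi, m_midi, score)
```

The flow is deterministic and preserves how many standard deviations a sample lies from the mean. The first result therefore lands near `data_mean + data_std` and the second near `data_mean`.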

- Training

MIDI-Voice is trained by taking an expectation over the estimated gradient of the log-density of the noisy data
- The loss function that estimates the log-density gradient of data $Y_0$ corrupted by the noise accumulated up to time $t$ is:
  (Eq. 7) $\mathcal{L}_{diff} = \mathbb{E}_{\epsilon_t}\left[\left\| s_\theta(Y_t, M_{midi}, h_{cond}, \omega, t) + \lambda(I, t)^{-1}\epsilon_t \right\|\right]$
  - $\omega$ : style vector, $\epsilon_t \sim \mathcal{N}(0, \lambda(I, t))$
  - $s_\theta$ : noise estimation network
- Finally, MIDI-Voice jointly optimizes the noise estimator and the condition encoder:
  (Eq. 8) $\mathcal{L} = \mathcal{L}_{diff} + \mathcal{L}_c$
  - $\mathcal{L}_{diff}$ : diffusion loss of (Eq. 7), $\mathcal{L}_c$ : condition encoder loss of (Eq. 1)
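
A single-sample estimate of the term inside the expectation of (Eq. 7) can be written in a few lines. The sketch rests on several assumptions: a linear schedule with $\beta_0 = 0.05$, $\beta_1 = 20$ on $t \in [0, 1]$, a scalar variance $\lambda_t = 1 - e^{-\int_0^t \beta_s ds}$ applied element-wise, and a hypothetical `score_fn(y_t, t)` that omits the $M_{midi}$, $h_{cond}$, and $\omega$ inputs of the real network. The "oracle" score is available only because the toy data is a single known point.

```python
import math
import random

BETA0, BETA1 = 0.05, 20.0  # assumed linear noise schedule

def transition_stats(t):
    """Return (decay, lam): mean decay factor and variance lambda_t."""
    b = BETA0 * t + 0.5 * (BETA1 - BETA0) * t * t
    return math.exp(-0.5 * b), 1.0 - math.exp(-b)

def diffusion_loss(score_fn, y0, m_midi, t, rng):
    """One-sample estimate of || s(Y_t, t) + eps_t / lambda_t ||."""
    decay, lam = transition_stats(t)
    eps = [rng.gauss(0.0, math.sqrt(lam)) for _ in y0]
    y_t = [(1.0 - decay) * m + decay * y + e
           for y, m, e in zip(y0, m_midi, eps)]
    pred = score_fn(y_t, t)
    return math.sqrt(sum((s + e / lam) ** 2 for s, e in zip(pred, eps)))

def total_loss(l_diff, l_cond):
    """Eq. 8: unweighted sum of the two losses."""
    return l_diff + l_cond

y0 = [0.3, -1.2, 0.8]
m_midi = [0.0, -1.0, 1.0]

def oracle_score(y_t, t):
    """Exact score of p(Y_t | Y_0) for the known toy data point."""
    decay, lam = transition_stats(t)
    return [-(v - ((1.0 - decay) * m + decay * y)) / lam
            for v, y, m in zip(y_t, y0, m_midi)]

def zero_score(y_t, t):
    return [0.0 for _ in y_t]

loss_oracle = diffusion_loss(oracle_score, y0, m_midi, 0.5, random.Random(0))
loss_zero = diffusion_loss(zero_score, y0, m_midi, 0.5, random.Random(0))
```

The oracle drives the loss to numerical zero while an uninformative score does not. This is the sense in which minimizing (Eq. 7) teaches $s_\theta$ the log-density gradient. In practice the expectation would also be averaged over random $t$ and over a batch.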

3. Experiments

- Settings

Dataset : Guide Vocal Dataset
Comparisons : VISinger, Grad-TTS

- Results

First, for seen speakers, MIDI-Voice shows the best synthesis performance
- In particular, using the MIDI-based prior instead of the data-driven prior greatly improves the performance of the diffusion model

MIDI-Voice also performs best in the zero-shot test on unseen speakers
- Here MIDI-Voice accurately reflects the $F_0$ of unseen speakers

In terms of the ablation study
- The data-driven prior is weak at reflecting the style of unseen speakers, whereas the MIDI-based prior contains note information and can therefore improve the adaptation of the diffusion model for zero-shot SVS
- Comparing adaptation performance across the number of iteration steps of the diffusion process
  1. With the data-driven prior, adaptation remains limited even when the number of iteration steps is increased, because the prior already contains a large amount of data
  2. With the MIDI-driven prior, in contrast, increasing the number of iteration steps improves adaptation performance
