[Paper Review] Diff-HierVC: Diffusion-based Hierarchical Voice Conversion with Robust Pitch Generation and Masked Prior for Zero-Shot Speaker Adaptation

feVeRin 2024. 10. 20. 10:29

Diff-HierVC: Diffusion-based Hierarchical Voice Conversion with Robust Pitch Generation and Masked Prior for Zero-Shot Speaker Adaptation

Voice conversion still suffers from inaccurate pitch and weak speaker adaptation
Diff-HierVC
- A hierarchical voice conversion model built on two diffusion models
  - DiffPitch is introduced to generate $F_0$ effectively in the target voice style,
  - and the generated $F_0$ is then passed to DiffVoice, which converts the speech into the target voice style
- A source-filter encoder disentangles the speech, and the resulting mel-spectrogram is adopted as a data-driven prior in DiffVoice to improve style transfer capacity
- In addition, a masked prior is introduced into the diffusion model to improve speaker adaptation
Paper (INTERSPEECH 2023) : Paper Link

1. Introduction

Voice Conversion (VC) aims to convert a source speaker's voice into a target speaker's voice
- Effective VC requires disentangling the individual components of speech, as in AutoVC, so that each component can be controlled and converted to the target speaker's voice
  - BUT, the converted voice still suffers from mispronunciation and low speaker adaptation
- Meanwhile, pitch modeling is essential for improving speech intelligibility and naturalness
  1. In text-to-speech in particular, as in FastPitch, pitch characteristics strongly affect speaker identity and correct pronunciation
  2. VC can also use the normalized fundamental frequency $F_0$ to improve expressiveness, but $F_0$ is not entirely separated from speaker style, which causes perceptual unnaturalness
    - A Vector-Quantized Variational AutoEncoder (VQ-VAE) can learn a speaker-irrelevant pitch representation to address this, but vector quantization loses pitch information
- Recent models such as DiffVC and DDDM-VC perform expressive VC based on diffusion models
  - The same diffusion process can likewise be considered for generating $F_0$

-> Diff-HierVC is therefore proposed, improving pitch controllability in VC through a hierarchical diffusion model

Diff-HierVC
- Uses two diffusion models, DiffPitch and DiffVoice, to convert voice style hierarchically from disentangled speech representations
  1. DiffPitch generates the target speaker's pitch information at inference time,
  2. DiffVoice generates the mel-spectrogram from the generated pitch information and the source-filter representation
- In addition, a data-driven prior regulates the denoising process of the diffusion models, and it is extended to a masked prior to improve generalizability

< Overview of Diff-HierVC >

A hierarchical diffusion-based VC model is built for robust pitch generation, and a masked prior is introduced for expressive zero-shot VC
As a result, it achieves better conversion performance than existing models

2. Method

- Speech Disentanglement

Diff-HierVC analyzes speech into content, pitch, and style representations:
1. Content Representation
  - Data perturbation is first applied to the input waveform to remove content-irrelevant information
  - Content features are then extracted from an intermediate layer of XLS-R, a self-supervised model pre-trained on a large-scale cross-lingual speech dataset
2. Speech Style
  - The style encoder of Meta-StyleSpeech extracts a speaker style representation from the mel-spectrogram
  - This style embedding guides the content encoder and the pitch encoder
3. Pitch Representation
  - To extract accurate pitch, the YAAPT algorithm is applied at a resolution 4 times higher than that of the mel-spectrogram to obtain the fundamental frequency $F_0$
  - The content encoder receives $\log(F_0 + 1)$, while the pitch encoder receives $F_0$ normalized by the mean and variance of the source speaker's $F_0$
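
The two pitch features described above can be sketched in a few lines. This is a minimal illustration in plain Python, not the authors' preprocessing code: the function name `f0_features` is hypothetical, and it assumes that unvoiced frames are marked with 0 Hz and that the mean and variance are computed over voiced frames only.

```python
import math

def f0_features(f0_hz):
    """Return (log_f0, norm_f0) for a list of per-frame F0 values in Hz.

    Assumption: unvoiced frames are 0 Hz, stay at 0 in the normalized
    contour, and are excluded from the mean/variance statistics.
    """
    # log(F0 + 1) keeps unvoiced frames finite, since log(1) = 0.
    log_f0 = [math.log(f + 1.0) for f in f0_hz]

    voiced = [f for f in f0_hz if f > 0.0]
    if not voiced:
        return log_f0, [0.0] * len(f0_hz)

    mean = sum(voiced) / len(voiced)
    var = sum((f - mean) ** 2 for f in voiced) / len(voiced)
    std = math.sqrt(var) or 1.0  # guard against a perfectly flat contour

    norm_f0 = [(f - mean) / std if f > 0.0 else 0.0 for f in f0_hz]
    return log_f0, norm_f0


# Toy contour: two unvoiced frames around three voiced ones.
log_f0, norm_f0 = f0_features([0.0, 100.0, 120.0, 140.0, 0.0])
```

Normalizing by the source speaker's statistics removes the speaker-dependent pitch level and range, so the pitch encoder sees mostly the shape of the contour.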

- Hierarchical VC

For hierarchical VC, the paper introduces a two-stage diffusion model consisting of DiffPitch and DiffVoice
- DiffPitch converts $F_0$ to the target voice style,
- DiffVoice uses the converted $F_0$ to convert the speech hierarchically into the target voice style
DiffPitch
- DiffPitch is a pitch generator based on a diffusion process
  - Structurally, DiffPitch uses DiffWave, a WaveNet-based conditional diffusion model that obtains a large receptive field with a single denoiser
- First, the pitch encoder converts the normalized $F_0$ of the source speech into a pitch representation $Z_p$
  1. To use $Z_p$ as the data-driven prior of DiffPitch, the pitch representation is regularized with a pitch reconstruction loss:
    (Eq. 1) $\mathcal{L}_{pitch} = ||X_p - Z_p||_1$
  2. The forward process of DiffPitch is:
    (Eq. 2) $dX_{p,t} = \frac{1}{2}\beta_t (Z_p - X_{p,t})\,dt + \sqrt{\beta_t}\,d\mathbf{w}_t$
    - The diffusion process of DiffPitch uses the log-scale $F_0$ extracted by the YAAPT algorithm as the ground-truth target $X_p$
    - $t \in [0, 1]$, $\beta_t$ : noise schedule, $\mathbf{w}_t$ : forward standard Wiener process
  3. In the reverse process, DiffPitch denoises to recover the original pitch contour:
    (Eq. 3) $d\hat{X}_{p,t} = \left(\frac{1}{2}(Z_p - \hat{X}_{p,t}) - s_{\theta_p}(\hat{X}_{p,t}, Z_p, t)\right)\beta_t\,dt + \sqrt{\beta_t}\,d\bar{\mathbf{w}}_t$
    - $\bar{\mathbf{w}}_t$ : backward standard Wiener process
  4. Following DiffVC, a noisy pitch sample in the forward process is drawn from the following distribution:
    (Eq. 4) $p_{t|0}(X_{p,t}|X_{p,0}) = \mathcal{N}\left(e^{-\frac{1}{2}\int_0^t \beta_s ds} X_{p,0} + \left(1 - e^{-\frac{1}{2}\int_0^t \beta_s ds}\right) Z_p,\ \left(1 - e^{-\int_0^t \beta_s ds}\right) I\right)$
    - $I$ : identity matrix
  5. Since (Eq. 4) is Gaussian, its score has a closed form:
    (Eq. 5) $\nabla \log p_{t|0}(X_{p,t}|X_{p,0}) = -\frac{X_{p,t} - X_{p,0}\left(e^{-\frac{1}{2}\int_0^t \beta_s ds}\right) - Z_p\left(1 - e^{-\frac{1}{2}\int_0^t \beta_s ds}\right)}{1 - e^{-\int_0^t \beta_s ds}}$
  6. As a result, DiffPitch approximates the score function with the following denoising objective:
    (Eq. 6)
    - $s_{\theta_p}$ : pitch score estimator, $\lambda_t = 1 - e^{-\int_0^t \beta_s ds}$
- In addition, the paper uses the ML-SDE solver, a reverse SDE solver that maximizes the log-likelihood of the forward diffusion, to enable fast sampling
- At inference, the $F_0$ converted by the pitch encoder is used as the prior of DiffPitch, and DiffPitch generates a refined $F_0$ conditioned on the target voice style $s$
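
To make the training side of DiffPitch concrete, the sketch below draws a noisy sample from (Eq. 4), evaluates the closed-form score of (Eq. 5), and forms a $\lambda_t$-weighted squared error between a score estimator and that target. Several things here are assumptions rather than details from the paper: the linear schedule endpoints `BETA_0` and `BETA_1`, the helper names, and the loss form itself, since (Eq. 6) is not reproduced in the text above and the code assumes the usual weighted denoising score matching used in DiffVC-style models. It is written in plain Python so it runs without dependencies.

```python
import math
import random

BETA_0, BETA_1 = 0.05, 20.0  # assumed linear schedule endpoints, not from the post

def beta_integral(t):
    """Closed form of int_0^t beta_s ds for beta_s = BETA_0 + (BETA_1 - BETA_0) * s."""
    return BETA_0 * t + 0.5 * (BETA_1 - BETA_0) * t * t

def perturb(x0, z, t, rng):
    """Sample X_t ~ p_{t|0}(. | X_0) as in Eq. 4; also return the noise and lambda_t."""
    a = math.exp(-0.5 * beta_integral(t))    # weight on the clean sample
    lam = 1.0 - math.exp(-beta_integral(t))  # lambda_t, the variance
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [a * x + (1.0 - a) * zi + math.sqrt(lam) * e
          for x, zi, e in zip(x0, z, eps)]
    return xt, eps, lam

def target_score(xt, x0, z, t):
    """Gradient of log p_{t|0}(X_t | X_0), following Eq. 5."""
    a = math.exp(-0.5 * beta_integral(t))
    lam = 1.0 - math.exp(-beta_integral(t))
    return [-(xi - a * x - (1.0 - a) * zi) / lam
            for xi, x, zi in zip(xt, x0, z)]

def dsm_loss(score_fn, x0, z, t, rng):
    """One-sample estimate of lambda_t * ||s_theta(X_t, Z, t) - target score||^2."""
    xt, _, lam = perturb(x0, z, t, rng)
    pred = score_fn(xt, z, t)
    tgt = target_score(xt, x0, z, t)
    return lam * sum((p - g) ** 2 for p, g in zip(pred, tgt))


rng = random.Random(0)
x0 = [5.0, 5.1, 5.2]  # toy log-F0 frames (ground truth X_p)
z = [4.8, 5.0, 5.3]   # toy prior Z_p from the pitch encoder
xt, eps, lam = perturb(x0, z, 0.5, rng)
score = target_score(xt, x0, z, 0.5)
```

Because $X_{p,t}$ equals the mean of (Eq. 4) plus $\sqrt{\lambda_t}\,\epsilon$, the target score reduces to $-\epsilon / \sqrt{\lambda_t}$, which is why weighting the squared error by $\lambda_t$ keeps the loss on a comparable scale across $t$.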
DiffVoice
- DiffVoice is a conditional diffusion model that performs high-quality speech synthesis from the content, the target $F_0$, and the target voice style
  - In addition, the paper introduces a data-driven prior into the diffusion model to guide the starting point of the diffusion process
- First, following source-filter theory, the speech components are disentangled into pitch and content representations
  1. For the data-driven prior, the source-filter encoder, which consists of a source encoder $E_{src}$ and a filter encoder $E_{ftr}$:
    - reconstructs an intermediate mel-spectrogram $Z_m$ from the disentangled speech representations as $Z_m = Z_{src} + Z_{ftr}$
    - $Z_{src} = E_{src}(F_0, s),\ Z_{ftr} = E_{ftr}(\text{content}, s)$, $s$ : style embedding
  2. The mel-spectrogram $Z_m$ is then regularized as follows:
    (Eq. 7) $\mathcal{L}_{rec} = ||X_{mel} - Z_m||_1$
    - $X_{mel}$ : mel-spectrogram of the ground-truth speech
- DiffVoice uses the source-filter encoder output $Z_m$ as the prior and is conditioned on the speaker representation $s$, which improves speaker adaptation capacity
  1. The forward process of DiffVoice is:
    (Eq. 8) $dX_{m,t} = \frac{1}{2}\beta_t (Z_m - X_{m,t})\,dt + \sqrt{\beta_t}\,d\mathbf{w}_t$
  2. The reverse process of DiffVoice is:
    (Eq. 9) $d\hat{X}_{m,t} = \left(\frac{1}{2}(Z_m - \hat{X}_{m,t}) - s_{\theta_m}(\hat{X}_{m,t}, Z_m, t)\right)\beta_t\,dt + \sqrt{\beta_t}\,d\bar{\mathbf{w}}_t$
  3. Since the noisy mel-spectrogram $X_{m,t}$ in the forward process is obtained in the same way as in (Eq. 4), the score matching loss for optimizing the mel-spectrogram noise estimation network $s_{\theta_m}$ is:
    (Eq. 10)
- At inference, the source-filter encoder takes the content representation from the source speech, the target voice style $s$, and the $F_0$ converted by DiffPitch
  - As a result, the mel-spectrogram $Z_m$ converted by the source-filter encoder is used as the data-driven prior, and DiffVoice generates converted speech conditioned on the target voice style
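
The reverse processes in (Eq. 3) and (Eq. 9) share the same form, so a single sampler covers both. The sketch below integrates that reverse SDE from $t = 1$ to $t = 0$ with plain Euler-Maruyama steps, starting from noise centered on the prior. It is only an illustration under stated assumptions: the paper uses the ML-SDE solver rather than Euler-Maruyama, the linear schedule endpoints are assumed, and the trained score network is replaced by a hypothetical `oracle_score` that is exact only when the data distribution is a single known point.

```python
import math
import random

BETA_0, BETA_1 = 0.05, 20.0  # assumed linear noise schedule

def beta(t):
    return BETA_0 + (BETA_1 - BETA_0) * t

def beta_integral(t):
    return BETA_0 * t + 0.5 * (BETA_1 - BETA_0) * t * t

def oracle_score(xt, z, t, x0):
    """Exact score (Eq. 5) when the data distribution is the single point x0."""
    a = math.exp(-0.5 * beta_integral(t))
    lam = 1.0 - math.exp(-beta_integral(t))
    return [-(xi - a * x - (1.0 - a) * zi) / lam
            for xi, x, zi in zip(xt, x0, z)]

def reverse_sample(score_fn, z, n_steps, rng):
    """Integrate the reverse SDE (Eq. 3 / Eq. 9) from t=1 to t=0 (Euler-Maruyama)."""
    h = 1.0 / n_steps
    # Start from the data-driven prior: X_1 ~ N(Z, I), close to Eq. 4 at t = 1.
    x = [zi + rng.gauss(0.0, 1.0) for zi in z]
    for i in range(n_steps, 0, -1):
        t = i * h
        s = score_fn(x, z, t)
        b = beta(t)
        # Stepping backward in time flips the sign of the drift term.
        x = [xi - (0.5 * (zi - xi) - si) * b * h
             + math.sqrt(b * h) * rng.gauss(0.0, 1.0)
             for xi, zi, si in zip(x, z, s)]
    return x


rng = random.Random(0)
x0 = [math.sin(0.3 * k) for k in range(16)]  # toy "clean" target
z = [0.8 * v for v in x0]                     # toy prior, deliberately imperfect
x_hat = reverse_sample(lambda x, z_, t: oracle_score(x, z_, t, x0), z, 500, rng)
mean_abs_err = sum(abs(a - b) for a, b in zip(x_hat, x0)) / len(x0)
```

In this toy setting the sampler starts near the imperfect prior and ends close to the clean target, which is the role the data-driven prior plays for DiffPitch and DiffVoice: the reverse process only has to refine a reasonable starting point rather than generate from pure noise.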

- Denoising Models with Masked Prior

The data-driven prior above can greatly improve VC performance, but DiffVoice then depends on the mel-spectrogram reconstructed by the source-filter encoder
- To improve the generalizability of DiffVoice, the paper therefore introduces a masked prior into the diffusion model
  - Specifically, the prior $Z_m$ is masked before being passed to DiffVoice, and the diffusion network jointly learns the reconstruction and denoising processes
- As a result, Diff-HierVC can reconstruct the masked area from the surrounding context
  - Here the paper applies frequency masking, interpreting the continuous pitch information in terms of its context
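
The masking step can be illustrated as follows. The exact scheme is not spelled out in this post, so the sketch makes its own assumptions: a single contiguous band of mel bins is chosen at random, the band is filled with zeros, and the function name `mask_prior_frequency` is hypothetical. The 30% ratio in the usage lines is the value reported as best in the experiments.

```python
import random

def mask_prior_frequency(z_m, ratio, rng):
    """Zero out one random contiguous band of mel bins in the prior Z_m.

    z_m is a list of frames, each a list of mel-bin values (time x frequency).
    Assumption: one contiguous band covering `ratio` of the bins is masked
    with zeros; the paper's exact masking scheme may differ.
    """
    n_bins = len(z_m[0])
    width = int(round(ratio * n_bins))
    start = rng.randint(0, n_bins - width)  # randint bounds are inclusive
    masked = [
        [0.0 if start <= k < start + width else v for k, v in enumerate(frame)]
        for frame in z_m
    ]
    return masked, (start, start + width)


z_m = [[1.0] * 80 for _ in range(4)]  # toy prior: 4 frames x 80 mel bins
masked, band = mask_prior_frequency(z_m, 0.3, random.Random(0))
```

During training the masked prior would replace $Z_m$ as the input to DiffVoice, so the score network has to fill in the missing band from the surrounding bins while it denoises.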

3. Experiments

- Settings

Dataset : LibriTTS, VCTK
Comparisons : AutoVC, DiffVC, VoiceMixer, Speech Resynthesis (SR)

- Results

Analysis on $F_0$ Prediction
- Three approaches are compared: $F_0$ transformation via $F_0$ statistics, $F_0$ prediction via WaveNet, and diffusion-based $F_0$ prediction via DiffPitch
- DiffPitch with 30 iteration steps yields the $F_0$ contour closest to the ground truth

$F_0$ Reconstruction

In addition, DiffPitch produces diverse pitch contours for a target voice style
- Diff-HierVC therefore applies DiffPitch during the VC task to convert $F_0$

$F_0$ Generation for Target Speech

Zero-Shot VC
- Overall, Diff-HierVC shows the best VC performance

It also performs well in cross-lingual VC on unseen languages

Ablation Study
- Removing any individual component degrades performance

Using the masked prior gives better generalization than the data-driven prior alone, but an appropriate masking ratio must be chosen
- Empirically, a masking ratio of 30% works best
