[Paper Review] DreamVoice: Text-Guided Voice Conversion

feVeRin 2024. 8. 31. 08:48

DreamVoice: Text-Guided Voice Conversion

Text-guided generation makes it possible to synthesize speech that matches user needs
DreamVoice
- Provides DreamVC for end-to-end diffusion-based text-guided voice conversion and DreamVG for text-to-voice generation
- Additionally builds the DreamVoiceDB dataset, which contains voice timbre annotations for VCTK and LibriTTS
Paper (INTERSPEECH 2024) : Paper Link

1. Introduction

Voice Conversion (VC) needs a robust, accessible representation of the target voice during training/inference
- Existing one-shot VC extracts this representation with a pre-trained speaker embedding, but a target recording or embedding is not always accessible
- Recent work therefore uses text-guided control or conditioning to make VC more flexible
- BUT, data for text-based methods is hard to collect, and quality depends heavily on the text annotations
  - For example, PromptTTS++ proposed a keyword-based marking strategy for the LibriTTS dataset, but did not release the annotated data
  - PromptVC relies on an internal dataset for style and timbre

-> Hence the paper proposes DreamVoiceDB, an open-source dataset, and DreamVoice for text-guided generation

DreamVoice
- Presents two models, one for text-guided Voice Generation and one for Voice Conversion
  1. DreamVC
    - A VC model based on a Diffusion Probabilistic Model (DPM) and Classifier-Free Guidance (CFG)
  2. DreamVG
    - A text-to-voice generation model that uses DPM and CFG to generate a speaker embedding and can be plugged into a one-shot VC model
- Additionally provides DreamVoiceDB, an open-source dataset obtained by annotating 900 speakers sampled from the LibriTTS and VCTK datasets

< Overview of DreamVoice >

Introduces DPM and CFG for text-guided generation, and additionally uses the open-source DreamVoiceDB dataset
As a result, it achieves better synthesis quality than existing methods

2. Method

- General Voice Conversion Pipeline

Most VC models take the content of the source speech, mix it with the target speaker timbre, and then generate the converted voice
- To capture rich content information, latent features extracted from a large pre-trained Speech Language Model (SLM) can be used as the content embedding of the source speech
- In one-shot VC, a pre-trained speaker verification model is used to extract the target speaker embedding
  - The model is then conditioned on the content embedding and the speaker embedding to synthesize speech in the target speaker's timbre
- Architecturally, either a Generative Adversarial Network (GAN) such as StarGAN-VC or a diffusion model is used
  - Diffusion models show better synthesis quality than GANs

(a) DreamVC (b) DreamVG (c) Plugin Strategy

- Diffusion Models and Classifier-Free Guidance

A Diffusion Probabilistic Model (DPM) consists of a forward process and a backward process
- The forward process gradually adds Gaussian noise to the data according to a schedule $\beta_{1},...,\beta_{T}$:
  (Eq. 1) $q(x_{1:T}|x_{0}):=\prod_{t=1}^{T}q(x_{t}|x_{t-1})$
  (Eq. 2) $q(x_{t}|x_{t-1}):=\mathcal{N}\left(x_{t};\sqrt{1-\beta_{t}}x_{t-1},\beta_{t}I\right)$
- The forward process admits the following closed form, so $x_{t}$ at an arbitrary timestep $t$ can be sampled directly from $x_{0}$:
  (Eq. 3) $q(x_{t}|x_{0}):=\mathcal{N}\left(x_{t};\sqrt{\bar{\alpha}_{t}}x_{0},(1-\bar{\alpha}_{t})I\right)$
  - This is equivalent to:
  (Eq. 4) $x_{t}:=\sqrt{\bar{\alpha}_{t}}x_{0}+\sqrt{1-\bar{\alpha}_{t}}\epsilon, \,\,\, \text{where}\,\, \epsilon\sim\mathcal{N}(0,I)$
  - $\alpha_{t}:=1-\beta_{t}, \bar{\alpha}_{t}:=\prod_{s=1}^{t}\alpha_{s}$
- The backward process iteratively recovers information to generate new data from random Gaussian noise
  1. If the noise variance $\beta_{t}$ at each timestep is small enough, each reverse step is also approximately Gaussian, which allows gradual denoising:
    (Eq. 5) $p_{\theta}(x_{0:T}):=p(x_{T})\prod_{t=1}^{T}p_{\theta}(x_{t-1}|x_{t})$
    (Eq. 6) $p_{\theta}(x_{t-1}|x_{t}):=\mathcal{N}\left(x_{t-1};\tilde{\mu}_{t},\tilde{\beta}_{t}I\right)$
    - The variance $\tilde{\beta}_{t}$ is computed from the forward process posterior as $\tilde{\beta}_{t}:=\frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_{t}}\beta_{t}$
  2. The paper applies a fixed noise schedule to $\beta_{t},\alpha_{t}$ and has the neural network predict the velocity $v_{t}$ instead of the noise $\epsilon$:
    (Eq. 7) $v_{t}:=\sqrt{\bar{\alpha}_{t}}\epsilon-\sqrt{1-\bar{\alpha}_{t}}x_{0}$
  3. From (Eq. 4) and (Eq. 7), the backward process is carried out as follows:
    (Eq. 8) $x_{0}:=\sqrt{\bar{\alpha}_{t}}x_{t}-\sqrt{1-\bar{\alpha}_{t}}v_{t}$
    (Eq. 9) $\tilde{\mu}_{t}:=\frac{\sqrt{\bar{\alpha}_{t-1}}\beta_{t}}{1-\bar{\alpha}_{t}}x_{0}+ \frac{\sqrt{\alpha_{t}}(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_{t}}x_{t}$
- Classifier-Free Guidance (CFG) is used to steer the sampling process of a diffusion model
  1. That is, CFG modifies the model output $v$ during sampling as follows:
    (Eq. 10) $v_{cfg}=v_{neg}+w(v_{pos}-v_{neg})$
    - $w$ : guidance scale, $v_{cfg}$ : classifier-free guided velocity
    - $v_{pos}, v_{neg}$ : model outputs under the positive/negative condition
  2. Additionally, to improve the effectiveness of $v_{cfg}$ and mitigate the overexposure caused by a large $w$, rescaling is applied:
    (Eq. 11) $v_{re}=v_{cfg}\cdot\frac{\text{std}(v_{pos})}{\text{std}(v_{cfg})}$
    (Eq. 12) $v'_{cfg}=\phi\cdot v_{re}+(1-\phi)\cdot v_{cfg}$
    - $\phi$ : hyperparameter that controls the rescale strength
    - $v'_{cfg}$ : rescaled CFG velocity used for diffusion sampling
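
The equations above can be checked numerically with a minimal sketch. This is not the paper's implementation: the function names, the plain-list vectors, the linear $\beta$ schedule in the demo, and the use of the population standard deviation for $\text{std}(\cdot)$ are all assumptions made for illustration. The model outputs $v_{pos}, v_{neg}$ are passed in as plain lists rather than produced by a network.

```python
import math
import random
import statistics


def make_alpha_bars(betas):
    """alpha_bar_t = prod_{s<=t} (1 - beta_s), see the definition under Eq. 4."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def q_sample(x0, eps, alpha_bar):
    """Eq. 4: x_t = sqrt(alpha_bar) * x0 + sqrt(1 - alpha_bar) * eps."""
    a, b = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [a * x + b * e for x, e in zip(x0, eps)]


def velocity(x0, eps, alpha_bar):
    """Eq. 7: v_t = sqrt(alpha_bar) * eps - sqrt(1 - alpha_bar) * x0."""
    a, b = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [a * e - b * x for x, e in zip(x0, eps)]


def predict_x0(xt, v, alpha_bar):
    """Eq. 8: x0 = sqrt(alpha_bar) * x_t - sqrt(1 - alpha_bar) * v_t."""
    a, b = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [a * x - b * vi for x, vi in zip(xt, v)]


def posterior_mean(x0, xt, beta, alpha_bar, alpha_bar_prev):
    """Eq. 9: mean of the reverse step, with alpha_t = 1 - beta_t."""
    c0 = math.sqrt(alpha_bar_prev) * beta / (1.0 - alpha_bar)
    ct = math.sqrt(1.0 - beta) * (1.0 - alpha_bar_prev) / (1.0 - alpha_bar)
    return [c0 * a + ct * b for a, b in zip(x0, xt)]


def cfg_rescaled(v_pos, v_neg, w, phi):
    """Eq. 10-12: guided velocity, std-matched copy, then a phi-weighted blend."""
    v_cfg = [n + w * (p - n) for p, n in zip(v_pos, v_neg)]
    # Assumption: population std over all elements of the vector.
    scale = statistics.pstdev(v_pos) / statistics.pstdev(v_cfg)
    return [phi * (scale * v) + (1.0 - phi) * v for v in v_cfg]


# Demo with an assumed linear schedule (values are illustrative only).
rng = random.Random(0)
betas = [1e-4 + (0.02 - 1e-4) * i / 99 for i in range(100)]
alpha_bars = make_alpha_bars(betas)
x0 = [rng.gauss(0.0, 1.0) for _ in range(8)]
eps = [rng.gauss(0.0, 1.0) for _ in range(8)]
t = 50
xt = q_sample(x0, eps, alpha_bars[t])
v = velocity(x0, eps, alpha_bars[t])
x0_hat = predict_x0(xt, v, alpha_bars[t])
print("max |x0_hat - x0| =", max(abs(a - b) for a, b in zip(x0_hat, x0)))
```

With the true velocity, `predict_x0` recovers `x0` up to floating-point error, which is the identity behind (Eq. 8). In `cfg_rescaled`, `phi=0` returns the plain CFG velocity of (Eq. 10), while `phi=1` returns a velocity whose standard deviation matches that of `v_pos`.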

- DreamVC: Text-to-Voice Conversion Model

DreamVC uses a text-guided process to modify the timbre of the source speech according to a given text prompt
- Architecturally, it is a conditional diffusion model that uses the speech content and the text prompt as dual conditions to guide the output
  - The output mel-spectrogram is then converted to a waveform by a pre-trained neural vocoder
- DreamVC differs from the existing DiffVC in the following ways:
  1. Instead of an average-voice encoder, it disentangles voice and content with a pre-trained SLM
  2. It merges the text prompt through cross-attention layers and uses CFG to control the impact of the condition
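
How a cross-attention layer can merge a text prompt into a frame sequence is sketched below. This is a generic single-head scaled dot-product attention, not DreamVC's actual layer: there are no learned projections, the residual merge in `merge_prompt` is an assumption, and all names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Each query row (a speech frame) attends over the prompt token rows."""
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        out.append([
            sum(w * val[j] for w, val in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return out


def merge_prompt(frames, prompt_tokens):
    """Assumed residual merge: frame + attention over the prompt tokens."""
    attended = cross_attention(frames, prompt_tokens, prompt_tokens)
    return [[f + a for f, a in zip(frame, att)] for frame, att in zip(frames, attended)]


# Two toy frames attending over three toy prompt token embeddings.
frames = [[1.0, 0.0], [0.0, 1.0]]
prompt_tokens = [[0.5, 0.5], [1.0, -1.0], [-1.0, 1.0]]
print(merge_prompt(frames, prompt_tokens))
```

The output keeps one row per frame, so the prompt length does not have to match the number of speech frames. This is what makes cross-attention a convenient way to inject a variable-length text condition.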

- DreamVG: Text-to-Voice Generation Plugin

Because DreamVC is diffusion-based, it has slow inference and high memory usage
- DreamVG therefore adopts a plug-and-use strategy, using a conditional diffusion model to efficiently generate a speaker embedding from the text prompt
- The speaker embedding generated by the DreamVG module can replace the latent speaker embedding that a one-shot VC model uses to generate the target speech
  - As a result, the plugin method extends a pre-trained one-shot VC model with flexible text guidance
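
The plugin idea can be sketched as plain function composition. Everything here is hypothetical: `make_text_guided_vc`, the toy text-to-speaker function, and the toy one-shot VC function are stand-ins for illustration only, not DreamVG or any real VC model.

```python
from typing import Callable, List, Sequence

Embedding = List[float]


def make_text_guided_vc(
    text_to_speaker: Callable[[str], Embedding],
    one_shot_vc: Callable[[Sequence[float], Embedding], List[float]],
) -> Callable[[Sequence[float], str], List[float]]:
    """Wrap a one-shot VC model so that a text prompt drives it."""

    def convert(source_speech: Sequence[float], prompt: str) -> List[float]:
        # The generated embedding takes the place of one extracted from a recording.
        speaker_embedding = text_to_speaker(prompt)
        return one_shot_vc(source_speech, speaker_embedding)

    return convert


def toy_text_to_speaker(prompt: str) -> Embedding:
    """Toy stand-in for a text-to-voice generator (not a real model)."""
    return [float(len(prompt)), float(prompt.count(" ") + 1)]


def toy_one_shot_vc(source_speech: Sequence[float], speaker_embedding: Embedding) -> List[float]:
    """Toy stand-in for a one-shot VC model: scales the input by the embedding mean."""
    gain = sum(speaker_embedding) / len(speaker_embedding)
    return [gain * s for s in source_speech]


convert = make_text_guided_vc(toy_text_to_speaker, toy_one_shot_vc)
print(convert([1.0, 2.0], "deep voice"))
```

The one-shot VC function is untouched here. Only the source of its speaker embedding changes, which is why a pre-trained model could be reused without retraining under this kind of interface.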

3. Experiments

- Settings

Dataset : VCTK, LibriTTS, DreamVoiceDB
Comparisons : FreeVC, ReDiffVC

- Results

Overall, DreamVoice with text guidance shows the best performance
