[Paper Review] PromptVC: Flexible Stylistic Voice Conversion in Latent Space Driven by Natural Language Prompts

feVeRin 2024. 9. 1. 10:10

PromptVC: Flexible Stylistic Voice Conversion in Latent Space Driven by Natural Language Prompts

Existing voice conversion relies on pre-defined labels or reference speech, which limits the range of styles it can express
PromptVC
- Uses a latent diffusion model to generate a style vector driven by a natural language prompt
- To improve style expressiveness, extracts discrete tokens with HuBERT and applies $k$-means center embeddings to minimize residual style information
- Additionally deduplicates identical discrete tokens and predicts the duration of each token with a differentiable duration predictor
Paper (ICASSP 2024) : Paper Link

1. Introduction

Voice Conversion (VC) aims to convert the original speaker's voice into a target voice while preserving the linguistic content
- Most existing VC systems rely on auxiliary labels or reference speech for style control
  1. Categorical label-based VC controls the conversion with pre-defined auxiliary categorical labels, each representing an individual style
  2. Reference speech-based VC uses a reference encoder to extract an expressive style representation such as Global Style Tokens
    - BUT, neither approach is user-friendly, and both have limited style expressiveness
- Text descriptions written in natural language, on the other hand, allow flexible conversion
  - In particular, text-prompt-based models such as InstructTTS and PromptTTS show strong performance in speech synthesis

-> The paper therefore proposes PromptVC, which controls the VC task with text prompts

PromptVC
- Uses a latent diffusion model that generates a style vector from a natural language prompt
  1. The end-to-end VC model is trained to reconstruct the target waveform under the control of the style encoder, and the latent diffusion model is trained to sample the style encoder's output from noise
  2. At inference, the latent diffusion model generates the target style vector from the natural language prompt, and the conversion model reconstructs speech conditioned on the generated style vector
- Extracts discrete tokens with HuBERT to minimize residual style information in the linguistic content
  1. The tokens are then deduplicated, and a differentiable duration predictor predicts the duration of each token
  2. The discrete tokens are then replaced with $k$-means center embeddings, which provide relative positional information about the linguistic content
- Introduces a prosody encoder that captures phoneme-level prosody representations for better prosody modeling

< Overview of PromptVC >

Combines a latent diffusion model that generates a style vector from a natural language prompt with HuBERT-based linguistic discrete tokens
As a result, it achieves better conversion performance than existing methods

2. Method

- System Overview

Overall, PromptVC can be viewed as a conditional Variational AutoEncoder (cVAE)
- In the training phase,
  1. The posterior encoder and the prosody encoder convert the input mel-spectrogram into frame-level latent variables and phoneme-level prosody representations, respectively
  2. The content encoder then models high-level linguistic representations, from which the prosody predictor predicts the prosody representation
  3. The differentiable duration predictor converts the phoneme-level linguistic representation into a frame-level intermediate representation
    - This representation is constrained by the Evidence Lower BOund (ELBO) on the frame-level latent variables of the target speech
  4. The style encoder aims to extract a global vector containing the style information of the input mel-spectrogram
    - This vector is used as the condition for the differentiable duration predictor and the wave decoder
  5. Finally, the wave decoder generates the waveform from the frame-level latent variables
- In the inference phase, the target style vector is not extracted by the style encoder but obtained from a latent diffusion model conditioned on the text prompt

- Linguistic Units Extraction

To disentangle the linguistic content from the source speech and minimize residual style information, the paper uses linguistic content extraction based on discrete tokens
- The discrete tokens, generally regarded as semantic tokens, are extracted from the source speech with a pre-trained HuBERT followed by $k$-means clustering
  - Unlike conventional Phonetic PosteriorGrams (PPG), these frame-level tokens contain minimal content-independent information
- In addition, because the duration of the same linguistic content varies with style, identical semantic tokens are deduplicated and a duration predictor predicts the duration of each token
- To avoid mispronunciation, each deduplicated token is replaced with a linguistic unit, i.e., the center embedding of its $k$-means cluster
  - These linguistic units carry more positional information about the linguistic content than the discrete deduplicated tokens
  - As a result, the extracted phoneme-level linguistic units minimize residual information from the original style and can reflect the different durations of different style conditions
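
The deduplication and center-embedding lookup described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the paper's implementation: it assumes that "deduplicate" means collapsing runs of identical consecutive frame-level tokens, and the names `deduplicate`, `to_linguistic_units`, and the toy codebook `centers` are hypothetical. In practice the token IDs would come from HuBERT features quantized with $k$-means.

```python
from itertools import groupby

def deduplicate(tokens):
    """Collapse runs of identical consecutive tokens into (units, durations)."""
    units, durations = [], []
    for token, run in groupby(tokens):
        units.append(token)
        durations.append(sum(1 for _ in run))  # run length in frames
    return units, durations

def to_linguistic_units(tokens, centers):
    """Replace each deduplicated token ID with its k-means center embedding."""
    units, durations = deduplicate(tokens)
    return [centers[u] for u in units], durations

# Toy input (made-up values): 8 frame-level token IDs and a
# 4-entry codebook of 3-dimensional cluster centers.
frame_tokens = [2, 2, 2, 0, 0, 3, 3, 2]
centers = [[0.0, 0.1, 0.2], [1.0, 1.1, 1.2], [2.0, 2.1, 2.2], [3.0, 3.1, 3.2]]

embeddings, durations = to_linguistic_units(frame_tokens, centers)
print(durations)        # run length of each deduplicated token
print(len(embeddings))  # number of phoneme-level linguistic units
```

The run lengths returned here are the "deduplication lengths" that serve as duration targets during training, while the embeddings form the phoneme-level input to the content encoder.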

- Style Modeling

The style encoder uses multi-head self-attention and temporal averaging to extract a global style representation from the reference mel-spectrogram $\mathbf{x}_{mel}$, which controls the output speech style
- Structurally it follows the Meta-StyleSpeech architecture, and Style-Adaptive Layer Normalization (SALN) is added to align the gain and bias of the linguistic content with the extracted style representation
- The paper additionally introduces a prosody encoder that captures phoneme-level prosody for natural-sounding speech
  1. As shown in the figure below, the prosody encoder consists of 4 WaveNet residual dilated blocks
    - Each block has dilated convolution layers with gated activation and skip connections
  2. The dilated block output is a frame-level representation sequence, so linguistic content can leak during training
    - To address this, the deduplication length is used as the phoneme-level duration to convert the frame-level representation into phoneme-level prosody features
  3. A KL-divergence can then be applied to constrain the phoneme-level prosody extracted by the prosody encoder and the prosody predictor:
    (Eq. 1) $\mathcal{L}_{pro}=\mathbb{E}_{q_{\phi}(z|\mathbf{x}_{mel})}\left[\log q_{\phi}(z|\mathbf{x}_{mel})-\log p_{\theta}(z|z_{text})\right]$
    - $z, z_{text}$ : prosody representations obtained from the prosody encoder and the prosody predictor, respectively
    - This prosody modeling consequently alleviates the one-to-many problem
- PromptVC uses the deduplication length as the duration, which is shorter than the conventional phoneme length
  1. With the hard expansion of FastSpeech, duration prediction is therefore inaccurate
  2. The paper thus adopts a differentiable duration predictor with a trainable upsampling layer
    - Using the predicted durations, this layer learns a projection matrix that expands the linguistic hidden sequence to the frame level in a differentiable manner
- The remaining content encoder, posterior encoder, and wave decoder follow the VITS architecture
  - The training objective therefore consists of the original VITS objective plus the prosody loss $\mathcal{L}_{pro}$
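
Two steps of the prosody modeling above lend themselves to a small sketch: converting frame-level features to phoneme-level features with the deduplication lengths, and the KL term of Eq. 1. Both functions rest on assumptions that the review does not confirm. `pool_to_phoneme_level` assumes the conversion is a plain average over each token's frames, and `gaussian_kl` assumes both distributions are diagonal Gaussians so that the expectation in Eq. 1 has a closed form. All names and values are hypothetical.

```python
import math

def pool_to_phoneme_level(frames, durations):
    """Average frame-level vectors over each token's deduplication length."""
    if sum(durations) != len(frames):
        raise ValueError("durations must cover every frame exactly once")
    pooled, start = [], 0
    for d in durations:
        segment = frames[start:start + d]
        dim = len(segment[0])
        pooled.append([sum(f[i] for f in segment) / d for i in range(dim)])
        start += d
    return pooled

def gaussian_kl(mu_q, logvar_q, mu_p, logvar_p):
    """KL(q || p) between diagonal Gaussians, summed over dimensions."""
    total = 0.0
    for mq, lq, mp, lp in zip(mu_q, logvar_q, mu_p, logvar_p):
        total += 0.5 * (lp - lq + (math.exp(lq) + (mq - mp) ** 2) / math.exp(lp) - 1.0)
    return total

# Toy data (made-up values): 3 frames of 2-dim features, two tokens
# lasting 2 frames and 1 frame.
frames = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
durations = [2, 1]
pooled = pool_to_phoneme_level(frames, durations)
print(pooled)

# KL is zero for identical distributions and grows as the means separate.
kl_same = gaussian_kl([0.0], [0.0], [0.0], [0.0])
kl_shifted = gaussian_kl([1.0], [0.0], [0.0], [0.0])
print(kl_same, kl_shifted)
```

Pooling by duration means that each prosody vector summarizes a whole token rather than a single frame, which is how the review describes limiting linguistic content leakage.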

- Generative Latent Diffusion

To perform stylistic voice conversion with natural language prompts, the paper introduces a latent diffusion model that generates the global style vector extracted by the style encoder
- Conditioned on the textual representation, the latent diffusion model splits the generation process into multiple conditional diffusion steps
  - To understand natural language prompts, ChatGLM2-6B is used as a semantic text understanding encoder to produce the textual representation
  - The resulting textual representation is then used as the condition of the latent diffusion model through cross-attention
- The latent diffusion model input $\mathbf{x}_{t}$ is obtained by corrupting the original style vector $\mathbf{x}$ through a Gaussian diffusion process, using a noise schedule parameterized by the standard deviation of the noise at time $t$
- The training loss of the latent diffusion model $\epsilon_{\theta}$ is the mean squared error in noise space:
  (Eq. 2) $\mathcal{L}_{diff} = || \epsilon_{\theta}(\mathbf{x}_{t},\mathbf{c},t)-\epsilon||^{2}$
  - $\mathbf{c}$ : textual representation, $\epsilon$ : diffusion noise
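
The noise-prediction objective of Eq. 2 can be illustrated with a minimal standard-library sketch. Several parts are assumptions rather than details from the paper: the corruption is taken to be $\mathbf{x}_{t} = \mathbf{x} + \sigma_{t}\epsilon$, the geometric schedule `sigma_of_t` and its bounds are hypothetical, and `zero_model` and `make_oracle` are placeholder stand-ins for the real network $\epsilon_{\theta}$, which would attend to the textual representation through cross-attention.

```python
import random

def sigma_of_t(t, sigma_min=0.01, sigma_max=1.0):
    """Hypothetical geometric schedule: noise std grows as t goes from 0 to 1."""
    return sigma_min * (sigma_max / sigma_min) ** t

def corrupt(x, sigma, rng):
    """Forward process (assumed form): add Gaussian noise with std sigma to x."""
    eps = [rng.gauss(0.0, 1.0) for _ in x]
    x_t = [xi + sigma * ei for xi, ei in zip(x, eps)]
    return x_t, eps

def diffusion_loss(eps_model, x, c, t, rng):
    """Squared error between predicted and true noise (Eq. 2) for one sample."""
    x_t, eps = corrupt(x, sigma_of_t(t), rng)
    eps_hat = eps_model(x_t, c, t)
    return sum((a - b) ** 2 for a, b in zip(eps_hat, eps))

def zero_model(x_t, c, t):
    """Placeholder predictor that always outputs zero noise."""
    return [0.0] * len(x_t)

def make_oracle(x):
    """Placeholder predictor that knows the clean vector x and inverts the corruption."""
    def oracle(x_t, c, t):
        s = sigma_of_t(t)
        return [(a - b) / s for a, b in zip(x_t, x)]
    return oracle

rng = random.Random(0)
style = [0.3, -1.2, 0.8, 0.1]   # made-up style vector x
prompt_repr = [0.0, 1.0]        # made-up textual representation c

loss_zero = diffusion_loss(zero_model, style, prompt_repr, 0.5, rng)
loss_oracle = diffusion_loss(make_oracle(style), style, prompt_repr, 0.5, rng)
print(loss_zero, loss_oracle)
```

The oracle drives the loss to numerical zero because it recovers the injected noise exactly, which is the behavior the trained model approximates using only $\mathbf{x}_{t}$, $\mathbf{c}$, and $t$.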

3. Experiments

- Settings

Dataset : Multi-speaker Mandarin Corpus (internal)
Comparisons : StyleSpeech, MixEmo

- Results

Soft Voice Conversion by Reference Speech
- Overall, PromptVC achieves the best performance

Soft Voice Conversion by Natural Language Prompts
- In the ABX test, results with reference speech and with prompts do not differ much

Ablation Study
- Removing the prosody encoder and replacing the linguistic representation with PPG degrade performance
