[Paper Review] Vevo: Controllable Zero-Shot Voice Imitation with Self-Supervised Disentanglement

feVeRin 2025. 2. 15. 17:28

Voice imitation relies heavily on annotated data and struggles to disentangle timbre from style
Vevo
- Content-Style Modeling takes text/speech content tokens as input and generates content-style tokens, prompted by a style reference
- Acoustic Modeling uses a flow-matching transformer to generate acoustic representations from the content-style tokens, prompted by a timbre reference
- To obtain the content and content-style tokens, a fully self-supervised approach progressively decouples timbre, style, and linguistic content
Paper (ICLR 2025) : Paper Link

1. Introduction

Voice imitation covers tasks such as speaker identity imitation, speaking style/accent/emotion imitation, and zero-shot text-to-speech (TTS)
- Prior work achieves controllable imitation by factorizing speech into multiple sub-spaces
- Speech is typically decomposed into three attributes (linguistic content, style, timbre), which gives the following main zero-shot speech generation tasks:
  1. Timbre Imitation
    - Given a source speech, imitate only the timbre of the reference speech while preserving linguistic content and speaking style
    - i.e., only the spectral aspects of the speech are converted
  2. Style Imitation
    - Given a source speech, imitate only the speaking style of the reference speech while preserving linguistic content and timbre
    - i.e., accent conversion and emotion conversion
  3. Voice Imitation
    - Given speech (conversion task) or text (synthesis task) as the source, imitate both the timbre and style of the reference speech while preserving the content
    - i.e., voice conversion that converts both spectral and prosodic aspects, or zero-shot TTS
- Parallel corpora, disentangled representation learning, and large-scale in-context learning have been used for these imitation tasks, but they have the following limitations:
  1. Style imitation relies heavily on supervision from annotated data
    - BUT, such data is hard to collect and scale up
  2. Decoupling of timbre and style is still insufficient
    - Additional fine-tuning stages or perturbations are required, which makes independent control difficult

-> The paper therefore proposes Vevo, a VErsatile zero-shot VOice imitation framework that controls both style and timbre

Vevo
- Introduces Content-Style Modeling, which generates content-style tokens from input content tokens or input text, given a speech prompt as the style reference
  - A decoder-only autoregressive transformer models the style
- Uses Acoustic Modeling, which generates acoustic representations such as mel-spectrograms from the content-style tokens, given a speech prompt as the timbre reference
  - A flow-matching transformer performs the timbre-controllable generation
- Additionally designs a self-supervised method that gradually decouples timbre, style, and linguistic content to obtain the content-style and content tokens
  1. A VQ-VAE is adopted as the tokenizer for HuBERT to filter out timbre and produce content-style tokens
  2. The vocabulary size of the VQ-VAE codebook is treated as an information bottleneck to obtain content tokens with timbre/style information filtered out
    - Duration reduction is then applied to further remove style patterns such as unit-level duration

< Overall of Vevo >

A fully self-supervised, unified zero-shot voice imitation framework that progressively decouples timbre, style, and linguistic content
As a result, it achieves better imitation and conversion performance than prior methods

2. Method

- VQ-VAE Tokenizer for HuBERT

Motivation
- To disentangle the representations of different speech attributes, the paper adopts a VQ-VAE tokenizer
- Specifically, the VQ-VAE is applied to Self-Supervised Learning (SSL) representations from HuBERT, for two reasons:
  1. HuBERT's continuous hidden features contain rich information about timbre, style, and linguistic content, so they are well suited to reconstructing acoustic representations such as mel-spectrograms
  2. Self-supervised learning on speech can be treated as high-level knowledge distillation
- i.e., using a VQ-VAE improves information filtering and disentangling of the SSL features

Architecture
- The VQ-VAE consists of an encoder, vector quantization (VQ), and a decoder
- Given a codebook $\mathbf{E}=[e_{1},e_{2},...,e_{K}]$ with vocabulary size $K$ and a HuBERT hidden feature $x$ as input, the reconstruction $\hat{x}$ is obtained through the following three modules:
  (Eq. 1) $z_{e}(x)=\text{Encoder}(x),$
  $\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,z_{q}(x)=e_{k},\,\, \text{where}\,\, k=\arg\min_{j}|| z_{e}(x)-e_{j}||_{2},$
  $\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\hat{x}=\text{Decoder}(z_{q}(x))$
  - $z_{q}(x)$ : the quantized representation of $z_{e}(x)$ after VQ, i.e., the token
- The loss function consists of a reconstruction loss with weight $\lambda$ and a quantization loss with weight $\beta$:
  (Eq. 2) $\mathcal{L}=\lambda||x-\hat{x}||_{2}^{2}+\beta|| z_{e}(x)-z_{q}(x)||_{2}^{2}$
- Since the true gradient of $z_{q}(x)$ is undefined (the $\arg\min$ is not differentiable), a straight-through gradient estimator or Exponential Moving Average (EMA) can be used as the optimization algorithm
  - The paper follows SoundStream and adopts the EMA algorithm
- The VQ-VAE model contains no down/upsampling, so the sequence length of the input $x$ is preserved
  - i.e., 50Hz frame-level HuBERT features yield 50Hz frame-level tokens after VQ
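
The quantization step of Eq. 1 and the loss of Eq. 2 can be sketched in plain Python. This is only a toy illustration: the encoder and decoder networks are omitted, and the 2-D codebook values, the frame values, and the default weights `lam` and `beta` are hypothetical, not taken from the paper.

```python
import math

def nearest_code(z_e, codebook):
    """Eq. 1: return (k, e_k), the codebook entry closest to z_e in L2 distance."""
    dists = [math.dist(z_e, e) for e in codebook]
    k = min(range(len(codebook)), key=dists.__getitem__)
    return k, codebook[k]

def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def vq_vae_loss(x, x_hat, z_e, z_q, lam=1.0, beta=0.25):
    """Eq. 2: weighted reconstruction loss plus weighted quantization loss.
    lam and beta are placeholder weights (assumed values)."""
    return lam * squared_l2(x, x_hat) + beta * squared_l2(z_e, z_q)

# Toy codebook with K = 3 entries (hypothetical values).
codebook = [(0.0, 0.0), (1.0, 1.0), (4.0, 0.0)]
# One encoder output per frame; quantizing frame by frame keeps the length.
frames = [(0.2, -0.1), (0.9, 1.2), (3.5, 0.4), (3.9, 0.1)]
tokens = [nearest_code(z, codebook)[0] for z in frames]
print(tokens)  # [0, 1, 2, 2]
```

Four input frames give four tokens, which mirrors the point that the sequence length is preserved when no down/upsampling is involved.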

Analysis of the Vocabulary Size of Codebook
- Quantizing HuBERT hidden features with a VQ-VAE can be viewed as lossy compression, so, as in AutoVC, the vocabulary size of the VQ codebook can be treated as an information bottleneck
- Suppose the input $x$ contains sufficient speech information and the vocabulary size $K$ is reduced from infinity toward $0$:
  1. When $K\rightarrow \infty$, the bottleneck is too wide and accommodates all information without loss
  2. As $K$ decreases, low-level acoustic information is lost, such as timbre-related spectral features and style-related prosodic features
    - For a certain reduced $K$, only the highest-level abstract information in $x$, such as linguistic content, is preserved
  3. When $K\rightarrow 0$, the bottleneck is exceedingly narrow and even high-level information such as linguistic content is filtered out
- In practice, on the zero-shot timbre imitation task, timbre information is the first to be filtered out as $K$ progressively decreases, which yields content-style tokens ($K=K_{s}$)
  1. Likewise, once most style information is also filtered out, only the highest-level linguistic content is preserved, which yields content tokens ($K=K_{c}$)
  2. The VQ-VAE with $K=K_{s}$ is called the content-style tokenizer $\mathbf{Q}_{s}$, and the VQ-VAE with $K=K_{c}$ is called the content tokenizer $\mathbf{Q}_{c}$

- Content-Style Modeling (Content to Content-Style)

Content-style modeling aims to convert the content tokens of speech/text into content-style tokens, prompted by a style reference
- This can be treated as a sequence-to-sequence generation task, for which the paper adopts a decoder-only autoregressive transformer
Duration Reduction
- Given a speech input $u$, denote its content and content-style tokens by $\mathbf{Q}_{c}(u)$ and $\mathbf{Q}_{s}(u)$, respectively
  - Both are 50Hz frame-level representations of equal length
- In content-style modeling, $\mathbf{Q}_{s}(u)$ is used as the output
  1. For the input, instead of using $\mathbf{Q}_{c}(u)$ directly, Duration Reduction is applied to produce the reduced $\mathbf{Q}'_{c}(u)$
  2. To do so, consecutive duplicate units in $\mathbf{Q}_{c}(u)$ are merged into one
    - e.g.) $\mathbf{Q}_{c}(u)=[e_{1},e_{1},e_{1},e_{2},e_{3},e_{3}]$ is reduced to $\mathbf{Q}'_{c}(u)=[e_{1},e_{2},e_{3}]$
- Duration reduction has the following advantages:
  1. It further filters out style-specific information in $\mathbf{Q}_{c}(u)$, such as unit-level duration
  2. It resolves the issue of the sequence length changing before and after style modeling, which the frame-aligned, equal-length $\mathbf{Q}_{c}(u)$ and $\mathbf{Q}_{s}(u)$ cannot express
  3. It shortens the overall sequence length, which improves the transformer's context modeling
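
Duration reduction amounts to collapsing runs of identical units. A minimal sketch using `itertools.groupby` follows; the function names are illustrative, and the units are written as strings purely for readability.

```python
from itertools import groupby

def duration_reduce(tokens):
    """Merge each run of consecutive identical units into a single unit."""
    return [unit for unit, _run in groupby(tokens)]

def run_lengths(tokens):
    """Unit-level durations (frames per run) that the reduction discards."""
    return [len(list(run)) for _unit, run in groupby(tokens)]

q_c = ["e1", "e1", "e1", "e2", "e3", "e3"]
print(duration_reduce(q_c))  # ['e1', 'e2', 'e3']
print(run_lengths(q_c))      # [3, 1, 2]
```

Only adjacent duplicates are merged, so a unit that reappears later in the sequence is kept: `duration_reduce([1, 1, 2, 1])` gives `[1, 2, 1]`.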
Global Style Encoder
- Vevo introduces a global style encoder that captures global style guidance from the speech input $u$ and produces a style embedding $\mathbf{g}(u)$
  1. To optimize inference speed and reduce memory usage, reference-global-guided continuation can be used, which relies on this style embedding alone
  2. To maximize style imitation performance, reference-style-enhanced continuation can be used instead, which appends the content-style tokens of the style reference to the input sequence in addition to $\mathbf{g}(u)$
- The global style encoder consists of WavLM-based representation layers and TDNN-based feature extraction layers

Training and Inference
- Training is self-supervised on speech data
  1. The input sequence to the transformer is $[\langle \text{SOS}\rangle, \mathbf{Q}'_{c}(u),\langle\text{SEP}\rangle,\mathbf{g}(u),\langle\text{SEP}\rangle,\mathbf{Q}_{s}(u)]$
  2. Next-token prediction is performed only on the last $[\langle \text{SEP}\rangle,\mathbf{Q}_{s}(u)]$, for which the ground truth is $[\mathbf{Q}_{s}(u),\langle\text{EOS}\rangle]$
    - $\langle\text{SOS}\rangle, \langle\text{SEP}\rangle, \langle \text{EOS}\rangle$ are treated as special tokens of the language model
- During inference, given a source speech $u_{i}$ and a style reference $u_{sr}$, reference-style-enhanced continuation is performed by feeding the input sequence $[\langle\text{SOS}\rangle,\mathbf{Q}'_{c}(u_{sr}\oplus u_{i}),\langle\text{SEP}\rangle,\mathbf{g}(u_{sr}),\langle\text{SEP}\rangle,\mathbf{Q}_{s}(u_{sr})]$ for autoregressive generation
  - $\oplus$ : concatenation
- For reference-global-guided continuation, the input sequence is $[\langle\text{SOS}\rangle,\mathbf{Q}'_{c}(u_{i}),\langle\text{SEP}\rangle,\mathbf{g}(u_{sr}),\langle\text{SEP}\rangle]$
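
The sequence layouts above can be sketched as plain list construction. This is a hedged illustration rather than the paper's code: the function names are hypothetical, the style embedding is represented by a placeholder string, the inference prompts reuse the training layout (separators included), and concatenating two token lists stands in for tokenizing the concatenated audio $u_{sr}\oplus u_{i}$.

```python
SOS, SEP, EOS = "<SOS>", "<SEP>", "<EOS>"

def dedup(tokens):
    """Duration reduction: drop consecutive duplicate units."""
    out = []
    for t in tokens:
        if not out or out[-1] != t:
            out.append(t)
    return out

def training_example(q_c, q_s, g):
    """Return (inputs, targets); a target of None means no loss at that position."""
    prefix = [SOS, *dedup(q_c), SEP, g]
    inputs = prefix + [SEP, *q_s]
    # Loss only on the last [SEP, Q_s(u)], predicting [Q_s(u), EOS].
    targets = [None] * len(prefix) + [*q_s, EOS]
    return inputs, targets

def style_enhanced_prompt(q_c_ref, q_c_src, g_ref, q_s_ref):
    """Prompt for reference-style-enhanced continuation.
    Assumption: token-list concatenation approximates Q'_c(u_sr ⊕ u_i)."""
    return [SOS, *dedup(q_c_ref + q_c_src), SEP, g_ref, SEP, *q_s_ref]

def global_guided_prompt(q_c_src, g_ref):
    """Prompt for reference-global-guided continuation (style embedding only)."""
    return [SOS, *dedup(q_c_src), SEP, g_ref, SEP]

inputs, targets = training_example([1, 1, 2], [7, 8, 9], "g(u)")
print(inputs)   # ['<SOS>', 1, 2, '<SEP>', 'g(u)', '<SEP>', 7, 8, 9]
print(targets)  # [None, None, None, None, None, 7, 8, 9, '<EOS>']
```

The global-guided prompt omits the reference's content and content-style tokens entirely, which is where its shorter sequence length comes from.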

- Acoustic Modeling (Content-Style to Acoustic)

Acoustic modeling converts content-style tokens into a mel-spectrogram, prompted by a timbre reference
- Vevo adopts a flow-matching transformer to obtain high-quality acoustic representations
- During training, given a speech $u$ and its mel-spectrogram $y_{1}$, a part of $y_{1}$ is selected as the timbre reference $y^{ctx}_{1}$, and the remaining part $y_{1}^{mis}$ is reconstructed conditioned on $y_{1}^{ctx}$ and the content-style tokens $\mathbf{Q}_{s}(u)$
  1. i.e., the goal is to model the conditional probability $p(y_{1}^{mis}|y_{1}^{ctx}, \mathbf{Q}_{s}(u))$
  2. To this end, following VoiceBox, the temporal span masking strategy $y_{1}^{mis}=m\odot y_{1},\,\,y_{1}^{ctx}=(1-m)\odot y_{1}$ is adopted
    - $m$ : a binary temporal mask with the same length as $y_{1}$
    - $\odot$ : element-wise multiplication
- During inference, given a source speech $u_{i}$ and a timbre reference $u_{tr}$, the entire mel-spectrogram of the source is masked as $y_{1}^{mis}$
  1. The input conditions are the mel-spectrogram of the timbre reference, $y_{1}^{ctx}$, and the concatenated content-style tokens $\mathbf{Q}_{s}(u_{i}\oplus u_{tr})$
  2. The generated target thus preserves the linguistic content and style of $u_{i}$ and the timbre of $u_{tr}$
- The paper uses conditional flow matching based on the optimal transport path, with the loss function:
  (Eq. 3) $\mathcal{L}_{cfm}=\mathbb{E}_{t,m,y_{0},y_{1}}\left|\left| \frac{dy_{t}}{dt}-f_{t}(y_{t},t,y_{1}^{ctx},\mathbf{Q}_{s}(u))\right|\right|_{2}^{2},\,\, \text{where} \,\,y_{t}=(1-(1-\sigma)t)\cdot y_{0}+t\cdot y_{1}$
  - $t$ : time step sampled from the uniform distribution $\mathcal{U}(0,1)$
  - $y_{0}$ : noise sampled from a standard Gaussian distribution
  - $\sigma$ : a small constant of the optimal transport path
- The frame rates of the content-style tokens $\mathbf{Q}_{s}(u)$ and the mel-spectrogram $y_{1}$ may differ
  1. A signal resampling operation aligns them, and the frame-level features are fused by addition
  2. After the mel-spectrogram is obtained, a BigVGAN vocoder generates the waveform
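
The temporal span mask and the regression target of Eq. 3 can be worked out on toy data. Differentiating $y_{t}=(1-(1-\sigma)t)\cdot y_{0}+t\cdot y_{1}$ with respect to $t$ gives $\frac{dy_{t}}{dt}=y_{1}-(1-\sigma)\cdot y_{0}$, which is what the network $f_{t}$ is trained to match. The sketch below assumes nested lists for the mel frames, and the default `sigma` value, the span position, and the helper names are hypothetical.

```python
import random

def span_mask(length, start, span):
    """Binary temporal mask m: 1 inside the span to reconstruct, 0 elsewhere."""
    return [1 if start <= i < start + span else 0 for i in range(length)]

def split_by_mask(y1, m):
    """y_mis = m ⊙ y1 and y_ctx = (1 - m) ⊙ y1, applied frame by frame."""
    y_mis = [[mi * v for v in frame] for frame, mi in zip(y1, m)]
    y_ctx = [[(1 - mi) * v for v in frame] for frame, mi in zip(y1, m)]
    return y_mis, y_ctx

def ot_path(y0, y1, t, sigma=1e-5):
    """Return y_t on the optimal transport path and its time derivative.
    sigma is a placeholder value (assumption)."""
    y_t = [[(1 - (1 - sigma) * t) * a + t * b for a, b in zip(f0, f1)]
           for f0, f1 in zip(y0, y1)]
    target = [[b - (1 - sigma) * a for a, b in zip(f0, f1)]
              for f0, f1 in zip(y0, y1)]
    return y_t, target

# Toy "mel-spectrogram": 4 frames with 2 bins each.
y1 = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
m = span_mask(len(y1), start=1, span=2)
y_mis, y_ctx = split_by_mask(y1, m)
print(m)      # [0, 1, 1, 0]
print(y_ctx)  # [[1.0, 2.0], [0.0, 0.0], [0.0, 0.0], [7.0, 8.0]]

rng = random.Random(0)
y0 = [[rng.gauss(0.0, 1.0) for _ in frame] for frame in y1]  # Gaussian noise
t = rng.uniform(0.0, 1.0)                                    # t ~ U(0, 1)
y_t, target = ot_path(y0, y1, t)
```

At `t = 0` the path starts at the noise `y0`, and at `t = 1` it ends at `y1` plus a `sigma`-scaled residue of the noise.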

- Vevo for Various Zero-shot Imitation Tasks

Suppose content-style modeling and acoustic modeling yield the pre-trained models $\mathcal{M}_{style}$ and $\mathcal{M}_{acoustic}$, respectively
- Vevo can then be applied to various zero-shot imitation tasks by adjusting the inference pipeline
  - $\xrightarrow{u}\mathcal{M}$ : model $\mathcal{M}$ generates while prompted by $u$
- Given a source speech $u_{i}$ or text $\mathcal{T}_{i}$ and a reference $u_{r}$, the following variants cover the zero-shot timbre/style/voice imitation tasks:
  1. Vevo-Timbre : for timbre imitation, $\mathbf{Q}_{s}(u_{i})\xrightarrow{u_{r}}\mathcal{M}_{acoustic}$
  2. Vevo-Style : for style imitation, $\mathbf{Q}'_{c}(u_{i})\xrightarrow{u_{r}}\mathcal{M}_{style}\xrightarrow{u_{i}}\mathcal{M}_{acoustic}$
  3. Vevo-Voice : for voice conversion, $\mathbf{Q}'_{c}(u_{i})\xrightarrow{u_{r}}\mathcal{M}_{style}\xrightarrow{u_{r}}\mathcal{M}_{acoustic}$
  4. Vevo-TTS : for synthesis, $\widetilde{\mathbf{Q}}_{c}(\mathcal{T}_{i})\xrightarrow{u_{r}}\widetilde{\mathcal{M}}_{style}\xrightarrow{u_{r}}\mathcal{M}_{acoustic}$
    - $\widetilde{\mathbf{Q}}_{c}(\mathcal{T}_{i})$ : tokenization of $\mathcal{T}_{i}$
    - $\widetilde{\mathcal{M}}_{style}$ : pre-trained content-style model that takes text as input
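
The four variants differ only in what is tokenized and in which utterance prompts each stage, so they can be summarized as a small lookup table. The `Variant` dataclass and `describe` helper below are hypothetical names for illustration; no actual models are invoked.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Variant:
    source: str                  # tokenized input fed to the first stage
    style_model: Optional[str]   # content-style model; None = stage skipped
    style_prompt: Optional[str]  # utterance prompting the style model
    timbre_prompt: str           # utterance prompting M_acoustic

VARIANTS = {
    "Vevo-Timbre": Variant("Q_s(u_i)", None, None, "u_r"),
    "Vevo-Style": Variant("Q'_c(u_i)", "M_style", "u_r", "u_i"),
    "Vevo-Voice": Variant("Q'_c(u_i)", "M_style", "u_r", "u_r"),
    "Vevo-TTS": Variant("Q~_c(T_i)", "M~_style", "u_r", "u_r"),
}

def describe(name):
    """Render the inference pipeline of a variant as a readable string."""
    v = VARIANTS[name]
    stages = [v.source]
    if v.style_model is not None:
        stages.append(f"{v.style_model}[prompt={v.style_prompt}]")
    stages.append(f"M_acoustic[prompt={v.timbre_prompt}]")
    return " -> ".join(stages)

for name in VARIANTS:
    print(f"{name}: {describe(name)}")
```

Read this way, Vevo-Style is the only variant that prompts the acoustic model with the source itself, which is how the source timbre is kept while the style changes.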

3. Experiments

- Settings

Dataset : Audiobook, Common Voice, Emotional/Accented dataset
Comparisons
- Voice Conversion : HierSpeech++, LM-VC, UniAudio, NaturalSpeech
- Style Imitation : ASR-AC, VoiceShop, Conv-Speak, Emovox
- Synthesis : CosyVoice, MaskGCT, VALL-E, VoiceBox, VoiceCraft

- Results

Effect of the Vocabulary Size of the VQ-VAE Tokenizer
- HuBERT continuous hidden features carry rich information about timbre, style, and linguistic content
- After ASR fine-tuning, PPG features and ASR tokens retain linguistic content effectively, but their timbre/style information is reduced
- Compared with VQ-VAE tokens, $K$-means tokens show lower intelligibility and S-SIM at the same vocabulary size $K=1024$
- VQ-VAE tokens retain more timbre information at large vocabulary sizes such as $16384$
  1. When reduced to $K=4096$, timbre information is filtered out while style information is relatively retained
  2. When reduced to $K=32$, style information is filtered out as well as timbre
  3. When reduced to $K=16, 8$, even high-level linguistic content is filtered out
- The paper therefore adopts VQ-VAEs with $K_{c}=32$ for the content tokenizer and $K_{s}=4096$ for the content-style tokenizer
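
As a rough intuition for why the vocabulary size acts as a bottleneck, a token stream with $K$ symbols at 50Hz can carry at most $50\cdot\log_{2}K$ bits per second. The short calculation below applies that bound to the vocabulary sizes listed above; it is an illustrative upper bound computed here, not a figure reported in the paper, and it ignores duration reduction.

```python
import math

FRAME_RATE_HZ = 50  # frame-level HuBERT tokens

def max_bits_per_second(vocab_size, frame_rate_hz=FRAME_RATE_HZ):
    """Upper bound on the information a K-symbol token stream can carry."""
    return frame_rate_hz * math.log2(vocab_size)

for k in (16384, 4096, 1024, 32, 16, 8):
    print(f"K={k:>5}: {max_bits_per_second(k):.0f} bits/s")
# K=16384: 700 bits/s
# K= 4096: 600 bits/s
# K= 1024: 500 bits/s
# K=   32: 250 bits/s
# K=   16: 200 bits/s
# K=    8: 150 bits/s
```

Under this bound, moving from $K_{s}=4096$ to $K_{c}=32$ cuts the per-frame capacity from 12 bits to 5 bits.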

Performance of $\mathcal{M}_{acoustic}$ trained with different HuBERT representations

Zero-shot Timbre Imitation and Voice Imitation (Conversion Task)
- Overall, Vevo shows the best performance on the conversion tasks

Zero-shot Style Imitation
- Vevo outperforms prior methods even without fine-tuning on accent/emotion corpora
- Providing text as additional supervision can improve intelligibility and accent imitation

Zero-shot Voice Imitation (Synthesis Task)
- Vevo also shows the best performance on TTS

Effect of the Duration Reduction and Different Inference Modes
- Duration reduction shortens the inference input length and helps duration conversion (DDUR)
- Reference-global-guided continuation incurs a slight performance drop but shortens the sequence length considerably
  - i.e., it has an advantage in inference memory and speed
