[Paper Review] LM-VC: Zero-Shot Voice Conversion via Speech Generation based on Language Models

feVeRin 2025. 7. 7. 17:04

LM-VC: Zero-Shot Voice Conversion via Speech Generation based on Language Models

Language models can be used for zero-shot voice conversion
LM-VC
- Uses coarse tokens that recover the source linguistic content and the target speaker timbre, and fine tokens that reconstruct the acoustic details of the converted speech
- Applies a masked prefix Language Model for content preservation and disentanglement
- Additionally introduces an external Language Model with window attention, which captures local acoustic relations, to alleviate sampling errors
Paper (Signal Processing Letters 2023) : Paper Link

1. Introduction

Voice Conversion (VC) aims to convert the speech of a source speaker into that of a target speaker while maintaining the linguistic content
- Zero-shot VC in particular performs VC with only a single utterance from the desired speaker
  - This requires effectively disentangling the target speaker timbre from the other speech components
- As a representative example, NANSY performs VC by extracting linguistic content with a Self-Supervised Learning (SSL) model and a speaker representation with a Speaker Verification (SV) model
  - However, because of the limited capacity of the disentanglement process, this approach does not generalize to unseen speakers and yields low speaker similarity
- Meanwhile, a Language Model (LM) can also be used for zero-shot audio generation
  - In AudioLM, VALL-E, SPEAR-TTS, and similar models, an SSL model extracts linguistic content from audio and a neural codec reconstructs high-quality audio at a low bitrate

-> The paper therefore proposes LM-VC, which applies LMs to the zero-shot VC task

LM-VC
- Uses a two-stage framework that first generates coarse acoustic tokens, which recover content and speaker timbre, and then reconstructs the fine acoustic details
- Introduces the Masked Prefix Language Model (MPLM), which applies a mask prediction strategy to coarse acoustic modeling for linguistic content preservation and better speech disentanglement
- Additionally applies an External Language Model (ELM) with window attention, which captures the local context of acoustic tokens, to alleviate sampling errors in the generation process
  - The ELM and the MPLM collaborate through shallow fusion to generate the target speech
- Finally, a Prefix Language Model (PLM) reconstructs the fine acoustic tokens from the coarse tokens in a non-autoregressive manner

< Overview of LM-VC >

A zero-shot VC model built on three language models: MPLM, ELM, and PLM
As a result, it outperforms existing methods

2. Method

- Overview

LM-VC consists of three LMs: the MPLM, the ELM, and the PLM
- Before language modeling, semantic tokens $\mathbf{s}=\{s_{1},s_{2},...,s_{T_{s}}\}$ and acoustic tokens $\mathbf{a}=\{a_{1}^{1},a_{1}^{2},...,a_{1}^{L},a_{2}^{1},...,a_{T_{a}}^{L}\}$ are extracted with HuBERT and SoundStream, respectively
  - $T_{s}, T_{a}$ : sequence lengths, $L$ : number of quantizers in SoundStream
  - LM-VC then performs coarse and fine acoustic modeling sequentially, as in AudioLM
- Coarse Acoustic Modeling
  1. The MPLM uses the semantic tokens $\{\mathbf{s},\tilde{\mathbf{s}}\}$ from the source and target speaker speech, and the first-layer acoustic tokens $\tilde{\mathbf{a}}^{1}$ from the target speaker speech
  2. Based on these, it autoregressively generates the acoustic tokens $\mathbf{a}^{1}$ of the target speech following $p(a_{t}^{1}|\tilde{\mathbf{s}},\mathbf{s},\tilde{\mathbf{a}}^{1},\mathbf{a}_{1:t-1}^{1})$
    - Here the ELM collaborates with the MPLM with window length $w$, following $p(a_{t}^{1}|\mathbf{a}_{t-w:t-1}^{1})$
- Fine Acoustic Modeling
  1. The PLM takes the first-layer acoustic tokens as input and non-autoregressively generates the fine acoustic tokens layer by layer
    - The semantic and acoustic tokens of the source and target speech are treated as the prompt of the PLM
  2. That is, for $l\in [2,L]$ the process corresponds to $p(\mathbf{a}^{l}|\tilde{\mathbf{s}},\mathbf{s},\tilde{\mathbf{a}},\mathbf{a}^{1:l-1},l)$
  3. Following VALL-E, the PLM is a multi-layer Transformer with bidirectional attention
- Finally, the waveform is reconstructed from the acoustic tokens with SoundStream
  - In the two-stage modeling, coarse acoustic modeling recovers linguistic content and speaker timbre, while fine acoustic modeling generates the fine acoustic details
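
To make the coarse/fine split concrete, here is a minimal sketch of how the flattened acoustic token sequence $\mathbf{a}$ could be separated into first-layer (coarse) tokens $\mathbf{a}^{1}$ and the remaining (fine) tokens $\mathbf{a}^{2:L}$. The function name `split_coarse_fine` and the toy token values are assumptions for illustration only and do not come from the paper or from SoundStream.

```python
from typing import List, Tuple


def split_coarse_fine(
    flat_tokens: List[int], num_quantizers: int
) -> Tuple[List[int], List[List[int]]]:
    """Split a time-major flattened acoustic token sequence.

    flat_tokens is assumed to be ordered as
    a_1^1, a_1^2, ..., a_1^L, a_2^1, ..., a_{T_a}^L.
    Returns (coarse, fine): coarse holds a_t^1 for each frame t,
    fine holds [a_t^2, ..., a_t^L] for each frame t.
    """
    if num_quantizers < 1:
        raise ValueError("num_quantizers must be at least 1")
    if len(flat_tokens) % num_quantizers != 0:
        raise ValueError("length must be a multiple of num_quantizers")

    num_frames = len(flat_tokens) // num_quantizers
    frames = [
        flat_tokens[t * num_quantizers:(t + 1) * num_quantizers]
        for t in range(num_frames)
    ]
    coarse = [frame[0] for frame in frames]   # first-quantizer tokens
    fine = [frame[1:] for frame in frames]    # remaining quantizer tokens
    return coarse, fine


# Toy example with T_a = 3 frames and L = 4 quantizers (made-up ids).
flat = [11, 12, 13, 14, 21, 22, 23, 24, 31, 32, 33, 34]
coarse, fine = split_coarse_fine(flat, num_quantizers=4)
print(coarse)  # [11, 21, 31]
print(fine)    # [[12, 13, 14], [22, 23, 24], [32, 33, 34]]
```

In this picture the MPLM and the ELM only ever predict the `coarse` list, and the PLM fills in the `fine` columns one quantizer layer at a time.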

- Masked Prefix Language Model

An LM can produce unnatural pronunciation, because linguistic content is lost as the network deepens during multi-layer modeling and because lengthy speech inputs prevent it from learning contextual information
- Semantic tokens also contain some speaker-related information, so this inadequate decoupling can lead to low speaker similarity
  - To address this, the paper introduces the Masked Prefix Language Model (MPLM), a multi-layer Transformer that uses two types of attention masks
- In particular, the MPLM uses a mask prediction strategy that restores masked tokens from the surrounding context in order to improve contextual learning
  1. Given a semantic token sequence $\mathbf{s}=\{s_{1},s_{2},...,s_{T_{s}}\}$, tokens are randomly selected as start indices at ratio $r$, and a span of $l$ steps from each start index is masked with the $[M]$ token
  2. After masking, the corrupted semantic tokens $\mathbf{s}_{mask}$ are taken as input and the masked tokens are recovered
    - A bidirectional attention mask is used here so that the MPLM can capture contextual information from both directions
  3. The negative log-likelihood loss over the masked tokens is then:
    (Eq. 1) $\mathcal{L}_{mask}=-\log \prod_{t\in M}p_{MPLM}\left(s_{t}|\mathbf{s}_{mask},t\right)$
- For acoustic generation, the mask prediction strategy makes the model capture speaker timbre exclusively from the target speaker speech while extracting content from the corrupted semantic sequence
  1. In this way the model learns better contextual information and implicitly creates an information bottleneck on the semantic tokens, which supports disentanglement
    - Notably, no speech clip is explicitly used as an acoustic prompt during training
  2. Instead, the MPLM uses the previous acoustic sequence $\mathbf{a}_{1:t-1}^{1}$ as the acoustic prompt to capture fine-grained speaker information and autoregressively generates $a_{t}^{1}$
    - The paper uses unidirectional attention here to achieve a left-to-right LM objective, so the acoustic token $a_{t}^{1}$ attends only to the previous sequence $\mathbf{a}_{1:t-1}^{1}$ and the semantic prefix $\mathbf{s}_{mask}$
  3. The resulting loss is:
    (Eq. 2) $\mathcal{L}_{ar}=-\log \prod_{t=0}^{T_{a}-1}p_{MPLM}\left(a_{t}^{1}|\mathbf{a}_{1:t-1}^{1}, \mathbf{s}_{mask},t\right)$
    - $T_{a}$ : acoustic token sequence length
- During training, semantic recovery and acoustic generation are performed simultaneously with $\mathcal{L}_{mask}+\mathcal{L}_{ar}$
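
The span masking and the two attention patterns described above can be sketched as follows. This is only one possible reading, not the paper's implementation: it assumes each position becomes a span start independently with probability `ratio`, that the acoustic inputs are shifted by one step (so a position may attend to itself), and it uses hypothetical names `MASK`, `mask_spans`, and `prefix_attention_mask`.

```python
import random
from typing import List, Tuple

MASK = -1  # hypothetical id standing in for the [M] token


def mask_spans(
    tokens: List[int], ratio: float, span: int, rng: random.Random
) -> Tuple[List[int], List[int]]:
    """Mask spans of `span` steps starting at randomly chosen positions.

    Assumption: every position is chosen as a start index independently
    with probability `ratio`; overlapping spans simply merge.
    Returns the corrupted sequence and the sorted masked indices.
    """
    masked = list(tokens)
    masked_idx = set()
    for start in range(len(tokens)):
        if rng.random() < ratio:
            for i in range(start, min(start + span, len(tokens))):
                masked[i] = MASK
                masked_idx.add(i)
    return masked, sorted(masked_idx)


def prefix_attention_mask(num_prefix: int, num_acoustic: int) -> List[List[bool]]:
    """Build a prefix-LM attention mask; mask[i][j] is True if i may attend to j.

    Semantic prefix positions attend to the whole prefix (bidirectional).
    Acoustic positions attend to the whole prefix and to acoustic positions
    up to and including themselves (unidirectional, inputs assumed shifted).
    """
    total = num_prefix + num_acoustic
    mask = [[False] * total for _ in range(total)]
    for i in range(total):
        for j in range(total):
            if j < num_prefix:
                mask[i][j] = True            # everyone sees the prefix
            elif i >= num_prefix and j <= i:
                mask[i][j] = True            # causal over acoustic tokens
    return mask


# Toy usage with made-up semantic token ids.
rng = random.Random(0)
s = [3, 3, 7, 7, 7, 1, 4, 4, 9, 9]
s_mask, masked_positions = mask_spans(s, ratio=0.2, span=2, rng=rng)
attn = prefix_attention_mask(num_prefix=len(s_mask), num_acoustic=4)
print(s_mask, masked_positions, len(attn))
```

With such a mask, $\mathcal{L}_{mask}$ would be computed only at `masked_positions` of the prefix, and $\mathcal{L}_{ar}$ at the acoustic positions.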

- External Language Model

During MPLM generation, the diversity of language model sampling can cause unnatural pronunciation and degraded speech quality
- This is because the MPLM lacks guidance during the generation process
  1. Meanwhile, following Wav2Vec 2.0, adjacent speech frames within a speech segment of a specific length share the same local context
  2. That is, a speech frame can be predicted from the frames at previous time steps
- Based on this, the paper introduces the External Language Model (ELM) to provide contextual guidance during generation and to capture local acoustic relations
  1. The ELM uses window attention to encode local contextual information and to predict the distribution $p(a_{t}^{1}|\mathbf{a}_{t-w:t-1}^{1})$ with window length $w$
  2. The objective of the ELM is:
    (Eq. 3) $\mathcal{L}_{war}=-\log \prod_{t=0}^{T_{a}-1}p_{ELM}\left(a_{t}^{1}| \mathbf{a}_{t-w:t-1}^{1},t\right)$
- The MPLM and the ELM are trained separately; at inference they collaborate to generate acoustic tokens conditioned on the local context of the preceding acoustic tokens
- This collaboration is performed by shallow fusion with fusion weight $\lambda$:
  (Eq. 4) $a_{t}^{1}=\arg\max_{a_{t}^{1}}\left[\log p_{MPLM}\left( a_{t}^{1}|\mathbf{a}_{1:t-1}^{1},\tilde{\mathbf{a}}^{1}, \mathbf{s},\tilde{\mathbf{s}},t\right) +\lambda \log p_{ELM}\left(a_{t}^{1}| \mathbf{a}_{t-w:t-1}^{1},t\right)\right]$
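
A single decoding step of the shallow fusion in (Eq. 4) can be sketched as below. The logits are made-up numbers standing in for the outputs of the two models, and the helper names `log_softmax`, `shallow_fusion_step`, and `elm_window` are hypothetical; this only illustrates the score combination and the windowed history, not the actual networks.

```python
import math
from typing import List


def log_softmax(logits: List[float]) -> List[float]:
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def shallow_fusion_step(
    mplm_logits: List[float], elm_logits: List[float], lam: float
) -> int:
    """Return argmax over tokens of log p_MPLM + lam * log p_ELM."""
    lp_mplm = log_softmax(mplm_logits)
    lp_elm = log_softmax(elm_logits)
    scores = [a + lam * b for a, b in zip(lp_mplm, lp_elm)]
    return max(range(len(scores)), key=scores.__getitem__)


def elm_window(history: List[int], w: int) -> List[int]:
    """Return the last w generated tokens, i.e. a_{t-w:t-1}, seen by the ELM."""
    return history[-w:] if w > 0 else []


# Toy vocabulary of 3 tokens: the MPLM slightly prefers token 0,
# while the ELM strongly prefers token 1.
mplm_logits = [2.0, 1.9, 0.0]
elm_logits = [0.0, 2.0, 0.0]

print(shallow_fusion_step(mplm_logits, elm_logits, lam=0.0))  # 0
print(shallow_fusion_step(mplm_logits, elm_logits, lam=0.3))  # 1
print(elm_window([5, 8, 2, 2, 9], w=2))                       # [2, 9]
```

With $\lambda=0$ the choice reduces to the MPLM alone; a positive $\lambda$ lets the local-context model overturn a close MPLM decision, which is the guidance effect described above.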

3. Experiments

- Settings

Dataset : LibriTTS
Comparisons : YourTTS, AudioLM

- Results

Overall, LM-VC achieves the best performance among the compared models

Validation of ELM
- Across window lengths $w=10,20,30,40,50$ and fusion weights $\lambda=0.1,0.3,0.5,0.8,1$, LM-VC achieves its best performance with $w=20, \lambda=0.3$

Varying Duration
- The best intelligibility is achieved when the speaker prompt duration is 3~4s
