[Paper Review] CASC-XVC: Zero-Shot Cross-Lingual Voice Conversion with Content Accordant and Speaker Contrastive Losses

feVeRin 2025. 5. 19. 17:35

CASC-XVC: Zero-Shot Cross-Lingual Voice Conversion with Content Accordant and Speaker Contrastive Losses

Cross-lingual voice conversion is limited by language mismatch and train-test inconsistency
CASC-XVC
- Incorporates a Content Accordant loss and a Speaker Contrastive loss, and introduces a shared self-supervised learning representation and information perturbation for content disentanglement
- Applies a cross-lingual fine-tuning process that updates the modules using utterance pairs from different languages
Paper (ICASSP 2025) : Paper Link

1. Introduction

Voice Conversion (VC) aims to modify the speaker aspects of speech while preserving its linguistic content
- Cross-lingual Voice Conversion (XVC) is the VC setting in which the source and target speakers use different languages
  - BUT, multilingual parallel speech data from the same speaker is scarce, so XVC shows lower quality than monolingual VC
- Zero-shot XVC in particular requires effectively disentangling content and speaker information in speech
  1. Phonetic Posterior Grams (PPG) or a Variational AutoEncoder (VAE) can be used for this purpose
    - BUT, PPG extraction relies on a pre-trained Automatic Speech Recognition (ASR) model, which makes it hard to extend to many languages
  2. VAE-based methods depend on a pre-defined bottleneck structure such as Vector Quantization (VQ) and on additional weight parameters, so they fail to reflect the phonetic information and pronunciation patterns that differ across languages
- In addition, existing disentanglement-based methods suffer from train-test inconsistency
  1. During training, the source and target come from the same speaker because the model reconstructs its input, whereas at test time the target speech comes from a different speaker, which degrades disentanglement
  2. A cyclic consistency loss can be introduced to address this, but it is not entirely applicable to the XVC task

-> Hence the paper proposes CASC-XVC, a zero-shot XVC model that uses a Content Accordant loss and a Speaker Contrastive loss

CASC-XVC
- Introduces a Content Accordant (CA) loss that forces the Content Encoder to form a consistent prior distribution
- Incorporates a Speaker Contrastive (SC) loss that guides the Speaker Encoder to produce a consistent speaker representation when the same speaker speaks in different languages
- Replaces PPG with a Self-Supervised Learning (SSL) representation extracted from pre-trained XLSR-53 and introduces information perturbation to improve disentanglement
- Additionally applies a Cross-Lingual Fine-Tuning process that randomly selects utterance pairs from speakers of different languages and updates the modules used at inference

< Overview of CASC-XVC >

A zero-shot XVC model that uses the CA loss, the SC loss, and Cross-Lingual Fine-Tuning
As a result, it achieves better conversion performance than existing methods

2. Method

CASC-XVC follows the VITS structure of FreeVC
- The paper builds on XLSR-53, a multilingual SSL representation extractor, and a jointly trained ECAPA-TDNN speaker encoder, and applies information perturbation to support disentanglement
  - A mel-spectrogram predictor is additionally introduced to constrain the intermediate representation
- After multilingual pre-training, cross-lingual fine-tuning is performed with the CA and SC losses

- Model Architecture

Content Encoder
- The Content Encoder consists of the XLSR-53 model and a projection layer
  1. XLSR-53 first extracts a 1024-dimensional frame-level SSL representation from the speech waveform and feeds it to the projection layer
  2. Like the posterior encoder, the projection layer maps the SSL representation to a $d$-dimensional mean $\mu_{\theta}$ and variance $\sigma^{2}_{\theta}$, giving the normal prior distribution $\mathcal{N}(\mathbf{z}_{\theta};\mu_{\theta},\sigma^{2}_{\theta})$
- A normalizing flow made of multiple affine coupling layers allows an invertible transformation, by the change-of-variables rule, from the normal prior distribution to the more complex distribution $p_{\theta}(\mathbf{z}_{\phi}|\mathbf{c})$ under the content condition $\mathbf{c}$
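
As a rough illustration of the invertibility mentioned above, the sketch below implements one affine coupling step in plain Python. It is not the paper's implementation: `coupling_params` is a hypothetical stand-in for the learned network, the scalar `cond` is a placeholder for the conditioning input, and the vector length is assumed to be even.

```python
import math
import random


def coupling_params(z_a, cond):
    """Hypothetical stand-in for the learned network of a coupling layer.

    Any deterministic function of (z_a, cond) keeps the layer invertible.
    """
    h = math.tanh(sum(z_a) + cond)
    shift = [0.3 * h * (i + 1) for i in range(len(z_a))]
    log_scale = [0.1 * h for _ in z_a]
    return shift, log_scale


def coupling_forward(z, cond):
    """Keep the first half, affinely transform the second half."""
    half = len(z) // 2  # assumes an even-length vector
    z_a, z_b = z[:half], z[half:]
    shift, log_scale = coupling_params(z_a, cond)
    y_b = [b * math.exp(ls) + t for b, t, ls in zip(z_b, shift, log_scale)]
    log_det = sum(log_scale)  # log |det Jacobian| for the change of variables
    return z_a + y_b, log_det


def coupling_inverse(y, cond):
    """Undo coupling_forward exactly, using the untouched first half."""
    half = len(y) // 2
    y_a, y_b = y[:half], y[half:]
    shift, log_scale = coupling_params(y_a, cond)
    z_b = [(b - t) * math.exp(-ls) for b, t, ls in zip(y_b, shift, log_scale)]
    return y_a + z_b


def sample_prior(mu, sigma, rng):
    """Draw z ~ N(mu, sigma^2) with the reparameterization z = mu + sigma * eps."""
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]


rng = random.Random(0)
z = sample_prior([0.0, 1.0, -1.0, 0.5], [1.0, 0.5, 2.0, 0.1], rng)
y, log_det = coupling_forward(z, cond=0.7)
z_back = coupling_inverse(y, cond=0.7)
```

Because the first half passes through unchanged, the inverse can recompute the same shift and scale, which is what makes stacking several such layers invertible as a whole.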
Mel-Spectrogram Predictor
- The mel-spectrogram predictor consists of ReLU and linear layers and predicts an 80-dimensional mel-spectrogram from the intermediate representation $\mathbf{z}_{\phi}$, which is the posterior encoder output
- In an end-to-end framework, the model may lack explicit supervision when predicting the intermediate representation
  1. This can cause unstable training, slow convergence, and inadequate modeling of high-frequency information
  2. The paper therefore adopts the mel-spectrogram predictor together with an additional mel-spectrogram reconstruction loss so that the intermediate representation captures mel-related information well
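
A minimal sketch of such a predictor, assuming a linear, ReLU, linear stack applied per frame (the review only states "ReLU and linear layers", so the exact ordering, the hidden size, and the random initialization here are assumptions):

```python
import random

N_MELS = 80  # mel bins, as stated in the review


def make_linear(in_dim, out_dim, rng):
    """Randomly initialized (weight, bias) pair; illustrative values only."""
    weight = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weight, bias


def linear(x, layer):
    weight, bias = layer
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def relu(x):
    return [max(0.0, v) for v in x]


def predict_mel(z_frames, hidden_layer, out_layer):
    """Map each latent frame z_phi[t] to one 80-bin mel frame."""
    return [linear(relu(linear(z, hidden_layer)), out_layer) for z in z_frames]


def l1_loss(pred, target):
    """Mean absolute error over all frames and bins."""
    total = sum(abs(p - t) for pf, tf in zip(pred, target) for p, t in zip(pf, tf))
    count = sum(len(frame) for frame in target)
    return total / count


rng = random.Random(0)
latent_dim, hidden_dim, n_frames = 8, 16, 5  # hypothetical sizes
hidden_layer = make_linear(latent_dim, hidden_dim, rng)
out_layer = make_linear(hidden_dim, N_MELS, rng)
z_frames = [[rng.gauss(0.0, 1.0) for _ in range(latent_dim)] for _ in range(n_frames)]
mel_pred = predict_mel(z_frames, hidden_layer, out_layer)
```

The L1 distance between `mel_pred` and the ground-truth mel-spectrogram would then act as the extra supervision signal on the intermediate representation.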

- Training and Inference

The paper adopts a two-stage training strategy: multilingual pre-training followed by cross-lingual fine-tuning
- In the pre-training stage, the SSL representation contains both content and speaker information, so the information perturbation of NANSY is introduced
  1. The multilingual speech is perturbed to modify its speaker identity before being passed to XLSR-53
  2. The model is then trained to reconstruct the original speech, which makes it discard the unwanted speaker information
- In the pre-training stage, the generator training loss $\mathcal{L}_{pt}$ consists of the adversarial loss $\mathcal{L}_{adv}$, the feature matching loss $\mathcal{L}_{fm}$, the KL loss $\mathcal{L}_{KL}$ between $p_{\theta}(\mathbf{z}_{\phi}|\mathbf{c})$ and the posterior distribution $q_{\phi}(\mathbf{z}_{\phi}|\mathcal{F}(\mathbf{y}))=\mathcal{N}(\mathbf{z}_{\phi};\mu_{\phi},\sigma^{2}_{\phi})$, and the reconstruction loss $\mathcal{L}_{rec}$
  1. Because the paper uses a mel predictor, $\mathcal{L}_{rec}$ is divided into two parts by a weight parameter $\lambda_{rec}\,\,(0<\lambda_{rec}\leq 1)$
  2. The two parts are the $L1$ norm between the mel-spectrogram $\mathbf{Y}$ of the input waveform and the mel-spectrogram $\hat{\mathbf{Y}}$ of the reconstructed waveform, and the $L1$ norm between $\mathbf{Y}$ and the predicted mel-spectrogram
    - The latter is discarded in the fine-tuning stage
  3. After multilingual pre-training, the model can effectively separate the speaker and content representations of each language
- At inference, the posterior encoder is discarded; the normalizing flow is inverted to transform $\mathbf{z}_{\theta}$, sampled from $\mathcal{N}(\mathbf{z}_{\theta};\mu_{\theta},\sigma_{\theta}^{2})$, into the intermediate representation $\hat{\mathbf{z}}$, conditioned on the target speaker embedding
- In the cross-lingual fine-tuning stage, the mel predictor is used to predict the mel-spectrogram of the converted speech from $\hat{\mathbf{z}}$, and the predictor weights are frozen
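
The review does not give the exact form of the split, so the sketch below assumes a convex combination, $\lambda_{rec}$ on the waveform term and $(1-\lambda_{rec})$ on the predictor term. This is one reading that is consistent with the range $0<\lambda_{rec}\leq 1$ and with the predictor term vanishing during fine-tuning; the function and argument names are hypothetical.

```python
def l1(a, b):
    """Mean absolute error between two equally shaped 2-D lists (frames x bins)."""
    diffs = [abs(x - y) for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b)]
    return sum(diffs) / len(diffs)


def reconstruction_loss(mel_true, mel_from_waveform, mel_predicted, lambda_rec):
    """Assumed form: lambda_rec * L1(Y, Y_hat) + (1 - lambda_rec) * L1(Y, Y_pred)."""
    if not 0.0 < lambda_rec <= 1.0:
        raise ValueError("lambda_rec must lie in (0, 1]")
    waveform_term = l1(mel_true, mel_from_waveform)
    if lambda_rec == 1.0:
        # Fine-tuning case under this assumption: the predictor term is dropped.
        return waveform_term
    predictor_term = l1(mel_true, mel_predicted)
    return lambda_rec * waveform_term + (1.0 - lambda_rec) * predictor_term


Y = [[1.0, 2.0]]       # mel of the input waveform
Y_hat = [[1.5, 2.5]]   # mel of the reconstructed waveform
Y_pred = [[2.0, 4.0]]  # output of the mel predictor
pretrain_loss = reconstruction_loss(Y, Y_hat, Y_pred, lambda_rec=0.5)
finetune_loss = reconstruction_loss(Y, Y_hat, Y_pred, lambda_rec=1.0)
```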

- Cross-Lingual Fine-Tuning with Content Accordant and Speaker Contrastive Losses

To resolve the train-test inconsistency, the paper integrates inference into training
- Suppose there are two languages, $A$ and $B$
  1. Let $\mathbf{y}^{A}_{src}$ be an utterance of a source speaker in language $A$, and $\mathbf{y}_{tgt}^{B}$ an utterance of a randomly selected target speaker in language $B$
    - $(\mathbf{y}_{src}^{A}, \mathbf{y}_{tgt}^{B})$ : XVC pair
  2. The model then runs in inference mode and converts $\mathbf{y}^{A}_{src}$ to the target speaker using the speaker embedding of $\mathbf{y}_{tgt}^{B}$
    - This is consistent with the XVC inference stage, where the target speaker uses a different language
  3. The converted speech is denoted $\hat{\mathbf{y}}_{tgt}^{A}$ and the reconstructed speech $\hat{\mathbf{y}}_{src}^{A}$
- However, $\hat{\mathbf{y}}_{tgt}^{A}$ has no ground-truth target, so a loss cannot be computed on it directly
  1. In particular, the language difference affects the speaker embedding, so an unintended foreign accent can appear
  2. The paper therefore constrains $\hat{\mathbf{y}}^{A}_{tgt}$ in both content and speaker: $\hat{\mathbf{y}}_{tgt}^{A}$ must carry content information close to that of $\mathbf{y}_{src}^{A}$
  3. In addition, $\hat{\mathbf{y}}^{A}_{tgt}$ must carry speaker characteristics close to those of $\mathbf{y}_{tgt}^{B}$, so the effect of language on speaker encoding has to be minimized
- Based on this, the paper constructs the CA loss and the SC loss
  1. The CA loss $\mathcal{L}_{CA}$ is the KL-divergence between a new prior distribution $p_{\psi}(\mathbf{z}_{\phi}|\mathbf{c}')$ and the posterior distribution $q_{\phi}(\mathbf{z}_{\phi}|\mathcal{F}(\mathbf{y}^{A}_{src}))$
  2. $p_{\psi}(\mathbf{z}_{\phi}|\mathbf{c}')$ is obtained by feeding $\hat{\mathbf{y}}_{tgt}^{A}$ into the content encoder, and the content condition $\mathbf{c}'$ is re-extracted from $\hat{\mathbf{y}}_{tgt}^{A}$
    - This differs from the $\mathbf{c}$ of $\mathbf{y}^{A}_{src}$
  3. Minimizing $\mathcal{L}_{CA}$ thus makes the content encoder yield accordant content information:
    (Eq. 1) $\mathcal{L}_{CA}=\log q_{\phi}(\mathbf{z}_{\phi}|\mathcal{F}(\mathbf{y}_{src}^{A}))-\log p_{\psi}(\mathbf{z}_{\phi}|\mathbf{c}')$
    - $\mathbf{z}_{\phi}$ : a value sampled from $q_{\phi}(\mathbf{z}_{\phi}|\mathcal{F}(\mathbf{y}_{src}^{A}))$
- The SC loss $\mathcal{L}_{SC}$ is a contrastive distance that compares the speaker embedding of the converted speech with those of $\mathbf{y}^{B}_{tgt}$ and $\mathbf{y}_{src}^{A}$
  1. For each XVC pair, the mel-spectrogram of the converted speech $\hat{\mathbf{y}}_{tgt}^{A}$ is passed to the speaker encoder to obtain the speaker embedding $\mathbf{e}_{tgt}^{A}$
    - The speaker embedding obtained from $\mathbf{y}_{src}^{A}$ is $\mathbf{e}_{src}^{A}$, and the one obtained from $\mathbf{y}_{tgt}^{B}$ is $\mathbf{e}_{tgt}^{B}$
  2. The model should minimize the distance between $\mathbf{e}_{tgt}^{A}$ and $\mathbf{e}_{tgt}^{B}$ while keeping $\mathbf{e}_{tgt}^{A}$ distant from $\mathbf{e}_{src}^{A}$
    - By minimizing $\mathcal{L}_{SC}$, the speaker encoder learns not only to distinguish speakers but also to produce a consistent speaker representation when the same speaker speaks in different languages
  3. The speaker contrastive loss $\mathcal{L}_{SC}$ is therefore:
    (Eq. 2) $\mathcal{L}_{SC}=\max\left(\text{d}(\mathbf{e}_{tgt}^{A},\mathbf{e}_{tgt}^{B})-\text{d}( \mathbf{e}_{tgt}^{A},\mathbf{e}_{src}^{A})+m,0\right)$
    - $\text{d}(\mathbf{a},\mathbf{b})=1-\frac{\mathbf{a}\cdot\mathbf{b}}{||\mathbf{a}||\,||\mathbf{b}||}$ : cosine distance between embeddings $\mathbf{a},\mathbf{b}$ (range $[0,2]$), $m$ : margin
- The training loss $\mathcal{L}_{ft}$ of the cross-lingual fine-tuning stage is:
  (Eq. 3) $\mathcal{L}_{ft}=\mathcal{L}_{rec}+\mathcal{L}_{adv}+\mathcal{L}_{fm}+\mathcal{L}_{CA}+\lambda_{SC}\mathcal{L}_{SC}$
  - $\lambda_{SC}$ : weight parameter
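
Eqs. 1 to 3 can be sketched in plain Python as follows. This is a simplified illustration, not the paper's code: both distributions in the CA loss are treated as diagonal Gaussians (in the actual model the prior passes through the normalizing flow, which adds a log-determinant term), and the margin and $\lambda_{SC}$ values used in the example are hypothetical.

```python
import math


def gaussian_log_prob(z, mu, sigma):
    """Log-density of a diagonal Gaussian N(mu, sigma^2) evaluated at z."""
    return sum(
        -0.5 * math.log(2.0 * math.pi) - math.log(s) - 0.5 * ((v - m) / s) ** 2
        for v, m, s in zip(z, mu, sigma)
    )


def content_accordant_loss(z, post_mu, post_sigma, prior_mu, prior_sigma):
    """Eq. 1 as a single-sample estimate: log q(z | F(y_src)) - log p(z | c').

    z is assumed to be sampled from the posterior q.
    """
    log_q = gaussian_log_prob(z, post_mu, post_sigma)
    log_p = gaussian_log_prob(z, prior_mu, prior_sigma)
    return log_q - log_p


def cosine_distance(a, b):
    """d(a, b) = 1 - cos(a, b), which lies in [0, 2]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def speaker_contrastive_loss(e_conv, e_tgt, e_src, margin):
    """Eq. 2: pull the converted embedding toward the target, away from the source."""
    return max(cosine_distance(e_conv, e_tgt) - cosine_distance(e_conv, e_src) + margin, 0.0)


def fine_tuning_loss(l_rec, l_adv, l_fm, l_ca, l_sc, lambda_sc):
    """Eq. 3: only the SC term carries an explicit weight."""
    return l_rec + l_adv + l_fm + l_ca + lambda_sc * l_sc


# Converted speech already sounds like the target speaker: the hinge is inactive.
good = speaker_contrastive_loss([1.0, 0.0], [1.0, 0.0], [0.0, 1.0], margin=0.3)
# Converted speech still sounds like the source speaker: the loss is positive.
bad = speaker_contrastive_loss([1.0, 0.0], [0.0, 1.0], [1.0, 0.0], margin=0.3)
```

The hinge in Eq. 2 means the SC term stops contributing once the converted embedding is closer to the target than to the source by at least the margin.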

3. Experiments

- Settings

Dataset : VCTK, AISHELL-1
Comparisons : YourTTS, CyclePPG-XVC, AutoCycle-VC

- Results

Overall, CASC-XVC achieves the best performance

Ablation Study
- Removing any of the components degrades performance

In the $t$-SNE visualization, applying the SC loss makes the converted and target speaker distributions appear uniform
- That is, a consistent speaker representation is obtained for the same speaker even across different languages
