[Paper Review] S2VC: A Framework for Any-to-Any Voice Conversion with Self-Supervised Pretrained Representations

feVeRin 2024. 8. 25. 09:26

S2VC: A Framework for Any-to-Any Voice Conversion with Self-Supervised Pretrained Representations

Any-to-any voice conversion must handle conversion between arbitrary utterances of both seen and unseen speakers
S2VC
- Uses self-supervised features as both the source and target features
- Selects the supervised phoneme posteriorgram (PPG), which is speaker-independent and captures content information, as the baseline feature
Paper (INTERSPEECH 2021) : Paper Link

1. Introduction

Self-Supervised Learning (SSL) has the advantage of being able to exploit unlabeled data
- In particular, SSL models pretrained on speech corpora provide representations that can be used in downstream tasks
- Voice Conversion (VC) aims to convert a source utterance to the voice of a target speaker while preserving the original phonetic content
  1. VC is generally performed by disentangling content and speaker information from the source/target utterances
  2. Supervised pretrained representations such as the Phoneme PosteriorGram (PPG) provide information well suited to the VC task
    - This is because PPGs are speaker-independent, which makes them suitable for removing speaker characteristics
- Meanwhile, SSL representations can also be used to extend the task to any-to-any VC, as in FragmentVC

-> The paper therefore proposes S2VC, an any-to-any VC model that leverages a variety of pretrained SSL representations

S2VC
- Extracts not only phonetic information but also target speaker information from SSL representations
- Compares several SSL representations, including Autoregressive Predictive Coding (APC), Contrastive Predictive Coding (CPC), and Wav2Vec 2.0

< Overview of S2VC >

Combines various SSL representations with FragmentVC
As a result, using SSL representations achieves better conversion performance than the conventional PPG

2. Method

S2VC extracts the source/target features with pretrained SSL models

- Baseline: FragmentVC

S2VC is built on top of FragmentVC
- Architecturally, FragmentVC consists of a source encoder, a target encoder, a cross attention module, and a decoder
- As shown in the figure below, cross attention takes the output feature $Q$ from the source encoder and two output features $K, V$ from the target encoder
  1. The target encoder output sequence $K$ is attended by the source encoder output $Q$
  2. In this architecture, the cross attention module learns to align source features to target features with similar speech content
- Finally, the decoder generates the converted mel-spectrogram from the attention-augmented features, i.e. the values $V$ aggregated with the attention weights
- The encoders learn to disentangle content and speaker information without any explicit constraint
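
The cross attention described above can be sketched in a few lines of NumPy. This is a minimal single-head illustration, not the FragmentVC implementation: the function names, the feature dimension of 8, and the random inputs standing in for encoder outputs are all assumptions made for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(q, k, v):
    """Single-head scaled dot-product cross attention.

    q: (T_src, d) queries from the source encoder.
    k, v: (T_tgt, d) keys and values from the target encoder.
    Returns the attended features (T_src, d) and the weights (T_src, T_tgt).
    """
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = softmax(scores, axis=-1)
    return weights @ v, weights

# Random stand-ins for encoder outputs (hypothetical shapes).
rng = np.random.default_rng(0)
q = rng.standard_normal((5, 8))   # 5 source frames
k = rng.standard_normal((7, 8))   # 7 target frames
v = rng.standard_normal((7, 8))

out, weights = cross_attention(q, k, v)
print(out.shape, weights.shape)
```

Each output frame is a weighted mix of target values, so the output has the length of the source utterance while its content is drawn from target frames. That is the sense in which source features are aligned to target features.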

- Modifications

S2VC improves the cross-attention module to better align source/target features with the following two changes
1. Self-attention pooling pulls the representation encoded by the source encoder closer to the representation encoded by the target encoder
2. The attention information bottleneck removes redundant information from the representations encoded as $Q, K$, so that attention considers only phonetic content information
Self-Attention Pooling
- Self-attention pooling is effective at extracting time-invariant features, so the paper uses it to summarize the target encoder's representation
- The pooled representation is then fed into the source encoder, bringing the source encoder's representation closer to the target encoder's
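
A rough sketch of self-attention pooling follows. The scoring vector `w` would be learned in practice and is random here. Adding the pooled vector to every source frame is only an assumption about how it is injected into the source encoder, since these notes do not specify the mechanism.

```python
import numpy as np

def self_attention_pooling(h, w):
    """Pool a (T, d) sequence into a single (d,) vector.

    Each frame gets a scalar score h_t . w, the scores are softmax-normalized
    over time, and the frames are averaged with those weights.
    """
    scores = h @ w
    scores = scores - scores.max()            # numerical stability
    alpha = np.exp(scores) / np.exp(scores).sum()
    return alpha @ h, alpha

rng = np.random.default_rng(0)
target_h = rng.standard_normal((7, 8))        # stand-in target encoder output
source_h = rng.standard_normal((5, 8))        # stand-in source encoder output
w = rng.standard_normal(8)                    # hypothetical scoring vector

pooled, alpha = self_attention_pooling(target_h, w)

# Assumption: inject the pooled target summary by adding it to every source frame.
source_conditioned = source_h + pooled
print(pooled.shape, source_conditioned.shape)
```

Because the pooled vector is a weighted sum over frames, reordering the target frames leaves it unchanged, which is what makes it a time-invariant summary.
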
Attention Information Bottleneck
- In AdaIN-VC, instance normalization was shown to be effective at removing speaker-dependent information
- Similarly, AutoVC extracts speaker-independent content information by carefully constraining the hidden dimension of its encoder layers
- The paper therefore incorporates instance normalization on both $Q$ and $K$ into the attention layer, and then adds a bottleneck layer to remove speaker information
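
The effect of instance normalization followed by a bottleneck can be illustrated with a toy example. Modeling "speaker information" as a per-channel gain and offset applied to the whole utterance is a deliberate simplification, and the bottleneck weights and dimensions (8 to 4) are hypothetical.

```python
import numpy as np

def instance_norm(x, eps=1e-5):
    """Normalize every channel over the time axis. x: (T, d)."""
    mean = x.mean(axis=0, keepdims=True)
    var = x.var(axis=0, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def bottleneck(x, w_down):
    """Linear projection to a smaller dimension. w_down: (d, d_b), d_b < d."""
    return x @ w_down

rng = np.random.default_rng(0)
frames = rng.standard_normal((50, 8))         # frame-to-frame (content-like) variation

# Toy assumption: two "speakers" differ only in utterance-level channel statistics.
gain = rng.uniform(0.5, 2.0, size=8)
offset = rng.standard_normal(8)
utt_a = frames
utt_b = frames * gain + offset

w_down = rng.standard_normal((8, 4)) / np.sqrt(8)   # hypothetical bottleneck weights
q_a = bottleneck(instance_norm(utt_a), w_down)
q_b = bottleneck(instance_norm(utt_b), w_down)

# Largest difference between the two normalized, projected sequences.
print(np.abs(q_a - q_b).max())
```

In this toy setting the two utterances map to nearly identical queries, since the normalization cancels the utterance-level statistics. Real speaker characteristics are not purely affine, which is presumably why the paper adds a bottleneck on top of the normalization.
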

- SSL Representations

APC, CPC, and Wav2Vec 2.0 are considered as SSL representations
- APC learns representations in a manner similar to an RNN-based language model
  - It takes mel-spectrograms as input and autoregressively predicts future frames conditioned on past frames
- In contrast, CPC and Wav2Vec 2.0 take waveforms as input
  1. CPC also operates autoregressively, but prediction is carried out in a compact latent space and the model is trained by optimizing a probabilistic contrastive loss
  2. Wav2Vec 2.0 improves on CPC by replacing autoregressive prediction with a BERT-like masked language model
- The original FragmentVC uses wav2vec 2.0 representations as the source encoder input and the target speaker's mel-spectrogram as the target encoder input
  - In contrast, S2VC applies various combinations of APC, CPC, and wav2vec 2.0 representations to the source/target encoders
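
To make the "probabilistic contrastive loss" used by CPC more concrete, here is a simplified InfoNCE-style loss in NumPy. It is a sketch under assumptions: real CPC uses a learned transform per prediction step and samples its negatives, whereas here the predictions are given directly and the other rows of the batch serve as negatives.

```python
import numpy as np

def info_nce(predictions, targets):
    """Simplified InfoNCE loss.

    predictions: (N, d) predicted future latents.
    targets:     (N, d) true future latents; row i is the positive for
                 prediction i, and all other rows act as negatives.
    """
    logits = predictions @ targets.T                      # (N, N) similarities
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))

rng = np.random.default_rng(0)
z = rng.standard_normal((8, 16))              # stand-in future latents

loss_aligned = info_nce(z, z)                 # predictions match their positives
loss_mismatched = info_nce(np.roll(z, 1, axis=0), z)   # predictions point at the wrong rows
print(loss_aligned, loss_mismatched)
```

The loss is low when each prediction is most similar to its own target and high otherwise, so minimizing it pushes the model to pick out the true future latent among distractors.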

3. Experiments

- Settings

Dataset : VCTK
Comparisons : FragmentVC

- Results

Comparing the mel-spectrogram, PPG, APC, CPC, and wav2vec 2.0 representations, the CPC+CPC combination (CPC for both source and target features) performs best

In terms of MOS, the CPC representation is also the best

CPC also shows the best performance in unseen-to-unseen conversion

Speaker Information Probing Analysis
- Speaker classification (SC) is run on the query $Q$, key $K$, and value $V$ of the model that uses CPC as both source and target features
- The SC accuracy on the query and key features is quite low
  - That is, instance normalization and the bottleneck effectively remove speaker-dependent information
- For the value features, accuracy is higher when CPC is used
  - That is, CPC provides the rich speaker information needed for VC
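
The idea behind a speaker probing analysis can be sketched with synthetic data. Everything here is an assumption for illustration: the features are random rather than taken from S2VC, and a nearest-centroid classifier stands in for whatever probe the paper actually uses.

```python
import numpy as np

def nearest_centroid_probe(train_x, train_y, test_x, test_y):
    """Fit one centroid per speaker and report test accuracy."""
    classes = np.unique(train_y)
    centroids = np.stack([train_x[train_y == c].mean(axis=0) for c in classes])
    # Squared Euclidean distance from each test vector to each centroid.
    dists = ((test_x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    pred = classes[dists.argmin(axis=1)]
    return float((pred == test_y).mean())

rng = np.random.default_rng(0)
n_speakers, n_per_speaker, d = 4, 50, 8
labels = np.repeat(np.arange(n_speakers), n_per_speaker)

# Synthetic "value-like" features carry a speaker-specific offset;
# synthetic "query-like" features carry none.
speaker_means = 3.0 * rng.standard_normal((n_speakers, d))
value_like = rng.standard_normal((len(labels), d)) + speaker_means[labels]
query_like = rng.standard_normal((len(labels), d))

idx = rng.permutation(len(labels))
train, test = idx[:100], idx[100:]

acc_value = nearest_centroid_probe(value_like[train], labels[train],
                                   value_like[test], labels[test])
acc_query = nearest_centroid_probe(query_like[train], labels[train],
                                   query_like[test], labels[test])
print(acc_value, acc_query)
```

A probe accuracy near chance (0.25 for four speakers) suggests the features carry little speaker information, while a high accuracy suggests the speaker identity is easy to read out. The analysis above applies this reading to $Q$, $K$, and $V$.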

Ablation Study
- Removing any of the components degrades the performance of S2VC
