[Paper Review] Wav2Vec-VC: Voice Conversion via Hidden Representations of Wav2Vec 2.0

feVeRin 2024. 9. 4. 09:01

Wav2Vec-VC: Voice Conversion via Hidden Representations of Wav2Vec 2.0

wav2vec 2.0 representations can be used for voice conversion
Wav2Vec-VC
- Aggregates the hidden representations of the wav2vec 2.0 layers to improve the performance of disentanglement-based voice conversion
- Given a target utterance, weights the hidden representations so that they suit speaker-related and content-related tasks
Paper (ICASSP 2024) : Paper Link

1. Introduction

Self-supervised learning (SSL) representations such as HuBERT and wav2vec 2.0 can substantially improve speech processing performance
- In particular, the layer-wise representations of wav2vec 2.0 follow an acoustic-linguistic hierarchy
  - Mid-to-high layer representations support phoneme classification, while low-layer representations support speaker-level clustering
- BUT, the last layer of wav2vec 2.0, which is the commonly used representation, is not the optimal choice
  - In particular, it yields unsatisfactory similarity in voice conversion (VC) models such as FragmentVC and S2VC

-> Hence Wav2Vec-VC is proposed to make effective use of wav2vec 2.0 representations in the VC task

Wav2Vec-VC
- Builds on disentanglement-based VC and is designed to use the hidden representations from all layers of wav2vec 2.0
- Uses pre-trained layer weights to extract the information needed by the speaker/content encoders, and applies a weighted sum over all hidden representations to learn the latent representations effectively

< Overview of Wav2Vec-VC >

The all-layer representations of wav2vec 2.0 are aggregated and applied to disentanglement-based VC
As a result, the model achieves better conversion performance than existing methods

2. Method

- Wav2Vec-VC Framework

Wav2Vec-VC consists of three main parts
1. Speaker Weighting on Wav2Vec 2.0
  - When a target utterance is fed in, the resulting representations from all wav2vec 2.0 layers are aggregated through a weighted sum
    - The weights here are predefined in the pre-training phase
  - This speaker weighting is trained to aggregate all hidden representations so that speaker-related tasks are performed well
2. Content Weighting on Wav2Vec 2.0
  - Likewise, when a source utterance is fed in, the output representations from all wav2vec 2.0 layers are aggregated through a weighted sum
    - Here a set of predefined layer weights different from the speaker weights is used
  - As a result, this content weighting is tuned to perform content-related tasks well
3. Encoding and Decoding
  - The speaker-weighted sum representation of the target speaker and the content-weighted sum representation of the source speaker are used as inputs to the speaker encoder and the content encoder, respectively
    - The speaker/content encoders encode the corresponding information
  - Finally, the decoder combines the two encoded representations to perform voice conversion
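
The two weightings above are the same operation with different weight vectors. The following is a minimal sketch in plain Python, not the authors' implementation: the function name `weighted_layer_sum`, the toy hidden states, and both weight vectors are hypothetical values chosen for illustration. For brevity the same toy layer stack is reused for both calls, whereas in the framework the speaker weights are applied to the target utterance and the content weights to the source utterance.

```python
from typing import List

# One layer's output: a sequence of frames, each frame a feature vector.
Layer = List[List[float]]


def weighted_layer_sum(layers: List[Layer], weights: List[float]) -> Layer:
    """Aggregate per-layer representations Z^1..Z^n into sum_i w^i * Z^i."""
    if len(layers) != len(weights):
        raise ValueError("need exactly one weight per layer")
    num_frames, dim = len(layers[0]), len(layers[0][0])
    out = [[0.0] * dim for _ in range(num_frames)]
    for layer, w in zip(layers, weights):
        for t in range(num_frames):
            for d in range(dim):
                out[t][d] += w * layer[t][d]
    return out


# Toy stack: 3 layers, 2 frames, 2 features (hypothetical values).
hidden_states = [
    [[1.0, 0.0], [1.0, 0.0]],  # low layer
    [[0.0, 1.0], [0.0, 1.0]],  # middle layer
    [[2.0, 2.0], [2.0, 2.0]],  # high layer
]
speaker_weights = [0.75, 0.25, 0.0]  # hypothetical: favours low layers
content_weights = [0.0, 0.5, 0.5]    # hypothetical: favours mid-high layers

z_speaker = weighted_layer_sum(hidden_states, speaker_weights)
z_content = weighted_layer_sum(hidden_states, content_weights)
```

Because only the weight vector changes, a single frozen feature extractor can serve both encoders; the layer weights decide which part of the acoustic-linguistic hierarchy each encoder sees.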

- Determining the Wav2Vec Layer Weights

Wav2Vec-VC determines the layer weights by training on speaker identification for the speaker weighting and on speech recognition for the content weighting
- First, the speaker classifier applies mean-pooling to the frame-level embeddings and then a linear transformation trained with a cross-entropy loss
  - Meanwhile, the speech recognizer uses a 2-layer, 1024-unit BLSTM with a CTC loss
- Layer Weights for Speaker Weighting
  1. Let $\mathbf{Z}^{i}$ be the representation sequence of the $i$-th layer of an $n$-layer wav2vec 2.0, and let $\mathbf{w}_{s}=\{w_{s}^{1},w_{s}^{2},...,w_{s}^{n}\}$ be the layer weights to be learned through speaker identification (SID)
  2. Let $\mathbf{Z}_{s}=\sum_{i=1}^{n}w_{s}^{i}\cdot \mathbf{Z}^{i}$ be the weighted-sum representation passed to SID; then $\mathbf{w}_{s}$ is determined by minimizing the cross-entropy loss $H(p,q)$:
    (Eq. 1) $H(p,q)=-\sum_{u\in \text{speaker}}p(u)\log q(u|\mathbf{Z}_{s})$
    (Eq. 2) $\mathbf{w}_{s}^{*}=\arg\min_{\mathbf{w}_{s}}H(p,q)$
    - $u$ : random variable over speakers, $p(u)$ : ground-truth probability distribution
    - $q(u|\mathbf{Z}_{s})$ : prediction of the SID model
- Layer Weights for Content Weighting
  1. Let $\mathbf{w}_{c}=\{w_{c}^{1},w_{c}^{2},...,w_{c}^{n}\}$ be the layer weights to be learned through automatic speech recognition (ASR)
  2. Then $\mathbf{Z}_{c}=\sum_{i=1}^{n} w_{c}^{i}\cdot \mathbf{Z}^{i}$ is the weighted-sum representation passed to ASR, and $\mathbf{w}_{c}$ is determined by minimizing the CTC loss $\mathcal{L}_{CTC}$:
    (Eq. 3) $\mathbf{w}_{c}^{*}=\arg\min_{\mathbf{w}_{c}}\mathcal{L}_{CTC}(S,\mathbf{v})$
    - $S$ : ground-truth transcript of the input audio
    - $\mathbf{v}$ : predicted label sequence of the ASR model
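
To make Eq. 1 and Eq. 2 concrete, here is a rough sketch of how the SID loss depends on the layer weights $\mathbf{w}_{s}$. It is an illustration under several assumptions, not the paper's training code: the linear classifier (`W`, `b`) is held fixed, the gradient is estimated by finite differences instead of backpropagation, and all values, names, and the learning rate are hypothetical. The pipeline follows the description above: weighted sum over layers, mean-pooling over frames, a linear transformation, then cross-entropy against a one-hot speaker label.

```python
import math
from typing import List

Layer = List[List[float]]  # [frame][feature]


def sid_loss(layers: List[Layer], w_s: List[float],
             W: List[List[float]], b: List[float], speaker: int) -> float:
    """Cross-entropy H(p, q) of a linear SID classifier on mean-pooled Z_s."""
    frames, dim = len(layers[0]), len(layers[0][0])
    # Z_s = sum_i w_s^i * Z^i, then mean-pool over the time axis.
    pooled = [
        sum(w * layer[t][d] for layer, w in zip(layers, w_s)
            for t in range(frames)) / frames
        for d in range(dim)
    ]
    logits = [sum(wk * x for wk, x in zip(row, pooled)) + bk
              for row, bk in zip(W, b)]
    # Numerically stable log-sum-exp; p(u) is one-hot at `speaker`.
    top = max(logits)
    log_norm = top + math.log(sum(math.exp(l - top) for l in logits))
    return log_norm - logits[speaker]


def numeric_grad(f, w: List[float], eps: float = 1e-5) -> List[float]:
    """Central finite-difference gradient of f at w."""
    grad = []
    for i in range(len(w)):
        up = w[:i] + [w[i] + eps] + w[i + 1:]
        down = w[:i] + [w[i] - eps] + w[i + 1:]
        grad.append((f(up) - f(down)) / (2 * eps))
    return grad


# Toy setup (hypothetical): 2 layers, 2 frames, 2 features, 2 speakers.
layers = [
    [[1.0, 0.0], [1.0, 0.0]],
    [[0.0, 1.0], [0.0, 1.0]],
]
W, b = [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]
w_s = [0.5, 0.5]


def loss_fn(w: List[float]) -> float:
    return sid_loss(layers, w, W, b, speaker=0)


loss_before = loss_fn(w_s)
grad = numeric_grad(loss_fn, w_s)
w_s_new = [w - 0.1 * g for w, g in zip(w_s, grad)]  # one descent step
loss_after = loss_fn(w_s_new)
```

In this toy case the step shifts weight toward the first layer, the one whose features point at speaker 0, which mirrors how the learned weights end up concentrating on the layers that are informative for the task. The content weights $\mathbf{w}_{c}$ would be obtained the same way, with the CTC loss of Eq. 3 in place of the cross-entropy.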

- Encoder/Decoder Architecture

The speaker/content encoders and the decoder are structured as shown in the figure below
- Here the Conv1D block consists of a single temporal convolution, ELU activation, batch normalization, and a skip connection
  - The Linear block consists of a single linear layer, ReLU activation, and batch normalization
- Speaker/Content Encoders
  1. To extract time-invariant information, the paper computes the mean and standard deviation $(\mu, \sigma)$ of the speaker encoder representation along the time dimension and uses them in combination with the speaker embedding
  2. For content encoding, instance normalization and a bottleneck layer are used to remove speaker-dependent information
- Decoder
  1. To combine the speaker/content embeddings obtained from the two encoders, the paper adopts Adaptive Instance Normalization
  2. Wav2Vec-VC uses only the reconstruction error as its loss function
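
The roles of instance normalization and Adaptive Instance Normalization (AdaIN) can be sketched as below. This is a simplified illustration, not the paper's decoder: it assumes the per-channel $(\mu, \sigma)$ computed over time from the speaker features are used directly as the AdaIN shift and scale, whereas a real model may pass them through additional learned layers. The function names and toy values are hypothetical.

```python
import math
from typing import List, Tuple

Features = List[List[float]]  # [channel][time]
EPS = 1e-5  # small constant for numerical stability (assumed value)


def channel_stats(x: Features) -> Tuple[List[float], List[float]]:
    """Per-channel mean and standard deviation along the time axis."""
    means, stds = [], []
    for channel in x:
        m = sum(channel) / len(channel)
        var = sum((v - m) ** 2 for v in channel) / len(channel)
        means.append(m)
        stds.append(math.sqrt(var + EPS))
    return means, stds


def instance_norm(x: Features) -> Features:
    """Remove each channel's own mean/std (utterance-level statistics)."""
    means, stds = channel_stats(x)
    return [[(v - m) / s for v in ch] for ch, m, s in zip(x, means, stds)]


def adain(content: Features, speaker: Features) -> Features:
    """Normalize content, then re-apply the speaker's channel statistics."""
    mu, sigma = channel_stats(speaker)
    normed = instance_norm(content)
    return [[s * v + m for v in ch] for ch, m, s in zip(normed, mu, sigma)]


# Toy features (hypothetical): 2 channels; time lengths may differ.
content_feat = [[0.0, 2.0, 4.0], [1.0, 1.0, 4.0]]
speaker_feat = [[10.0, 12.0], [-1.0, 1.0]]

converted = adain(content_feat, speaker_feat)
out_mu, out_sigma = channel_stats(converted)
spk_mu, spk_sigma = channel_stats(speaker_feat)
```

The speaker statistics are one value per channel regardless of utterance length, which is why they carry time-invariant information, while the normalized content keeps its temporal pattern but loses its original channel statistics.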

3. Experiments

- Settings

Dataset : VCTK
Comparisons : AdaIN-VC, FragmentVC, S2VC

- Results

Pretrained Layer Weights
- Comparing the performance of the SID/ASR models to find suitable speaker/content weightings:
- For speaker weighting, most of the weight falls on the low layers
- For content weighting, the weight falls on the mid-to-high layers

VC Performance
- Wav2Vec-VC achieves the best performance among the compared models

In terms of MOS as well, Wav2Vec-VC achieves the best intelligibility and similarity

Visualization of Speaker/Content Embeddings
- In the t-SNE results for the speaker/content embeddings, utterances in the speaker embedding space cluster by speaker
- The content embeddings do not form speaker-level clusters, which indicates that they do not contain speaker information
