[Paper Review] SEVC: Voice Conversion via Structural Entropy

feVeRin 2025. 5. 30. 17:35

SEVC: Voice Conversion via Structural Entropy

Existing voice conversion methods suffer from prosody leakage and blurring of speech representations
SEVC
- Extracts self-supervised representations from the source and reference speech and builds a graph from the reference speech representations
- Then clusters semantically similar representations using 2D Structural Entropy
  - During voice conversion, each frame of the source representation is treated as a new node, and SE identifies the appropriate semantic cluster for each node
Paper (ICASSP 2025) : Paper Link

1. Introduction

Voice Conversion (VC) aims to modify a source speaker's speech so that it resembles the target speaker while preserving the original linguistic content
- A VC model must accurately replicate vocal characteristics while maintaining high intelligibility
- Representative methods such as FreeVC and AutoVC capture target speaker characteristics with a speaker embedding model and apply techniques such as an information bottleneck and data augmentation
- Alternatively, frame-level self-supervised representations, as in kNN-VC, can simplify the conversion task, but intelligibility remains limited
  1. This is because nearest neighbors in the WavLM feature space can inadvertently capture prosodic information in addition to content similarity
  2. In addition, kNN over-smooths the features, so the blurred speech characteristics degrade intelligibility
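
The over-smoothing point can be illustrated with a minimal sketch of kNN-style frame matching, where each source frame is replaced by the mean of its k most similar target frames. This is only an illustration under assumptions: the helper names `cosine` and `knn_convert_frame` and the toy 2-D frames are hypothetical, and real WavLM frames are high-dimensional.

```python
import math

def cosine(a, b):
    """Cosine similarity between two non-zero feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def knn_convert_frame(source_frame, target_frames, k):
    """Replace a source frame with the mean of its k most similar target frames."""
    ranked = sorted(target_frames, key=lambda f: cosine(source_frame, f), reverse=True)
    neighbors = ranked[:k]
    return [sum(col) / len(neighbors) for col in zip(*neighbors)]

# Toy target frames (hypothetical stand-ins for WavLM features).
target = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]]
sharp = knn_convert_frame([1.0, 0.1], target, k=1)    # keeps one distinct frame
blurred = knn_convert_frame([1.0, 0.1], target, k=3)  # averages dissimilar frames
```

With `k=1` the output is an actual target frame, while `k=3` averages frames that are far apart, which is the blurring effect described above.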

-> The paper therefore proposes SEVC, which obtains disentangled, informative representations of the target speaker

SEVC
- Adopts Neural Structural Entropy (SE) to organize the reference speech representations into a structured graph
  - Each frame corresponds to a node, and edges denote semantic similarity
- Clusters semantically similar representations using 2D structural entropy
  - During conversion, each source speech frame is treated as a node, matched to the nearest semantic cluster, and replaced with that cluster's centroid representation from the reference speech

< Overview of SEVC >

A voice conversion model based on Neural Structural Entropy
As a result, it achieves better conversion performance than existing methods

2. Method

SEVC resembles kNN-VC in that it follows an encoder-converter-vocoder framework
- The converter is optimized with 2D SE to obtain disentangled, informative representations of the target speaker

- Problem Formalization

Given a source voice and a short target voice,
- A pre-trained WavLM first extracts the frame-level source voice representation sequence $X^{s}=[x_{1}^{s},x_{2}^{s},...,x_{N_{s}}^{s}]$ and the frame-level target voice representation set $X^{t}=[x_{1}^{t},x_{2}^{t},...,x_{N_{t}}^{t}]$
  - $x^{s}(x^{t})\in\mathbb{R}^{d}$ : representation of a frame
  - The sequence length $N_{s}$ and the set cardinality $N_{t}$ are proportional to the durations of the source and target voices
- A speech feature graph $G=(V,E,W)$ is then built over the target voice representation set
  - $V=\{\text{ver}_{1},\text{ver}_{2},...,\text{ver}_{n}\}$ : set of vertices corresponding to the speech features in $X^{t}$
  - $E$ : set of edges connecting the vertices
  - $W$ : set of edge weights measuring the similarity between frames of the speech features
- Next, the paper computes the cosine similarity between two speech feature frames $x_{i}^{t},x_{j}^{t}\in X^{t}$
- Partitioning $G$ yields $\{\mathbf{m}_{1},...,\mathbf{m}_{i},...,\mathbf{m}_{j},...,\mathbf{m}_{K}\},\,\mathbf{m}_{i}\subset V,\,\mathbf{m}_{i}\cap \mathbf{m}_{j}=\emptyset\,(i\neq j)$
  - This is a partition of $V$ into $K$ clusters (sets), and each cluster corresponds to a cluster of semantically identical target voice representations
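
The graph construction above can be sketched roughly as follows. This is a minimal pure-Python sketch rather than the paper's implementation: the names `cosine_similarity` and `build_feature_graph`, the toy 2-D frames, and the rule of keeping only edges whose similarity exceeds a hypothetical `threshold` are all assumptions, since the post does not say how edges are selected.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two non-zero, equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def build_feature_graph(frames, threshold=0.0):
    """Return a weighted adjacency dict {i: {j: w_ij}} over frame indices.

    Assumption: an edge (i, j) is kept only when its cosine similarity
    exceeds `threshold`.
    """
    n = len(frames)
    graph = {i: {} for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            w = cosine_similarity(frames[i], frames[j])
            if w > threshold:
                graph[i][j] = w
                graph[j][i] = w  # undirected graph: store both directions
    return graph

# Toy stand-ins for target frames X^t (real WavLM frames live in R^d).
target_frames = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
G = build_feature_graph(target_frames, threshold=0.5)
```

In this toy case only the pairs (0, 1) and (2, 3) are similar enough to be connected, so the graph already shows two candidate semantic clusters.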

- Disentangled Matching Set

Speech feature graph partitioning decodes $G$ into a partition $\mathcal{P}$, which defines the semantic clustering
- kNN-VC sometimes loses speech clarity because semantically irrelevant information leaks into the output
  - The paper therefore guides the matching process through 2D Structural Entropy (SE) minimization
- First, each speech feature $x_{1}^{t},...,x_{N_{t}}^{t}$ is initially assigned to its own cluster
  1. The clusters are grouped into subsets of size $n$, and within each subset a vanilla greedy algorithm merges them into larger clusters
  2. The newly formed clusters are passed to the next iteration, and the iterative process continues until all speech feature clusters are evaluated simultaneously
    - If no merge is possible within a subset, the subset size $n$ is increased to allow potential merges
- In the figure below, (A) shows the speech feature graph construction over nodes $x_{1}^{t}$ to $x_{10000}^{t}$, and (B) shows the matching set construction process:
  1. First, features $x_{1}^{t}$ through $x_{10000}^{t}$ are each assigned to a separate cluster
    - Subsets of $n=1024$ clusters are then examined, producing subgraphs $G'$
  2. Then, as in (B.1), the clusters of each $G'$ are merged through vanilla 2D SE minimization to build $\mathcal{P}'$
  3. The result of each iteration is passed to the next step, as in (B.2), and this continues until a partition $\mathcal{P}'$ encompassing all speech features is reached
- As a result, SEVC builds disentangled semantic clusters on the speech feature graph in an unsupervised manner, which reduces irrelevant information leakage and improves the intelligibility of the converted speech
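
The merging procedure above can be sketched as follows, assuming the standard definition of two-dimensional structural entropy, in which a cluster $\mathbf{m}$ contributes $-\frac{g_{\mathbf{m}}}{\text{Vol}_{G}}\log_{2}\frac{\text{Vol}_{\mathbf{m}}}{\text{Vol}_{G}}-\sum_{v\in\mathbf{m}}\frac{d_{v}}{\text{Vol}_{G}}\log_{2}\frac{d_{v}}{\text{Vol}_{\mathbf{m}}}$. The function names, the toy graph, and the choice to double $n$ when no merge happens are assumptions for illustration, not details confirmed by the paper.

```python
import math

def two_d_se(graph, partition):
    """2D structural entropy of a weighted graph under a partition (list of sets)."""
    degree = {v: sum(nbrs.values()) for v, nbrs in graph.items()}
    vol_g = sum(degree.values())
    if vol_g == 0:
        return 0.0  # no edges: entropy is taken as zero in this sketch
    total = 0.0
    for cluster in partition:
        vol_m = sum(degree[v] for v in cluster)
        if vol_m == 0:
            continue
        # g_m: total weight of edges leaving the cluster
        g_m = sum(w for v in cluster for u, w in graph[v].items() if u not in cluster)
        total -= g_m / vol_g * math.log2(vol_m / vol_g)
        total -= sum(degree[v] / vol_g * math.log2(degree[v] / vol_m)
                     for v in cluster if degree[v] > 0)
    return total

def greedy_merge(graph, partition):
    """Vanilla greedy step: keep merging the pair that lowers 2D SE the most."""
    partition = [set(c) for c in partition]
    while len(partition) > 1:
        base = two_d_se(graph, partition)
        best_gain, best_pair = 1e-12, None  # tolerance against float noise
        for i in range(len(partition)):
            for j in range(i + 1, len(partition)):
                rest = [c for k, c in enumerate(partition) if k not in (i, j)]
                gain = base - two_d_se(graph, rest + [partition[i] | partition[j]])
                if gain > best_gain:
                    best_gain, best_pair = gain, (i, j)
        if best_pair is None:
            break  # no merge reduces the entropy any further
        i, j = best_pair
        merged = partition[i] | partition[j]
        partition = [c for k, c in enumerate(partition) if k not in (i, j)] + [merged]
    return partition

def hierarchical_merge(graph, n):
    """Run greedy_merge inside subsets of n clusters until one subset covers all."""
    partition = [{v} for v in graph]  # every frame starts in its own cluster
    while n < len(partition):
        merged = []
        for start in range(0, len(partition), n):
            subset = partition[start:start + n]
            nodes = set().union(*subset)
            sub = {v: {u: w for u, w in graph[v].items() if u in nodes} for v in nodes}
            merged.extend(greedy_merge(sub, subset))
        if len(merged) == len(partition):
            n *= 2  # nothing merged: widen the subsets (doubling is an assumption)
        partition = merged
    return greedy_merge(graph, partition)  # final pass over all clusters at once

# Toy graph: two triangles (weight 1.0) joined by one weak edge (weight 0.1).
edges = [(0, 1, 1.0), (0, 2, 1.0), (1, 2, 1.0),
         (3, 4, 1.0), (3, 5, 1.0), (4, 5, 1.0), (2, 3, 0.1)]
G = {v: {} for v in range(6)}
for u, v, w in edges:
    G[u][v] = G[v][u] = w
clusters = hierarchical_merge(G, n=1024)
print(sorted(sorted(c) for c in clusters))  # [[0, 1, 2], [3, 4, 5]]
```

With only six nodes, a subset size of 1024 covers every cluster, so the sketch reduces to a single greedy pass. On a graph with thousands of frames, the subset loop is what keeps each greedy pass small. The pairwise search here recomputes the full entropy for every candidate merge, which is simple but slow; it is meant to show the logic, not to scale.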

- Node Game-based 2D SE Matching

kNN-VC blurs speech characteristics through representation over-smoothing, which can degrade the intelligibility of the synthesized speech
- The paper therefore treats the matching process as graph nodes dynamically categorizing subgraphs, as in (C.1) of the figure above
  1. In this framework, the source voice representation is incorporated as an additional node, as in (C.2), and the added node iteratively chooses a suitable cluster through a structural entropy heuristic function
  2. Once the appropriate cluster is identified, the node representations within that cluster are averaged to produce the replacement representation from the target speech, as in (C.3)
- The 2D SE matching process accurately preserves speaker characteristics and thereby prevents over-smoothing
  1. Suppose the speech feature graph and the matching set $\mathcal{M}=[\mathbf{m}_{1},...,\mathbf{m}_{i},...,\mathbf{m}_{K}]$ are given
  2. The source voice representation node $x^{s}$ chooses the current matching set $\mathbf{m}_{i}$, and the matching set changes to $\mathcal{M}'=[\mathbf{m}_{1},...,\mathbf{m}'_{i},...,\mathbf{m}_{K},\{x^{s}\}],\,\,(\mathbf{m}_{i}=\mathbf{m}'_{i}\cup\{x^{s}\})$
  3. The resulting change in the graph's 2D SE is formulated as:
    (Eq. 1) $\Delta_{choose}(x^{s},\mathbf{m}_{i})=\mathcal{H}^{\mathcal{T}}(G)-\mathcal{H}^{\mathcal{T}'}(G) =\sum_{n=1}^{|\mathcal{M}|}H^{(2)}(\mathbf{m}_{n})-\sum_{n=1}^{|\mathcal{M}'|}H^{(2)}(\mathbf{m}'_{n})$
    $\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,=H^{(2)}(\mathbf{m}_{i})-H^{(2)}(\mathbf{m}'_{i})-H^{(2)}(\{x^{s}\})$
    $\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,=-\frac{g_{\mathbf{m}_{i}}}{\text{Vol}_{G}}\log\frac{\text{Vol}_{\mathbf{m}_{i}}}{\text{Vol}_{G}}+ \frac{g_{\mathbf{m}'_{i}}}{\text{Vol}_{G}}\log\frac{\text{Vol}_{\mathbf{m}'_{i}}}{\text{Vol}_{G}}- \frac{g_{\mathbf{m}'_{i}}}{\text{Vol}_{G}}\log\frac{\text{Vol}_{\mathbf{m}'_{i}}}{\text{Vol}_{\mathbf{m}_{i}}}-\frac{d_{x}}{\text{Vol}_{G}}\log\frac{\text{Vol}_{G}}{\text{Vol}_{\mathbf{m}_{i}}}$
    - $\Delta_{choose}(x^{s},\mathbf{m}_{i})$ : change in 2D SE when node $x^{s}$ chooses cluster $\mathbf{m}_{i}$
    - $\mathcal{T},\mathcal{T}'$ : encoding trees for the matching sets $\mathcal{M},\mathcal{M}'$
    - $\mathcal{H}^{\mathcal{T}}(G), \mathcal{H}^{\mathcal{T}'}(G)$ : 2D SE of the graph under $\mathcal{M}$ and $\mathcal{M}'$, respectively
    - $\text{Vol}_{G}$ : volume of the graph, $\text{Vol}_{\mathbf{m}_{i}},\text{Vol}_{\mathbf{m}'_{i}}$ : volumes of clusters $\mathbf{m}_{i},\mathbf{m}'_{i}$
    - $g_{\mathbf{m}_{i}}, g_{\mathbf{m}'_{i}}$ : total cut edge weight of $\mathbf{m}_{i},\mathbf{m}'_{i}$
    - $d_{x}$ : degree of node $x^{s}$
- As a result, the source voice representation chooses and joins the semantic cluster with the smallest 2D SE change:
  (Eq. 2) $t=\text{Min}\left(\Delta_{choose}(x^{s},\mathbf{m}_{i})\right)$
  - $t$ : target cluster index, $\text{Min}$ : operation that finds the index of the matching set with the smallest 2D SE change
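
The selection rule in Eq. 1 and Eq. 2 can be sketched as below. The sketch evaluates the first line of Eq. 1 directly, as the 2D SE with $x^{s}$ inside $\mathbf{m}_{i}$ minus the 2D SE with $x^{s}$ as a singleton, instead of using the closed form. The function names, the toy frames, and the choice to connect the source node by positive cosine similarity are assumptions, not details given in the post.

```python
import math

def cosine(a, b):
    """Cosine similarity between two non-zero feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def two_d_se(graph, partition):
    """2D structural entropy of a weighted graph under a partition (list of sets)."""
    degree = {v: sum(nbrs.values()) for v, nbrs in graph.items()}
    vol_g = sum(degree.values())
    total = 0.0
    for cluster in partition:
        vol_m = sum(degree[v] for v in cluster)
        if vol_m == 0:
            continue
        g_m = sum(w for v in cluster for u, w in graph[v].items() if u not in cluster)
        total -= g_m / vol_g * math.log2(vol_m / vol_g)
        total -= sum(degree[v] / vol_g * math.log2(degree[v] / vol_m)
                     for v in cluster if degree[v] > 0)
    return total

def match_frame(source_frame, target_frames, clusters):
    """Return (cluster index t, centroid) chosen for one source frame."""
    n = len(target_frames)
    x = n  # node id given to the added source frame
    frames = target_frames + [source_frame]
    # Assumption: edges are cosine similarities, kept only when positive.
    graph = {i: {} for i in range(n + 1)}
    for i in range(n + 1):
        for j in range(i + 1, n + 1):
            w = cosine(frames[i], frames[j])
            if w > 0:
                graph[i][j] = graph[j][i] = w
    singleton_se = two_d_se(graph, clusters + [{x}])  # x^s kept as its own cluster
    deltas = []
    for i in range(len(clusters)):
        joined = [c | {x} if k == i else c for k, c in enumerate(clusters)]
        deltas.append(two_d_se(graph, joined) - singleton_se)  # Eq. 1
    t = min(range(len(clusters)), key=deltas.__getitem__)      # Eq. 2
    members = [target_frames[v] for v in sorted(clusters[t])]
    centroid = [sum(col) / len(members) for col in zip(*members)]
    return t, centroid

target_frames = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
clusters = [{0, 1}, {2, 3}]
t, centroid = match_frame([0.95, 0.05], target_frames, clusters)
print(t)  # 0
```

Because the singleton term is the same for every candidate cluster, picking the smallest $\Delta_{choose}$ is equivalent to picking the cluster whose joined partition has the lowest 2D SE. The returned centroid is the average of the target frames in the chosen cluster, which is what replaces the source frame.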

- Pipeline of SEVC

WavLM first extracts self-supervised representations from both the source and reference speech
- The SE converter then maps each source speech frame to the reference speech using structural entropy
- Finally, the vocoder synthesizes the waveform from the converted features
  - The pipeline is non-parametric and requires no additional training, although the vocoder may need further training for optimized performance
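
The three stages above can be wired together as in the skeleton below. Every callable passed in (`encode`, `build_clusters`, `match`, `vocode`) is a hypothetical stand-in: a real system would plug in WavLM, the SE-based clustering and matching, and a neural vocoder. The toy functions exist only so the skeleton runs end to end.

```python
from typing import Callable, List, Sequence

Frame = List[float]

def convert_voice(
    source_wave: Sequence[float],
    reference_wave: Sequence[float],
    encode: Callable[[Sequence[float]], List[Frame]],
    build_clusters: Callable[[List[Frame]], List[List[Frame]]],
    match: Callable[[Frame, List[List[Frame]]], int],
    vocode: Callable[[List[Frame]], List[float]],
) -> List[float]:
    """Encoder -> converter -> vocoder, with each stage supplied by the caller."""
    source_frames = encode(source_wave)          # X^s
    reference_frames = encode(reference_wave)    # X^t
    clusters = build_clusters(reference_frames)  # matching set over X^t
    converted = []
    for frame in source_frames:
        members = clusters[match(frame, clusters)]
        # Replace the source frame with the centroid of the chosen cluster.
        converted.append([sum(col) / len(members) for col in zip(*members)])
    return vocode(converted)

# Toy stand-ins (assumptions for illustration only).
def toy_encode(wave):
    return [[float(s)] for s in wave]

def toy_clusters(frames):
    low = [f for f in frames if f[0] < 0.5]
    high = [f for f in frames if f[0] >= 0.5]
    return [c for c in (low, high) if c]

def toy_match(frame, clusters):
    centroids = [sum(f[0] for f in c) / len(c) for c in clusters]
    return min(range(len(clusters)), key=lambda i: abs(frame[0] - centroids[i]))

def toy_vocode(frames):
    return [f[0] for f in frames]

out = convert_voice([0.1, 0.9], [0.0, 0.2, 0.8, 1.0],
                    toy_encode, toy_clusters, toy_match, toy_vocode)
```

Passing the stages as arguments reflects the non-parametric design: the converter itself holds no trained weights, so only the encoder and vocoder would be learned components.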

3. Experiments

- Settings

Dataset : LibriSpeech
Comparisons : kNN-VC, FreeVC, VQMIVC, YourTTS

- Results

Overall, SEVC achieves the best performance

Analysis of Speech Representations Blurring
- SEVC provides accurate predictions, as in (b) of the figure below

Analysis of Prosody Information Leakage
- In the $t$-SNE visualization, SEVC successfully disentangles prosodic effects
- That is, it effectively clusters speech representations with different prosody
