[Paper Review] Eta-WavLM: Efficient Speaker Identity Removal in Self-Supervised Speech Representations Using a Simple Linear Equation

feVeRin 2025. 6. 24. 17:01

Eta-WavLM: Efficient Speaker Identity Removal in Self-Supervised Speech Representations Using a Simple Linear Equation

Existing Self-Supervised Learning models do not fully disentangle speaker identity
Eta-WavLM
- Linearly decomposes a Self-Supervised Learning representation into speaker-specific and speaker-independent components
- Then produces a speaker-disentangled representation from the linearly decomposed features
Paper (ACL 2025) : Paper Link

1. Introduction

Speech-related tasks such as Text-to-Speech (TTS), Voice Conversion (VC), and Automatic Speech Recognition (ASR) require large amounts of high-quality labeled data
- To address this, Self-Supervised Learning (SSL) approaches such as Wav2Vec 2.0, HuBERT, and WavLM learn latent representations from large unlabeled data
- SSL representations can encode various speech attributes such as linguistic content, speaker identity, and emotion, but they are task-agnostic
  1. For example, tasks such as VC and TTS need representations with rich content and minimal speaker identity
    - In contrast, speaker classification and verification tasks need rich speaker information
  2. That is, disentangling speaker and non-speaker information in SSL representations can substantially improve task-specific performance
- $k$-means clustering can be used for this speaker disentanglement, but it can compromise linguistic content and prosody
  1. Alternatively, one can use perturbation techniques as in NANSY, or extract content-related-only features as in ContentVec
  2. However, these methods require complex, resource-intensive strategies and are still limited in terms of high-level disentanglement

-> The paper therefore proposes Eta-WavLM for simple speaker identity disentanglement of SSL representations

Eta-WavLM
- Disentangles speaker identity from SSL representations simply, without complex training strategies, loss functions, or fine-tuning
- Linearly decomposes the SSL representation into a speaker-dependent $\mathbf{d}$ and a speaker-independent $\eta$ to extract the $\text{eta}$ representation
  - That is, given a known $\mathbf{d}$, it solves a linear inverse problem to extract the speaker-independent $\text{eta}$ representation

< Overview of Eta-WavLM >

An SSL model that performs speaker identity disentanglement using linear decomposition
As a result, it achieves better performance than existing methods

2. Method

Eta-WavLM aims to extract a disentangled $\text{eta}$ representation and consists of the following three components:
- SSL Model
  - Extracts the SSL representation from the raw waveform
- Speaker Encoder
  - Produces a speaker embedding from the same waveform
- Disentanglement Module
  - Derives the speaker-independent $\text{eta}$ representation from the input SSL representation, conditioned on the speaker embedding

- Problem Definition

The disentanglement module decomposes the SSL representation $\mathbf{s}$ into a speaker-dependent component $\mathbf{d}$ and a speaker-independent component $\eta$
- For a given data point, $\mathbf{s}$ and $\mathbf{d}$ can be obtained from a pre-trained SSL model and a pre-trained speaker encoder, respectively
- Thus $\mathbf{s}$ is expressed as a function of the known $\mathbf{d}$ and an additional unknown term $\eta$, where $\eta$ must contain all information that cannot be inferred from $\mathbf{d}$
  1. To this end, the paper considers the additive relationship in (Eq. 1):
    (Eq. 1) $\mathbf{s}=f(\mathbf{d})+\eta$
    - Ideally, when $\mathbf{d}$ effectively represents the speaker characteristics, $\eta$ should contain the linguistic, prosodic, and environment information
  2. Consequently, the speaker-independent component $\eta$ is computed as:
    (Eq. 2) $\eta=\mathbf{s}-f(\mathbf{d})$

- Computation of Latent Basis and Bias

Since a large embedding space can linearize complex non-linear relationships, the paper uses a linear model $f()$ to approximate the embedding space
- First, consider a multi-speaker dataset consisting of $U$ utterances of raw speech
  1. Here a generic utterance is denoted $\mathbf{u}_{i}$, and the speaker embedding extracted by the pre-trained speaker encoder $\mathcal{E}$ is denoted $\mathbf{e}_{i}\in\mathbb{R}^{V}$
    - $i\in[1,U]$
  2. The SSL representation extracted by the pre-trained SSL model $\mathcal{S}$ is $\mathbf{S}_{i}=[\mathbf{s}_{1},...,\mathbf{s}_{M}]^{\top}$
    - $\mathbf{s}_{m}\in\mathbb{R}^{Q}$ : $m$-th frame, $M$ : sequence length
  3. Since $M$ can be large, $L$ frames are randomly subsampled from each utterance to produce a fixed-length representation $\mathbf{S}_{i}\in\mathbb{R}^{L\times Q}$
  4. Stacking all $\mathbf{S}_{i}$ then gives the entire SSL representation $\mathbf{S}\in\mathbb{R}^{N\times Q}$
    - $N=U\times L$
- To align $\mathbf{e}$ with the sequence length of $\mathbf{S}$, the paper assumes that the speaker embedding captures speaker-level information and is constant over all frames of an utterance
  1. Based on this, $\mathbf{e}_{i}$ is replicated $L$ times along the frame axis to obtain $\mathbf{E}_{i}\in\mathbb{R}^{V\times L}$
  2. Stacking all $\mathbf{E}_{i}$ then gives the entire embedding representation $\mathbf{E}\in\mathbb{R}^{V\times N}$
  3. Since $V$ can be large, Principal Component Analysis (PCA) is applied to reduce the dimension to $P<V$, yielding $\mathbf{D}\in\mathbb{R}^{P\times N}$
    - This reduction removes redundancy and retains only the informative components
- Given $\mathbf{S}\in\mathbb{R}^{N\times Q}$ and $\mathbf{D}\in\mathbb{R}^{P\times N}$, their relationship is modeled as:
  (Eq. 3) $\mathbf{S}=\mathbf{D}^{\top}\mathbf{A}+\mathbf{1}_{N}\mathbf{b}^{\top}$
  - $\mathbf{A}\in\mathbb{R}^{P\times Q}, \mathbf{b}\in\mathbb{R}^{Q\times 1}$ : learnable parameters
- Rewriting (Eq. 3):
  (Eq. 4) $\mathbf{S}=\tilde{\mathbf{D}}^{\top}\tilde{\mathbf{A}}$
  - $\tilde{\mathbf{D}}^{\top}=[\mathbf{D}^{\top}\ \mathbf{1}_{N}], \tilde{\mathbf{A}}^{\top}=[\mathbf{A}^{\top}\ \mathbf{b}]$
- The optimization problem is then:
  (Eq. 5) $\tilde{\mathbf{A}}^{*}=\arg\min_{\tilde{\mathbf{A}}}\left|\left| \mathbf{S}-\tilde{\mathbf{D}}^{\top}\tilde{\mathbf{A}}\right|\right|_{F}$
- (Eq. 5) can be solved with the pseudo-inverse:
  (Eq. 6) $\tilde{\mathbf{A}}^{*}=\left(\tilde{\mathbf{D}}\tilde{\mathbf{D}}^{\top}\right)^{-1}\tilde{\mathbf{D}}\mathbf{S}$
  - Here $\tilde{\mathbf{D}}\in\mathbb{R}^{(P+1)\times N}$, so $\tilde{\mathbf{D}}\tilde{\mathbf{D}}^{\top}$ is $(P+1)\times(P+1)$ and $\tilde{\mathbf{A}}^{*}$ is $(P+1)\times Q$
  - $\tilde{\mathbf{A}}^{*\top}=[\mathbf{A}^{*\top}\ \mathbf{b}^{*}]$, $\mathbf{A}^{*},\mathbf{b}^{*}$ : latent basis and bias, respectively
  - This procedure learns the function $f()$, and since $\mathbf{A}^{*},\mathbf{b}^{*}$ are known, the disentanglement module can generate the $\text{eta}$ representation
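
The fitting procedure above can be sketched in NumPy. This is only a minimal illustration under stated assumptions, not the paper's implementation: random arrays stand in for the outputs of the pre-trained SSL model and speaker encoder, the toy sizes are hypothetical, PCA is done with a centered SVD (the mean vector `mean_e` is an assumption), and `np.linalg.lstsq` is used as the solver for the least-squares problem in (Eq. 5).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy sizes (not from the paper): U utterances, L frames each,
# Q-dim SSL features, V-dim speaker embeddings, P PCA components.
U, L, Q, V, P = 8, 5, 6, 10, 3
N = U * L

# Random stand-ins for the pre-trained models' outputs.
S = rng.normal(size=(N, Q))   # stacked, subsampled SSL frames: (N, Q)
e = rng.normal(size=(U, V))   # one speaker embedding per utterance

# Replicate each embedding L times along the frame axis: E is (V, N).
E = np.repeat(e, L, axis=0).T

# PCA via SVD of the centered embeddings; keep the top-P directions.
mean_e = E.mean(axis=1, keepdims=True)           # assumed centering step
_, _, Vt = np.linalg.svd((E - mean_e).T, full_matrices=False)
C = Vt[:P]                                       # principal components: (P, V)
D = C @ (E - mean_e)                             # reduced embeddings: (P, N)

# Append a row of ones so the bias is absorbed: D_tilde is (P + 1, N).
D_tilde = np.vstack([D, np.ones((1, N))])

# Least-squares fit of S ~= D_tilde^T A_tilde, as in (Eq. 5).
A_tilde, *_ = np.linalg.lstsq(D_tilde.T, S, rcond=None)
A_star, b_star = A_tilde[:P], A_tilde[P]         # (P, Q) and (Q,)

# Residual of the fit: the speaker-independent part of the training frames.
eta = S - (D.T @ A_star + b_star)
```

Because the fit includes a bias column, the residual `eta` is orthogonal to every row of `D_tilde`, so `D_tilde @ eta` is numerically zero. In this linear sense, `eta` carries nothing that the reduced speaker embedding can predict.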

- Creation of Eta Representations

At inference, the speaker-independent $\text{eta}$ representation is generated from the raw waveform
- First, given an utterance $\mathbf{u}'$, the pre-trained SSL model $\mathcal{S}$ extracts the SSL representation $\mathbf{S}\in\mathbb{R}^{K\times Q}$:
  (Eq. 7) $\mathbf{S}=\mathcal{S}(\mathbf{u}';\mathbf{W}_{\mathcal{S}})$
  - $\mathbf{W}_{\mathcal{S}}$ : frozen parameters of the SSL model, $K$ : sequence length
- Next, the pre-trained speaker encoder $\mathcal{E}$ produces the speaker embedding $\mathbf{e}\in\mathbb{R}^{V\times 1}$:
  (Eq. 8) $\mathbf{e}=\mathcal{E}(\mathbf{u}';\mathbf{W}_{\mathcal{E}})$
  - $\mathbf{W}_{\mathcal{E}}$ : frozen parameters of the speaker encoder
- PCA is applied to reduce the dimensionality of $\mathbf{e}$, giving $\mathbf{d}\in\mathbb{R}^{P\times 1}$:
  (Eq. 9) $\mathbf{d}=\mathcal{PCA}(\mathbf{e};\mathbf{C}_{\mathcal{PCA}})$
  - $\mathbf{C}_{\mathcal{PCA}}$ : principal component matrix obtained from the PCA process
- Finally, the disentanglement module $\mathcal{H}$ extracts the speaker-independent $\text{eta}$ representation $\eta\in\mathbb{R}^{K\times Q}$:
  (Eq. 10) $\eta=\mathcal{H}(\mathbf{S};\mathbf{d},\mathbf{A}^{*},\mathbf{b}^{*})$
  - $\mathbf{A}^{*},\mathbf{b}^{*}$ : the latent basis and the bias obtained in the first step, respectively
- Here $\mathcal{H}()$ is:
  (Eq. 11) $\mathcal{H}(\mathbf{S})=\mathbf{S}-\mathbf{1}_{K}(\mathbf{d}^{\top}\mathbf{A}^{*}+\mathbf{b}^{*\top})$
  - Since the paper uses WavLM as the SSL model $\mathcal{S}$, the SSL representation $\mathbf{S}$ is the WavLM representation and $\eta$ is the Eta-WavLM representation
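
The inference steps can be sketched as a single function. This is a hedged illustration rather than the authors' code: `eta_representation` and its argument names are hypothetical, the inputs are random stand-ins for real WavLM features and speaker embeddings, and the PCA mean `mean_e` is an assumption (the source only mentions the principal component matrix). The function subtracts the same speaker term from every frame, following $\eta=\mathbf{s}-f(\mathbf{d})$ in (Eq. 2).

```python
import numpy as np

def eta_representation(S, e, C, mean_e, A_star, b_star):
    """Remove the linearly predicted speaker term from every frame of S.

    S: (K, Q) SSL frames, e: (V,) speaker embedding,
    C: (P, V) PCA components, mean_e: (V,) assumed PCA mean,
    A_star: (P, Q) latent basis, b_star: (Q,) bias.
    """
    d = C @ (e - mean_e)                # reduced speaker embedding: (P,)
    speaker_term = d @ A_star + b_star  # f(d), one Q-dim vector per utterance
    return S - speaker_term             # broadcast over the K frames

# Toy inputs with hypothetical sizes; real use would take S from the SSL
# model and e from the speaker encoder for the same utterance.
rng = np.random.default_rng(1)
K, Q, V, P = 7, 6, 10, 3
S = rng.normal(size=(K, Q))
e = rng.normal(size=V)
C = np.linalg.qr(rng.normal(size=(V, P)))[0].T  # (P, V), orthonormal rows
mean_e = np.zeros(V)
A_star = rng.normal(size=(P, Q))
b_star = rng.normal(size=Q)

eta = eta_representation(S, e, C, mean_e, A_star, b_star)
```

The output has the same shape as `S`, and the amount removed is identical for every frame. This matches the assumption that the speaker embedding is constant over an utterance.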

3. Experiments

- Settings

Dataset : LibriSpeech
Comparisons : WavLM

- Results

Eta-WavLM achieves better classification accuracy than WavLM

In the UMAP projection, Eta-WavLM representations show no discernible speaker clusters
- That is, speaker-specific information is effectively minimized

Voice Conversion Task
- On the Voice Conversion task, using Eta-WavLM achieves strong performance

Ablation Study
- Applying PCA to ECAPA-TDNN speaker embeddings preserves linguistic content most effectively
