[Paper Review] REWIND: Speech Time Reversal for Enhancing Speaker Representations in Diffusion-based Voice Conversion

feVeRin 2025. 9. 23. 17:01

REWIND: Speech Time Reversal for Enhancing Speaker Representations in Diffusion-based Voice Conversion

Speech Time Reversal retains tonal patterns that are useful for speaker identification
REWIND
- Introduces an augmentation strategy that uses speaker representations learned from time-reversed speech
- Applied to a diffusion-based voice conversion model, it preserves the speaker's unique vocal traits while minimizing interference from linguistic content
Paper (INTERSPEECH 2025) : Paper Link

1. Introduction

Voice Conversion (VC) aims to alter a source speaker's speech signal so that it is perceived as a target speaker, while retaining the original linguistic content
- In particular, one-/zero-shot VC offers flexibility for unseen voices
- VC models generally use a speaker encoder to capture speaker identity
  1. The encoder can be pre-trained on a large dataset or modeled jointly with the VC model
  2. The speaker embedding captures attributes such as timbre, pitch, and prosody that characterize speaker identity
    - BUT, the speaker embedding suffers from interference because speaker-specific characteristics are entangled with language-related features

-> The paper therefore proposes REWIND, which improves speaker representations for the VC task

REWIND
- Performs data augmentation with Speech Time Reversal (STR), which inverts the entire speech signal along the time axis
- Trains a speaker embedding on time-reversed speech and fuses it with the original speaker embedding to further condition the diffusion-based decoder

< Overview of REWIND >

A data augmentation strategy based on Speech Time Reversal for improving VC systems
As a result, it achieves better VC performance than existing methods

2. Method

- Speech Time Reversal (STR) Strategy

For $t\in[0,l]$, let $x(t)$ be the original signal. Flipping $x(t)$ along the time axis gives the time-reversed speech signal $x(l-t)$
- The paper first runs a controlled perceptual study to assess how well speaker identity is preserved in time-reversed speech
- With this study, the paper analyzes whether a speaker's unique vocal attributes (timbre, intonation) remain recognizable after the complete Speech Time Reversal process
  - As shown in the figure below, the correct speaker is identified from the time-reversed speech signal with $80.3\%$ accuracy

Speaker Identification on Reversed Speech
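
For a discrete waveform, the reversal $x(l-t)$ amounts to flipping the sample order. The following is a minimal NumPy sketch, assuming a mono 1-D array; the helper name `time_reverse` and the toy ramp signal are illustrative and do not come from the paper.

```python
import numpy as np

def time_reverse(x: np.ndarray) -> np.ndarray:
    """Complete time reversal x(l - t) of a 1-D waveform (hypothetical helper)."""
    # Negative-step slicing flips the sample order; copy() returns an independent array.
    return x[::-1].copy()

# Toy signal: a rising ramp becomes a falling ramp after reversal.
x = np.arange(5, dtype=np.float32)
x_rev = time_reverse(x)

# For a real-valued signal, the whole-utterance magnitude spectrum is unchanged by reversal.
mag = np.abs(np.fft.rfft(x))
mag_rev = np.abs(np.fft.rfft(x_rev))
same_magnitude = bool(np.allclose(mag, mag_rev))
```

Reversal changes only the phase of the whole-utterance spectrum, not its magnitude. That may be one way to see why long-term spectral cues can survive, but the evidence for speaker identifiability is the perceptual study reported in the paper.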

Comparing Speech Time Reversal with Short-Time Speech Reversal:
- As shown in the figure below, complete speech reversal keeps the harmonic structure prominently visible, so speaker-specific information is strongly retained
- That is, the clear presence of harmonic patterns means that even when reversed speech is unintelligible, it can preserve acoustic cues unique to the speaker, such as timbre and pitch

Spectrogram (a) Original (b) 20ms Short-Time (c) 100ms Short-Time Speech Reversal (d) Complete STR
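
Short-time reversal, as in panels (b) and (c), can be sketched by flipping each fixed-length segment in place while keeping the segment order. This sketch assumes non-overlapping segments with no windowing or cross-fading, which may differ from the paper's setup; `short_time_reverse` is a hypothetical helper name.

```python
import numpy as np

def short_time_reverse(x: np.ndarray, sample_rate: int, segment_ms: float) -> np.ndarray:
    """Reverse each consecutive segment of `segment_ms` milliseconds (hypothetical helper)."""
    # Segment length in samples; at least one sample per segment.
    seg_len = max(1, int(round(sample_rate * segment_ms / 1000)))
    out = np.empty_like(x)
    for start in range(0, len(x), seg_len):
        end = min(start + seg_len, len(x))
        # Flip the samples inside the segment; the segment order is unchanged.
        out[start:end] = x[start:end][::-1]
    return out

# Toy example: 10 samples at 1 kHz with 4 ms segments (4 samples each).
x = np.arange(10)
y_short = short_time_reverse(x, sample_rate=1000, segment_ms=4)

# A segment that spans the whole signal reduces to complete reversal.
y_full = short_time_reverse(x, sample_rate=1000, segment_ms=10)
```

For real speech, the same function would be called with the recording's sample rate and `segment_ms=20` or `segment_ms=100` to mirror the two short-time conditions in the figure.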

Comparing WER and speaker similarity (SS) confirms that complete speech reversal is more effective at preserving speaker identity without linguistic interference
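
The two metrics can be sketched as follows. WER is computed here as word-level edit distance divided by the reference length. The post does not say how SS is computed, so cosine similarity between speaker embeddings is an assumption. Both function names are illustrative.

```python
import numpy as np

def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i, j] = edit distance between ref[:i] and hyp[:j]
    dp = np.zeros((len(ref) + 1, len(hyp) + 1), dtype=int)
    dp[:, 0] = np.arange(len(ref) + 1)
    dp[0, :] = np.arange(len(hyp) + 1)
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i, j] = min(dp[i - 1, j] + 1,        # deletion
                           dp[i, j - 1] + 1,        # insertion
                           dp[i - 1, j - 1] + cost)  # substitution or match
    return float(dp[-1, -1]) / max(len(ref), 1)

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Assumed SS measure: cosine similarity between two speaker embeddings."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

wer = word_error_rate("the cat sat", "the dog sat")   # one substitution in three words
ss_same = cosine_similarity(np.array([1.0, 0.0]), np.array([1.0, 0.0]))
ss_orth = cosine_similarity(np.array([1.0, 0.0]), np.array([0.0, 1.0]))
```

In this framing, unintelligible reversed speech gives a high WER, while a high SS indicates that speaker cues are retained.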

- Diffusion-based Voice Conversion

Denoising diffusion models such as DiffVC, Diff-HierVC, and DDDM-VC achieve strong performance on the VC task
- These diffusion VC models consist of a diffusion process that incrementally adds noise to the data and a reverse denoising process that gradually recovers the original signal
  - Unlike conventional diffusion models that use a Markov chain, score-based modeling can improve flexibility and sample quality
- In particular, the paper uses DDDM-VC, which controls style attributes with multiple disentangled denoisers, as the backbone architecture
  - DDDM-VC disentangles self-supervised speech representations with a source-filter encoder/decoder and introduces a mixup technique to improve conversion robustness
- Structurally, DDDM-VC uses three representations for speech disentanglement:
  1. Content Representation : phonetic content obtained from Wav2Vec 2.0, XLS-R
  2. Pitch Representation : intonation captured by extracting the fundamental frequency $F0$ with the YAAPT algorithm and encoding it with a VQ-VAE
  3. Speaker Representation : a speaker representation extracted from the target mel-spectrogram by a speaker embedding network
    - These features are averaged at the utterance level and integrated throughout the network to support robust zero-shot voice style transfer
- Here, to further enhance the speaker representation, the paper combines the original speaker representation with the speaker representation derived from Speech Time Reversal
  1. The two embeddings are integrated through a weighted fusion layer, and the combined speaker representation $S_{cmb}$ is defined as:
    (Eq. 1) $S_{cmb}=\alpha\cdot S+\beta\cdot S_{rev}$
    - $S, S_{rev}$ : speaker embeddings of the original and time-reversed speech, respectively
    - $\alpha,\beta$ : weight coefficients in the range $[0,1]$
  2. Next, the source encoder, which processes pitch, and the filter encoder, which processes content, produce the fully reconstructed source/filter mel-spectrograms $Z_{src}, Z_{ftr}$, and the reverse process runs with these as the prior
  3. These are regularized toward the target mel-spectrogram $X_{mel}$ with the loss function in (Eq. 2):
    (Eq. 2) $\mathcal{L}_{rec}=||X_{mel}-(Z_{src}+Z_{ftr})||_{1}$
    - $Z_{src}=E_{src}(\text{pitch},s),Z_{ftr}=E_{ftr}(\text{content},S_{cmb})$
- As a result, the entire model except the pre-trained XLS-R and $F0$ VQ-VAE, i.e. the style encoder, source-filter encoder, and decoder, is trained jointly
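
(Eq. 1) and (Eq. 2) can be illustrated with a small NumPy sketch. The array shapes, toy values, and function names are assumptions for illustration. The encoders $E_{src}, E_{ftr}$ are not implemented, so their outputs are replaced by placeholder arrays. The default weights of 0.5 follow the equal weighting reported in the ablation.

```python
import numpy as np

def fuse_speaker_embeddings(s: np.ndarray, s_rev: np.ndarray,
                            alpha: float = 0.5, beta: float = 0.5) -> np.ndarray:
    """(Eq. 1): weighted sum of the original and time-reversed speaker embeddings."""
    return alpha * s + beta * s_rev

def reconstruction_loss(x_mel: np.ndarray, z_src: np.ndarray, z_ftr: np.ndarray) -> float:
    """(Eq. 2): L1 norm between the target mel and the summed source/filter mels."""
    # A plain sum of absolute differences; practical implementations often average instead.
    return float(np.abs(x_mel - (z_src + z_ftr)).sum())

# Toy speaker embeddings (dimension 2 is an arbitrary choice for this sketch).
s = np.array([1.0, 2.0])
s_rev = np.array([3.0, 4.0])
s_cmb = fuse_speaker_embeddings(s, s_rev)

# Placeholder encoder outputs with a (mel_bins, frames) = (2, 3) shape.
x_mel = np.ones((2, 3))
z_src = np.full((2, 3), 0.5)
z_ftr = np.full((2, 3), 0.25)
loss = reconstruction_loss(x_mel, z_src, z_ftr)
```

With equal weights the fused embedding is the midpoint of the two inputs. Other $\alpha,\beta$ values in $[0,1]$ shift it toward one embedding or the other.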

3. Experiments

- Settings

Dataset : LibriTTS, VCTK
Comparisons : StyleVC, DiffVC, Diff-HierVC, LM-VC, SEF-VC, StableVC, VALL-E

- Results

Overall, applying REWIND further improves VC performance

Ablation Study
- Equal weighting of $0.5$ achieves the best speaker similarity
