[Paper Review] VoiceGuider: Enhancing Out-of-Domain Performance in Parameter-Efficient Speaker-Adaptive Text-to-Speech via Autoguidance

feVeRin 2025. 4. 2. 20:24

VoiceGuider: Enhancing Out-of-Domain Performance in Parameter-Efficient Speaker-Adaptive Text-to-Speech via Autoguidance

When parameter-efficient fine-tuning is applied to a speaker-adaptive text-to-speech model, adaptation performance on out-of-domain speakers is limited
VoiceGuider
- A speaker-adaptive text-to-speech model reinforced with autoguidance
- Improves robustness to out-of-domain data through an autoguidance strengthening strategy
Paper (ICASSP 2025): Paper Link

1. Introduction

Speaker-adaptive Text-to-Speech (TTS) performs speech synthesis for unseen speakers
- Two options are zero-shot approaches such as VALL-E, P-Flow, and VoiceBox, and few-shot adaptation such as AdaSpeech and UnitSpeech
  1. Zero-shot adaptation reduces the need for additional training, but generally requires substantial training resources and a large model size
  2. Few-shot adaptation, in contrast, achieves strong synthesis quality at a lower training cost than zero-shot methods
    - In particular, VoiceTailor combines Low-Rank Adaptation (LoRA) with a diffusion decoder to enable efficient one-shot speaker-adaptive TTS
- BUT, most one-shot TTS models rely on well-constrained speech data, so they are not robust to in-the-wild references in real-world scenarios
  - As a result, performance degrades on Out-of-Distribution (OOD) data that deviates from the pre-training distribution

-> The paper therefore proposes VoiceGuider, a parameter-efficient speaker-adaptive TTS model that mitigates the performance degradation on OOD data

VoiceGuider
- Uses VoiceTailor, a parameter-efficient one-shot TTS model, as the backbone
- Introduces autoguidance, which guides generation with a degraded model, to improve adaptation performance on OOD data
- Identifies the optimal autoguidance strategy by exploring degraded model candidates

< Overview of VoiceGuider >

A parameter-efficient speaker-adaptive TTS model that adds autoguidance on top of VoiceTailor
As a result, it achieves better adaptation performance on OOD data than existing methods

2. Method

VoiceGuider aims to resolve OOD performance degradation in parameter-efficient speaker-adaptive TTS

- Background

Denoising Diffusion
- A diffusion model adds Gaussian noise to data over multiple steps and generates data through a denoising process
- For speaker-adaptive TTS, given a text embedding $c$ and a speaker embedding $S$, the diffusion model is trained with the following objective to recover the noise $\epsilon_t$ added to the noisy mel-spectrogram $X_t$:
  (Eq. 1) $\mathcal{L}(\theta) = \mathbb{E}_{t, X_0, \epsilon_t}\left[ \left\| \sqrt{1 - \lambda_t}\, s_\theta(X_t | c, S) + \epsilon_t \right\|_2^2 \right]$
  - $s_\theta$ : diffusion model, $\lambda_t$ : pre-defined noise schedule of Grad-TTS, $t \in [0, 1]$ : noise level
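The term inside the expectation of (Eq. 1) can be sketched for a single sample as follows. This is only a minimal illustration: the flat Python lists standing in for a mel-spectrogram, the function name `diffusion_loss`, and the value of `lam_t` are all assumptions made here, and the expectation over $t$, $X_0$, and $\epsilon_t$ is omitted.

```python
import math
import random

def diffusion_loss(score, noise, lam_t):
    """Squared L2 norm of sqrt(1 - lam_t) * score + noise for one sample.

    `score` plays the role of s_theta(X_t | c, S) and `noise` the role of
    epsilon_t; both are flat lists here (an assumption for illustration).
    """
    scale = math.sqrt(1.0 - lam_t)
    return sum((scale * s + e) ** 2 for s, e in zip(score, noise))

random.seed(0)
lam_t = 0.3  # hypothetical schedule value for some t
noise = [random.gauss(0.0, 1.0) for _ in range(4)]

# A prediction equal to -noise / sqrt(1 - lam_t) drives the term to zero.
perfect_score = [-e / math.sqrt(1.0 - lam_t) for e in noise]
# A prediction of all zeros leaves the squared norm of the noise itself.
zero_score = [0.0] * 4

print(diffusion_loss(perfect_score, noise, lam_t))
print(diffusion_loss(zero_score, noise, lam_t))
```

Under this form, minimizing the objective pushes the model output toward the negatively scaled noise, which is why the objective is described as recovering $\epsilon_t$.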
Diffusion Guidance
- A diffusion model can use Classifier-Free Guidance (CFG) to improve sample quality and the likelihood of the given condition
- At each generation step, CFG modifies the model prediction by extrapolating between two predictions:
  (Eq. 2) $\hat{s}_\gamma(X_t | c, S) = s_\theta(X_t | c, S) + \gamma \left( s_\theta(X_t | c, S) - s_\theta(X_t | c, \emptyset) \right)$
  - $\gamma$ : guidance scale
- Here, CFG pushes away from the unconditional distribution to avoid undesired speakers, which increases the likelihood of speaker $S$
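The extrapolation in (Eq. 2) is simple enough to write out directly. The sketch below assumes the two model predictions are already available as flat lists; the function name `cfg` and the toy numbers are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def cfg(cond, uncond, gamma):
    """Classifier-free guidance: cond + gamma * (cond - uncond), elementwise."""
    return [c + gamma * (c - u) for c, u in zip(cond, uncond)]

cond = [1.0, 2.0]    # stands in for s_theta(X_t | c, S)
uncond = [0.5, 2.5]  # stands in for s_theta(X_t | c, no speaker)

print(cfg(cond, uncond, 0.0))  # [1.0, 2.0], no guidance
print(cfg(cond, uncond, 1.0))  # [1.5, 1.5], pushed away from uncond
```

With $\gamma = 0$ the conditional prediction is returned unchanged, and a larger $\gamma$ moves the result further from the unconditional prediction along the difference direction.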

- VoiceTailor: Parameter-Efficient Baseline

Given a pre-trained diffusion model, VoiceTailor trains LoRA, a parameter-efficient adapter, with (Eq. 1) to adapt to a new speaker
- For each linear layer $W_0$ of the attention modules in the pre-trained diffusion model, VoiceTailor trains only a new matrix $\Delta W = BA$ with $B \in \mathbb{R}^{d \times r}, A \in \mathbb{R}^{r \times k}$
  - That is, with scale $\alpha$, $W = W_0 + \alpha \cdot BA$
- As a result, VoiceTailor chooses a small rank $r$ and trains only $0.25\%$ of all parameters
  - With this small number of parameters, it reaches speaker adaptation performance on par with UnitSpeech, the fine-tuning baseline
- BUT, VoiceTailor performs strongly on in-domain speakers close to the pre-training domain, while performing poorly on OOD speakers
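The LoRA update $W = W_0 + \alpha \cdot BA$ can be sketched with plain nested lists. The helper names (`matmul`, `lora_merge`, `lora_param_fraction`) and the layer size used below are assumptions for illustration only. Note that the fraction computed here is per layer, so it does not reproduce the $0.25\%$ figure, which is relative to the whole model.

```python
def matmul(a, b):
    """Multiply an (n x m) matrix by an (m x p) matrix, both as nested lists."""
    return [[sum(a[i][p] * b[p][j] for p in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def lora_merge(w0, b, a, alpha):
    """Return W = W0 + alpha * (B @ A); only B and A would be trained."""
    delta = matmul(b, a)
    return [[w + alpha * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(w0, delta)]

def lora_param_fraction(d, k, r):
    """Trainable LoRA parameters r * (d + k) relative to the d * k frozen ones."""
    return r * (d + k) / (d * k)

w0 = [[1.0, 0.0], [0.0, 1.0]]  # frozen 2 x 2 weight
b = [[1.0], [2.0]]             # d x r with r = 1
a = [[3.0, 4.0]]               # r x k with r = 1
print(lora_merge(w0, b, a, alpha=0.5))  # [[2.5, 2.0], [3.0, 5.0]]

# Hypothetical 256 x 256 layer with rank 16.
print(lora_param_fraction(256, 256, 16))  # 0.125
```

The per-layer fraction shrinks as the rank gets smaller, which is also why a low-rank adapter has limited capacity to absorb speakers far from the pre-training domain.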

- VoiceGuider: Eliminating LoRA Error with Autoguidance

In VoiceTailor, the limited capacity of LoRA causes prediction errors for OOD speakers, and these errors are amplified through the iterative generation process of the diffusion model, which can degrade performance
- The paper therefore exploits the fact that CFG steers away from undesired samples to eliminate LoRA errors in speaker-adaptive TTS
Autoguidance
- In diffusion guidance, the strong conditional model and the unconditional model share correlated errors
  - Empirically, the errors of the unconditional model are over-emphasized, so the CFG extrapolation can eliminate them
- Autoguidance guides the strong model $s_1(X_t | c, S)$ with an inferior model $s_0(X_t | c, S)$ instead of the unconditional model $s_1(X_t | c, \emptyset)$; combining autoguidance with CFG gives:
  (Eq. 3)
  - $\gamma_S, \gamma_a$ : scales of CFG and autoguidance, respectively
  - As a result, it is possible to amplify the speaker likelihood and reduce LoRA errors
- In addition, to identify the optimal autoguidance strategy for speaker-adaptive TTS, the following candidates for the inferior model $s_0$ are considered:
  1. Shorter Training Time
    - $s_0$ is a model trained for a shorter time than $s_1$
    - It can be derived from an intermediate checkpoint of $s_1$, so the inferior model is easy to obtain
  2. Smaller LoRA Rank
    - $s_0$ uses a smaller LoRA rank than $s_1$
    - In VoiceTailor, smaller ranks such as $2, 4$ are inferior to rank $16$
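One plausible way to combine the two guidance terms is to add an autoguidance extrapolation, from the inferior model $s_0$ toward the strong model $s_1$, on top of the CFG extrapolation of (Eq. 2). The additive form below is an assumption made for illustration and may differ from the paper's exact formulation; the function name and toy values are hypothetical as well.

```python
def combined_guidance(s1_cond, s1_uncond, s0_cond, gamma_s, gamma_a):
    """Assumed additive combination of CFG and autoguidance (elementwise).

    s1_cond   : strong model with speaker condition, s_1(X_t | c, S)
    s1_uncond : strong model without speaker condition
    s0_cond   : inferior model with speaker condition, s_0(X_t | c, S)
    """
    return [c + gamma_s * (c - u) + gamma_a * (c - w)
            for c, u, w in zip(s1_cond, s1_uncond, s0_cond)]

s1_cond = [1.0, 2.0]
s1_uncond = [0.5, 2.5]
s0_cond = [0.75, 2.0]  # e.g. an earlier checkpoint or a smaller-rank LoRA

# Both scales at zero: the strong prediction is returned unchanged.
print(combined_guidance(s1_cond, s1_uncond, s0_cond, 0.0, 0.0))  # [1.0, 2.0]
# Both scales at one: pushed away from the unconditional and inferior outputs.
print(combined_guidance(s1_cond, s1_uncond, s0_cond, 1.0, 1.0))  # [1.75, 1.5]
```

In this sketch, setting `gamma_a` to zero recovers plain CFG, so the autoguidance term can be read as an extra correction aimed at the error that the inferior model exaggerates.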
Limited Guidance Interval
- Autoguidance is detrimental in certain intervals of the generation process
  - In particular, at high noise levels guidance blindly pushes samples away from the data distribution, causing mode dropping and degraded sample quality
- Empirically, guidance is detrimental in the range $t \in [0.6, 1.0]$, so disabling both CFG and autoguidance in this range reduces errors and yields higher speaker similarity
- As a result, the guidance scales $\gamma_S, \gamma_a$ become:
  (Eq. 4) $\gamma(t) = \begin{cases} \gamma, & \text{if } t \in (t_{\text{lo}}, t_{\text{hi}}] \\ 0, & \text{otherwise} \end{cases}$
  - In the disabled interval, the inferior model $s_0$ of autoguidance can be regarded as equivalent to the fine-tuned model $s_1$
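The gating in (Eq. 4) amounts to a piecewise-constant scale. The sketch below uses hypothetical bounds `t_lo = 0.0` and `t_hi = 0.6`, chosen only so that guidance is off at the high noise levels described above; the function name is also an assumption.

```python
def gated_scale(gamma, t, t_lo, t_hi):
    """Return gamma when t lies in (t_lo, t_hi], and 0.0 otherwise."""
    return gamma if t_lo < t <= t_hi else 0.0

T_LO, T_HI = 0.0, 0.6  # hypothetical interval bounds

print(gated_scale(1.0, 0.3, T_LO, T_HI))  # 1.0, guidance active
print(gated_scale(1.0, 0.8, T_LO, T_HI))  # 0.0, guidance disabled
```

The same gate would be applied to both $\gamma_S$ and $\gamma_a$, so outside the interval the guided prediction reduces to the unguided output of the fine-tuned model.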

3. Experiments

- Settings

Datasets: LibriTTS, VCTK, GigaSpeech
Comparisons: XTTS, CosyVoice, UnitSpeech, VoiceTailor

- Results

Problem Statement
- Comparing SECS across three datasets (in-domain, OOD, and in-the-wild OOD), the performance gap grows as the data moves further from the training data domain

Model Comparison
- On GigaSpeech, the in-the-wild OOD data, VoiceGuider achieves preference comparable to XTTS and CosyVoice, which use heavy data for pre-training

Ablation Study
- First, the best performance is obtained with 100 training iterations
- In terms of LoRA rank $r$, smaller ranks show no meaningful variation
- The autoguidance scale performs best at 1
- Tightening the upper bound improves adaptation performance, whereas raising the lower bound degrades CER
