[Paper Review] VoiceTailor: Lightweight Plug-In Adapter for Diffusion-based Personalized Text-to-Speech

feVeRin 2024. 10. 3. 10:02

VoiceTailor: Lightweight Plug-In Adapter for Diffusion-based Personalized Text-to-Speech

Combining a personalized adapter with a pre-trained diffusion-based model enables parameter-efficient speaker-adaptive Text-to-Speech
VoiceTailor
- Uses Low-Rank Adaptation for parameter-efficient adaptation and integrates the adapter into the pivotal modules of the pre-trained diffusion decoder
- Adopts a guidance technique and speaker information strengthening to achieve strong adaptation with only a few parameters
Paper (INTERSPEECH 2024) : Paper Link

1. Introduction

Adaptive Text-to-Speech (TTS) can follow either a zero-shot or a one-shot approach
- Zero-shot methods such as VoiceBox, VALL-E, and P-Flow have the advantage of needing no additional fine-tuning on the reference audio for speaker adaptation
  - BUT, they generally require a large speech corpus during training to reach high speaker similarity
- One-shot methods such as AdaSpeech instead fine-tune a pre-trained TTS model on a few reference utterances of the target speaker to perform personalized TTS
  1. For efficient adaptation to the target speaker, they can fine-tune only a parameter subset, apply Low-Rank Adaptation (LoRA), or fine-tune adapter parameters with techniques such as prefix-tuning
  2. BUT, their speaker similarity is limited by the decoder model, and fine-tuning still requires more than 1 minute of speech data
- Alternatively, diffusion-based one-shot TTS models built on DDPM, such as Guided-TTS and UnitSpeech, can be used for fine-tuning-based personalized generation
  1. Thanks to the adaptation ability of diffusion models, these methods reach high speaker similarity with only 5~10 seconds of reference speech
  2. BUT, unlike the earlier one-shot methods, they fine-tune all model parameters and are therefore parameter-inefficient

-> Hence the paper proposes VoiceTailor, a parameter-efficient adaptive TTS model that fine-tunes only a parameter subset of a pre-trained diffusion-based TTS model

VoiceTailor
- Builds on a pre-trained diffusion-based model and uses the fine-tuning method of UnitSpeech
  - The change ratio of each module's weights before and after fine-tuning is analyzed to verify the importance of the attention modules
- As a result, LoRA is applied to the attention modules, and only the injected low-rank matrices are fine-tuned for adaptation
- In addition, speaker information is strengthened by selecting optimal hyperparameters in the parameter-efficient adaptation stage and applying a guidance technique in the inference stage

< Overview of VoiceTailor >

Applies LoRA to diffusion-based speaker-adaptive TTS
Additionally strengthens speaker information using the LoRA module and classifier-free guidance
As a result, speaker adaptation cost is greatly reduced by using only 0.25% of all parameters on a single GPU

2. Method

VoiceTailor aims to build a personalized TTS model with LoRA to resolve the parameter inefficiency of existing diffusion-based one-shot TTS
- First, VoiceTailor captures target speaker characteristics through LoRA fine-tuning and a speaker embedding extracted from the reference audio
- In addition, the paper analyzes the weight change ratio of UnitSpeech to find where the adapter is most effective, and strengthens the speaker information at inference
- As a result, by injecting the optimal LoRA weights and applying a guidance strategy, VoiceTailor achieves personalized TTS while fine-tuning only 0.25% of all parameters

- UnitSpeech

The paper performs one-shot TTS on top of UnitSpeech
- UnitSpeech builds a personalized TTS model by fine-tuning a pre-trained multi-speaker diffusion-based TTS model on a short untranscribed speech sample
- Following Grad-TTS, the multi-speaker diffusion-based TTS model defines a forward process that transforms the mel-spectrogram $X_{0}$ into Gaussian noise $X_{T} \sim \mathcal{N} (0, I)$
  1. The forward process is defined by a pre-defined noise schedule $β_{t}$ and a Wiener process $W_{t}$
  2. That is, the noisy mel-spectrogram $X_{t}$ at time step $t \in [0, T]$ (with $T = 1$ in Eq. 1) follows:
    (Eq. 1) $d X_{t} = -\frac{1}{2} X_{t} β_{t} d t + \sqrt{β_{t}} d W_{t}, \quad t \in [0, 1]$
    (Eq. 2) $X_{t} = \sqrt{e^{-\int_{0}^{t} β_{s} d s}} X_{0} + \sqrt{1 - e^{-\int_{0}^{t} β_{s} d s}} ϵ_{t}$
    - $ϵ_{t}$ : noise sampled from the standard normal distribution
- To sample a mel-spectrogram along the reverse trajectory of the forward process, the score $s (X_{t} | c_{y}, e_{S})$ conditioned on the text encoder output $c_{y}$ and the speaker embedding $e_{S}$ extracted from a pre-trained speaker encoder is required
  1. The diffusion-based decoder $θ$ of UnitSpeech is trained to predict the conditional score $s_{θ} (X_{t} | c_{y}, e_{S})$
  2. The loss function for decoder pre-training and the sampling step with the predicted score are:
    (Eq. 3) $\mathcal{L} = \mathbb{E}_{t, X_{0}, ϵ_{t}} \left[ \left\| \sqrt{1 - e^{-\int_{0}^{t} β_{s} d s}} \, s_{θ} (X_{t} | c_{y}, e_{S}) + ϵ_{t} \right\|_{2}^{2} \right]$
    (Eq. 4) $X_{t - Δ t} = X_{t} + β_{t} \left( \frac{1}{2} X_{t} + s_{θ} (X_{t} | c_{y}, e_{S}) \right) Δ t + \sqrt{β_{t} Δ t} \, z_{t}$
    - $z_{t} \sim \mathcal{N} (0, I)$ : Gaussian noise
- On top of this, UnitSpeech introduces a unit encoder so that the pre-trained diffusion decoder can be fine-tuned on untranscribed speech, enabling speaker adaptation without text input
  1. The unit encoder takes acoustic units, self-supervised speech representations that carry phonetic information, as input and replaces the text encoder
  2. That is, UnitSpeech swaps the text encoder for this pluggable unit encoder, trains the unit encoder with the same objective used for decoder pre-training, and then fine-tunes with the reference audio and its units to perform speaker adaptation
- In addition, UnitSpeech applies classifier-free guidance, which strengthens conditioning information in diffusion models, to the text encoder output $c_{y}$ to improve pronunciation accuracy
  1. Extending this, VoiceTailor also applies classifier-free guidance to the speaker embedding $e_{S}$
  2. That is, a learnable unconditional embedding $e_{ϕ}$ is introduced while pre-training the multi-speaker TTS model, and $e_{S}$ is replaced by $e_{ϕ}$ with a probability of 25%
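
The forward sampling of Eq. 2, the loss of Eq. 3, and the reverse step of Eq. 4 can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the linear schedule and its endpoints `BETA_0`, `BETA_1` are hypothetical choices, vectors are flat lists instead of mel-spectrograms, and the score is passed in as a list rather than predicted by a decoder.

```python
import math
import random

# Assumption: a linear schedule beta_t = BETA_0 + (BETA_1 - BETA_0) * t on t in [0, 1].
# The endpoint values are illustrative, not taken from the post.
BETA_0, BETA_1 = 0.05, 20.0

def beta(t):
    return BETA_0 + (BETA_1 - BETA_0) * t

def integrated_beta(t):
    """Closed form of the integral of beta_s over [0, t] for the linear schedule."""
    return BETA_0 * t + 0.5 * (BETA_1 - BETA_0) * t * t

def forward_sample(x0, t, rng):
    """Eq. 2: draw X_t given X_0 and return it with the noise eps_t that was used."""
    keep = math.exp(-integrated_beta(t))  # e^{-int_0^t beta_s ds}
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [math.sqrt(keep) * x + math.sqrt(1.0 - keep) * e for x, e in zip(x0, eps)]
    return xt, eps

def denoising_loss(score, eps, t):
    """Eq. 3 for a single sample: || sqrt(lambda_t) * score + eps ||^2."""
    lam = 1.0 - math.exp(-integrated_beta(t))
    return sum((math.sqrt(lam) * s + e) ** 2 for s, e in zip(score, eps))

def reverse_step(xt, t, dt, score, rng):
    """Eq. 4: one step from X_t to X_{t - dt} using a score estimate."""
    b = beta(t)
    return [
        x + b * (0.5 * x + s) * dt + math.sqrt(b * dt) * rng.gauss(0.0, 1.0)
        for x, s in zip(xt, score)
    ]
```

The loss is zero exactly when the score equals $-ϵ_{t} / \sqrt{1 - e^{-\int_{0}^{t} β_{s} d s}}$, which is why training against the sampled noise teaches the decoder the conditional score.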

- Parameter-Efficient Speaker Adaptation

VoiceTailor adopts LoRA, a parameter-efficient adaptation method, to resolve the inefficiency of fine-tuning all parameters during speaker adaptation
- LoRA fine-tunes the weight matrix of a linear layer by adding trainable low-rank decomposed matrices
  1. Given the pre-trained weight $W \in \mathbb{R}^{d \times k}$ of a linear layer, LoRA augments it as $W + α \cdot Δ W = W + α \cdot B A$
  2. Here the parameter $Δ W := W_{LoRA}$ is fine-tuned while $W$ stays frozen
    - $B \in \mathbb{R}^{d \times r}, A \in \mathbb{R}^{r \times k}$ : low-rank matrices, $α$ : scaling factor of the adapter matrix, $r$ : rank
  3. As a result, LoRA adapts with far fewer parameters by choosing a rank $r$ much smaller than the original dimensions $d, k$ ( $r ≪ d, k$ )
    - The pre-trained model parameters are denoted by $θ$, and the model parameters with the fine-tuned adapter $W_{LoRA}$ by $θ^{*}$
- Based on UnitSpeech, one can fine-tune all decoder parameters for speaker adaptation and then analyze which modules matter most for speaker adaptation
  1. To this end, the paper computes the weight change ratio $\| θ_{i}^{*} - θ_{i} \| / \| θ_{i} \|$ of each module $θ_{i}$ before and after fine-tuning (here $θ_{i}^{*}$ is the fully fine-tuned module)
  2. The average change ratios of the attention modules and of the other modules in UnitSpeech's diffusion decoder are 0.0282 and 0.0050, respectively
    - That is, the attention modules play the most important role in adapting the one-shot diffusion TTS model
- Therefore, VoiceTailor injects LoRA into the attention modules to optimize the parameters used for speaker adaptation
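
The LoRA update and the weight change ratio above can be illustrated with a small stdlib-only sketch. The helper names (`matmul`, `lora_merge`, `lora_param_count`, `change_ratio`) are hypothetical, matrices are plain lists of rows, and the Frobenius norm is an assumption since the post does not say which norm is used.

```python
import math

def matmul(B, A):
    """Multiply a d x r matrix by an r x k matrix (both given as lists of rows)."""
    return [[sum(b * a for b, a in zip(row, col)) for col in zip(*A)] for row in B]

def lora_merge(W, B, A, alpha):
    """Return W + alpha * B A as a new matrix; the frozen W is left untouched."""
    delta = matmul(B, A)
    return [
        [w + alpha * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(W, delta)
    ]

def lora_param_count(d, k, r):
    """Trainable parameters of one adapter: B has d * r entries, A has r * k."""
    return r * (d + k)

def change_ratio(W_new, W_old):
    """||W_new - W_old|| / ||W_old||, using the Frobenius norm (an assumption)."""
    diff = math.sqrt(
        sum((n - o) ** 2 for rn, ro in zip(W_new, W_old) for n, o in zip(rn, ro))
    )
    base = math.sqrt(sum(o ** 2 for ro in W_old for o in ro))
    return diff / base

# Toy usage with d = k = 2 and rank r = 1.
W = [[1.0, 0.0], [0.0, 1.0]]
B = [[1.0], [2.0]]
A = [[3.0, 4.0]]
W_adapted = lora_merge(W, B, A, alpha=0.5)
```

For a hypothetical 512 x 512 layer with $r = 4$, `lora_param_count(512, 512, 4)` gives 4096 trainable values against 262144 in the full matrix, which is the kind of saving the $r ≪ d, k$ condition buys.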

- Speaker Information Strengthening Strategies

The fine-tuned adapter is combined with the pre-trained multi-speaker TTS model to form a personalized TTS model
- In VoiceTailor, speaker information is provided in two forms: the speaker embedding $e_{S}$ and the pluggable LoRA weight $W_{LoRA}$
  - To mitigate the drop in speaker adaptation performance caused by the reduced parameter count, sampling methods that strengthen target speaker information can be used
- The paper therefore explores two strategies: setting the LoRA scaling factor $α$ larger than the value used during fine-tuning, and applying classifier-free guidance
Adjustment of LoRA Scaling Factor
- $α$ controls the intensity with which the adapter is added to the pre-trained model for speaker adaptation
- Using a larger $α$ during generation than during training can inject stronger speaker information through the low-rank adapter
Classifier-Free Guidance
- Since speaker information in VoiceTailor comes from two sources, $e_{S}$ and $W_{LoRA}$, classifier-free guidance is considered with respect to each source
- Given the score of the fine-tuned model $s_{θ^{*}} (X_{t} | c, e_{S})$, three candidates for the unconditional score $s_{uncon}$ can be considered:
  1. $s_{θ^{*}} (X_{t} | c, e_{ϕ})$ : obtained by replacing $e_{S}$ with the unconditional embedding $e_{ϕ}$ while keeping the speaker information provided by $W_{LoRA}$
  2. $s_{θ} (X_{t} | c, e_{S})$ : obtained from the pre-trained model $θ$ by removing $W_{LoRA}$ and keeping $e_{S}$ as input
  3. $s_{θ} (X_{t} | c, e_{ϕ})$ : the $s_{uncon}$ with neither $e_{S}$ nor $W_{LoRA}$, i.e. without any speaker information
- The modified score $\hat{s}$ is then computed by applying classifier-free guidance with the chosen unconditional score:
  (Eq. 5) ${\hat{s}}_{θ^{*}} (X_{t} | c, e_{S}) = s_{θ^{*}} (X_{t} | c, e_{S}) + γ_{S} \cdot (s_{θ^{*}} (X_{t} | c, e_{S}) - s_{uncon})$
  - $γ_{S}$ : gradient scale that determines the intensity of the additional speaker information
- In the end, VoiceTailor adopts $s_{uncon} = s_{θ^{*}} (X_{t} | c, e_{ϕ})$ to generate samples
  - Among adjusting $α$ and the three candidates for $s_{uncon}$, every option other than classifier-free guidance with $s_{uncon} = s_{θ^{*}} (X_{t} | c, e_{ϕ})$ hurts speaker adaptation
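
As a rough illustration of Eq. 5 and of the embedding dropout used to learn $e_{ϕ}$ during pre-training, the sketch below works on flat lists of numbers. The function names `pick_speaker_condition` and `guided_score` are hypothetical, and the two scores are supplied by hand; in practice each would come from a forward pass of the decoder with and without the speaker embedding.

```python
import random

def pick_speaker_condition(e_s, e_phi, rng, p_uncond=0.25):
    """Training-time dropout: use the unconditional embedding with probability p_uncond."""
    return e_phi if rng.random() < p_uncond else e_s

def guided_score(s_cond, s_uncon, gamma_s):
    """Eq. 5: s_cond + gamma_s * (s_cond - s_uncon), applied element-wise."""
    return [c + gamma_s * (c - u) for c, u in zip(s_cond, s_uncon)]

# Toy usage: s_cond stands in for s_{theta*}(X_t | c, e_S) and
# s_uncon for the adopted candidate s_{theta*}(X_t | c, e_phi).
s_cond = [1.0, -2.0]
s_uncon = [0.5, -1.0]
s_hat = guided_score(s_cond, s_uncon, gamma_s=1.0)
```

With $γ_{S} = 0$ the guided score reduces to the conditional score, and larger $γ_{S}$ pushes the sample further along the direction in which the speaker embedding changes the prediction.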

3. Experiments

- Settings

Dataset : LibriTTS
Comparisons : UnitSpeech, XTTS, YourTTS

- Results

Model Comparison
- Overall, VoiceTailor shows the best performance
- In particular, VoiceTailor reaches UnitSpeech-level speaker similarity while fine-tuning only 0.25% of the parameters

Parameter-Efficient Fine-Tuning
- Adding trainable low-rank matrices to linear layers other than the attention modules does not improve pronunciation accuracy or speaker similarity
  - That is, the attention modules are the most important for speaker adaptation
- The scale of $W_{LoRA}$ during fine-tuning is determined by $α$
  - Similar speaker similarity is achieved unless $α$ is too small, e.g. $α = 1$
- A small LoRA rank such as $r = 2$ lowers SECS but yields a VoiceTailor that runs with only 39K parameters

Speaker Information Strengthening Method
- Except for classifier-free guidance based on the speaker embedding $e_{S}$ ( $s_{uncon} = s_{θ^{*}} (X_{t} | c, e_{ϕ})$ ), the other methods degrade speaker adaptation performance
- In particular, increasing the LoRA scaling factor $α$ beyond the value used for fine-tuning degrades both CER and SECS
  - Therefore, the paper uses only speaker embedding guidance with $γ_{S} = 1$
