[Paper Review] UnitSpeech: Speaker-Adaptive Speech Synthesis with Untranscribed Data

feVeRin 2024. 10. 1. 09:50

UnitSpeech: Speaker-Adaptive Speech Synthesis with Untranscribed Data

A diffusion-based text-to-speech model can be fine-tuned using only a minimal amount of untranscribed data
UnitSpeech
- Uses self-supervised unit representations as pseudo transcripts and integrates a unit encoder into a pre-trained text-to-speech model
- Trains the unit encoder to provide speech content to the diffusion-based decoder, then fine-tunes the decoder on a single $\langle \text{unit}, \text{speech} \rangle$ pair to adapt to the reference speaker
Paper (INTERSPEECH 2023) : Paper Link

1. Introduction

Adaptive text-to-speech (TTS) models generate a personalized voice from a target speaker's reference speech
- Prior work such as YourTTS and AdaSpeech either uses a target speaker embedding or fine-tunes on a small amount of data
  - BUT, although fine-tuning can reach high speaker similarity, it usually requires transcripts paired with the speech
- Recently, diffusion models have shown strong personalization performance in text-to-image generation and are being extended to adaptive TTS
- In particular, Guided-TTS combines a diffusion model with classifier guidance to achieve high-quality adaptive TTS from about 10 seconds of untranscribed speech
  - BUT, Guided-TTS relies on an unconditional generative model, which makes training difficult and time-consuming

-> Therefore, the paper proposes UnitSpeech, which performs personalized TTS by fine-tuning a pre-trained diffusion TTS model on a small amount of untranscribed speech

UnitSpeech
- Adopts multi-speaker Grad-TTS as the backbone TTS model for speaker adaptation
- Introduces a unit encoder that uses self-supervised unit representations to provide speech content to the diffusion decoder without transcripts
  - The unit encoder is trained to condition the diffusion decoder on speech content given the input units
- For speaker adaptation, the pre-trained diffusion model conditioned on the unit encoder is fine-tuned on a $\langle \text{unit}, \text{speech} \rangle$ pair from the target speaker
  - As a result, UnitSpeech customizes the diffusion decoder to the target speaker and supports a range of adaptive speech synthesis tasks

< Overview of UnitSpeech >

Introduces unit representations for speaker adaptation, along with a guidance technique that improves pronunciation accuracy in adaptive synthesis
A pluggable unit encoder for the pre-trained TTS model makes fine-tuning on untranscribed speech possible
As a result, it achieves better synthesis performance than prior methods

2. Method

UnitSpeech aims to personalize a diffusion-based TTS model using only untranscribed data
- To personalize the diffusion model without transcripts, it introduces a unit encoder that replaces the text encoder during fine-tuning and encodes the speech content
- With this unit encoder, the pre-trained TTS model can be adapted to a target speaker across various tasks

- Diffusion-based Text-to-Speech Model

UnitSpeech adopts multi-speaker Grad-TTS as its pre-trained diffusion-based TTS model
- Structurally, like Grad-TTS, it consists of a text encoder, a duration predictor, and a diffusion-based decoder, and it additionally takes speaker information for multi-speaker modeling
  - The speaker information is a speaker embedding extracted by a speaker encoder
- A diffusion-based TTS model first defines a forward process that transforms the mel-spectrogram $X_0$ into Gaussian noise $z = X_T \sim \mathcal{N}(0, I)$, then generates samples by reversing that process
  1. Grad-TTS defines the prior distribution using the mel-spectrogram-aligned text encoder output, whereas UnitSpeech uses a standard normal distribution as the prior
  2. The forward process of the diffusion model is then:
    (Eq. 1) $dX_t = -\frac{1}{2} X_t \beta_t \, dt + \sqrt{\beta_t} \, dW_t, \quad t \in [0, T]$
    - $\beta_t$ : pre-defined noise schedule, $W_t$ : Wiener process
    - $T$ is set to 1
- The pre-trained diffusion decoder predicts the score needed for sampling through the reverse process
  1. For pre-training, data $X_0$ is corrupted by the forward process into noisy data $X_t = \sqrt{1 - \lambda_t} X_0 + \sqrt{\lambda_t} \epsilon_t$,
  2. and the decoder is trained to estimate the conditional score given the aligned text encoder output $c_y$ and the speaker embedding $e_S$
  3. This corresponds to the following training objective:
    (Eq. 2) $\mathcal{L}_{grad} = \mathbb{E}_{t, X_0, \epsilon_t} \left[ \| \sqrt{\lambda_t} s_\theta (X_t, t | c_y, e_S) + \epsilon_t \|_2^2 \right]$
    - $\lambda_t = 1 - e^{-\int_0^t \beta_s ds}, \quad t \in [0, 1]$
- Since the estimated score $s_\theta$ is the output of the diffusion decoder, the model can generate the mel-spectrogram $X_0$ given a transcript and a speaker embedding using the following discretized reverse process:
  (Eq. 3) $X_{t - \frac{1}{N}} = X_t + \frac{\beta_t}{N} \left( \frac{1}{2} X_t + s_\theta (X_t, t | c_y, e_S) \right) + \sqrt{\frac{\beta_t}{N}} z_t$
  - $N$ : number of sampling steps
- In addition to $\mathcal{L}_{grad}$ in (Eq. 2), the pre-trained TTS model uses Monotonic Alignment Search (MAS) from Glow-TTS to align the text encoder output with the mel-spectrogram
  1. The encoder loss $\mathcal{L}_{enc} = \text{MSE}(c_y, X_0)$ is then typically used to minimize the distance between the aligned text encoder output $c_y$ and the mel-spectrogram $X_0$
  2. Here, to disentangle the text encoder from speaker identity, the paper does not feed the speaker embedding $e_S$ to the text encoder, and minimizes the distance between the speaker-independent representation $c_y$ and $X_0$
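
The corruption formula, (Eq. 2), and (Eq. 3) above can be sketched in NumPy as follows. This is only a minimal illustration, not the authors' implementation: the linear schedule and its endpoint values `BETA_0`, `BETA_1` are assumptions made here, and `toy_score` is a hypothetical stand-in for the trained decoder $s_\theta$ that ignores the conditions $c_y$ and $e_S$.

```python
import numpy as np

# Assumed linear noise schedule beta_t = BETA_0 + (BETA_1 - BETA_0) * t on t in [0, 1].
# The endpoint values are illustrative, not taken from the post.
BETA_0, BETA_1 = 0.05, 20.0

def beta(t):
    return BETA_0 + (BETA_1 - BETA_0) * t

def lam(t):
    # lambda_t = 1 - exp(-int_0^t beta_s ds); closed form for the linear schedule
    integral = BETA_0 * t + 0.5 * (BETA_1 - BETA_0) * t ** 2
    return 1.0 - np.exp(-integral)

def corrupt(x0, t, rng):
    # X_t = sqrt(1 - lambda_t) * X_0 + sqrt(lambda_t) * eps_t
    eps = rng.standard_normal(x0.shape)
    xt = np.sqrt(1.0 - lam(t)) * x0 + np.sqrt(lam(t)) * eps
    return xt, eps

def grad_loss(score, eps, t):
    # Single-sample estimate of Eq. 2: || sqrt(lambda_t) * s + eps ||_2^2
    return float(np.sum((np.sqrt(lam(t)) * score + eps) ** 2))

def reverse_step(xt, t, score, n_steps, rng):
    # One step of Eq. 3:
    # X_{t-1/N} = X_t + (beta_t / N) * (X_t / 2 + s) + sqrt(beta_t / N) * z_t
    z = rng.standard_normal(xt.shape)
    b = beta(t)
    return xt + (b / n_steps) * (0.5 * xt + score) + np.sqrt(b / n_steps) * z

def toy_score(xt, t):
    # Hypothetical stand-in for s_theta: the exact score when X_0 = 0,
    # in which case X_t ~ N(0, lambda_t * I). Only valid for t > 0.
    return -xt / lam(t)
```

With $X_0 = 0$ the toy score is exact, so the loss in (Eq. 2) vanishes up to rounding, which gives a quick sanity check of the sign convention ($\sqrt{\lambda_t} s_\theta \approx -\epsilon_t$).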

- Unit Encoder Training

Performing high-quality adaptation to untranscribed reference data with the pre-trained TTS model alone is difficult
- The paper therefore combines a unit encoder with the pre-trained TTS model to improve adaptation performance
- Structurally, the unit encoder has the same architecture as the text encoder of the TTS model
  1. Instead of a transcript, the unit encoder takes units, a discretized representation, as input, which enables adaptation on untranscribed data
    - The units are discretized representations obtained from HuBERT, a self-supervised speech model
  2. In the unit extraction process, the speech waveform is fed to HuBERT, and the output representations are discretized into unit clusters via $K$-means clustering to produce a unit sequence
    - Choosing an appropriate number of clusters constrains the units so that they mainly carry the desired speech content
  3. The unit sequence from HuBERT is then upsampled to the mel-spectrogram length and compressed into unit durations $d_u$ and a squeezed unit sequence $u$
- The unit encoder, plugged into the pre-trained TTS model and taking the squeezed unit sequence $u$ as input, plays the same role as the text encoder
  1. The unit encoder is trained with the same objective $\mathcal{L} = \mathcal{L}_{grad} + \mathcal{L}_{enc}$
    - The only change is that $c_y$ is replaced by $c_u$, the unit encoder output extended using the ground-truth durations $d_u$
  2. As a result, $c_u$ lies in the same space as $c_y$, so the text encoder can be replaced by the unit encoder during fine-tuning
    - In this stage the diffusion decoder is frozen and only the unit encoder is trained
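
The unit extraction steps above (cluster assignment, upsampling to the mel-spectrogram length, squeezing into $u$ and $d_u$, and extending the encoder output back to frame level) can be illustrated with a small NumPy sketch. All function names here are hypothetical, the HuBERT features and $K$-means centroids are assumed to be given as arrays, and nearest-neighbour upsampling is an assumption since the post does not say how the upsampling is done.

```python
import numpy as np

def assign_units(features, centroids):
    # K-means assignment: each frame gets the index of its nearest centroid.
    # features: (frames, dim), centroids: (K, dim)
    d2 = ((features[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    return d2.argmin(axis=1)

def upsample_units(units, n_mel_frames):
    # Nearest-neighbour upsampling of the unit sequence to the mel length
    # (assumed scheme; requires a non-empty unit sequence).
    units = np.asarray(units)
    idx = np.arange(n_mel_frames) * len(units) // n_mel_frames
    return units[idx]

def squeeze_units(frame_units):
    # Run-length encode a non-empty frame-level sequence into (u, d_u):
    # consecutive repeats collapse to one unit plus its duration in frames.
    frame_units = np.asarray(frame_units)
    change = np.flatnonzero(np.diff(frame_units)) + 1
    starts = np.concatenate(([0], change))
    durations = np.diff(np.concatenate((starts, [len(frame_units)])))
    return frame_units[starts], durations

def expand_by_duration(encoder_out, durations):
    # Repeat each squeezed-unit encoding d_u times to get the frame-aligned c_u.
    return np.repeat(encoder_out, durations, axis=0)
```

For example, the frame-level sequence `[3, 3, 3, 7, 7, 1, 1, 1, 1]` squeezes to `u = [3, 7, 1]` with `d_u = [3, 2, 4]`, and repeating `u` by `d_u` recovers the original sequence, so the frame-aligned $c_u$ has the same length as the mel-spectrogram.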

- Speaker-Adaptive Speech Synthesis

Combining the pre-trained TTS model with the pluggable unit encoder enables various speech synthesis tasks from a single untranscribed utterance of the target speaker
- First, the squeezed units $u'$ and unit durations $d_{u'}$ are extracted from the reference speech, and the decoder of the TTS model is fine-tuned using the unit encoder
  - To minimize pronunciation deterioration, the unit encoder is frozen and only the diffusion decoder is trained, using the (Eq. 2) objective with $c_y$ replaced by $c_{u'}$
- The trained model can then synthesize adaptive speech from either a transcript or units as input
  1. For TTS, $c_y$ is given to the fine-tuned decoder as the condition to generate personalized speech for the given transcript
  2. For Voice Conversion (VC), the squeezed units $u$ and unit durations $d_u$ are extracted from the given source speech using HuBERT
    - These are fed to the unit encoder, which outputs $c_u$, and the adapted diffusion decoder generates the converted speech conditioned on $c_u$
- Additionally, pronunciation can be further improved during sampling by using the unconditional score of classifier-free guidance to amplify the degree of conditioning on the target condition
  1. Classifier-free guidance requires an unconditional embedding $e_\Phi$ to estimate the unconditional score
  2. Because the encoder loss drives the encoder output space toward mel-spectrograms, $e_\Phi$ is set to the mel-spectrogram mean of the dataset, $c_{mel}$
  3. The modified score for classifier-free guidance is then:
    (Eq. 4) $\hat{s} (X_t, t | c_c, e_S) = s (X_t, t | c_c, e_S) + \gamma \cdot \alpha_t$
    - $c_c$ : aligned output of the text/unit encoder
    - $\gamma$ : gradient scale that determines how much condition information is provided
    - $\alpha_t = s (X_t, t | c_c, e_S) - s (X_t, t | e_\Phi, e_S)$ : difference between the conditional and unconditional scores
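
As a rough illustration of (Eq. 4), the guided score can be computed from two decoder evaluations, one with the real condition and one with the unconditional embedding. This sketch assumes $\alpha_t$ is the usual classifier-free guidance direction (conditional score minus unconditional score); `guided_score` and `unconditional_condition` are hypothetical helper names, and the scores are plain arrays rather than outputs of an actual decoder.

```python
import numpy as np

def unconditional_condition(mel_mean, n_frames):
    # Build e_Phi by tiling the dataset-level mel-spectrogram mean (c_mel)
    # over all frames; mel_mean has shape (n_mels,).
    return np.tile(mel_mean[None, :], (n_frames, 1))

def guided_score(score_cond, score_uncond, gamma):
    # Eq. 4 with alpha_t = s(X_t, t | c_c, e_S) - s(X_t, t | e_Phi, e_S)
    # (assumed standard classifier-free guidance form).
    alpha = score_cond - score_uncond
    return score_cond + gamma * alpha
```

Setting `gamma = 0` returns the plain conditional score, while larger values push the sample further along the direction that separates the conditional score from the unconditional one.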

3. Experiments

- Settings

Dataset : LibriTTS
Comparisons
- TTS task : Guided-TTS, YourTTS
- VC task : DiffVC, BNE-PPG-VC

- Results

Adaptive Text-to-Speech
- Overall, UnitSpeech shows the best synthesis quality among the compared models

Any-to-Any Voice Conversion
- On the VC task as well, UnitSpeech achieves the best performance

Analysis
- Number of Unit Clusters
  - The number of clusters $K$ has little effect on TTS performance
  - BUT, on the VC task, a larger $K$ gives better pronunciation accuracy
- Fine-Tuning
  - Pronunciation accuracy and speaker similarity improve as the amount of reference speech used for fine-tuning increases
- Gradient Scale in Classifier-Free Guidance
  - Guidance can substantially improve pronunciation at the cost of some speaker similarity
  - Setting the gradient scale $\gamma$ to 1 for TTS and 1.5 for VC maximizes the pronunciation gain while minimizing the loss in speaker similarity
