[Paper Review] ClariTTS: Feature-ratio Normalization and Duration Stabilization for Code-Mixed Multi-Speaker Speech Synthesis

feVeRin 2024. 10. 9. 10:30

ClariTTS: Feature-ratio Normalization and Duration Stabilization for Code-Mixed Multi-Speaker Speech Synthesis

In text-to-speech models, code-mixed text can yield an unnatural accent because speaker-related features may carry linguistic features of the source language
ClariTTS
- Applies a Feature-ratio Normalized Affine Coupling Layer to a flow-based text-to-speech model
  - Disentangles speaker and linguistic features to keep the target speaker's accent from being carried into the synthesized speech
- Additionally introduces a Duration Stabilization Training Objective to ensure stable duration prediction
Paper (INTERSPEECH 2024) : Paper Link

1. Introduction

End-to-end Text-to-Speech (TTS) models such as FastSpeech2, Glow-TTS, and FastPitch can synthesize human-like speech
- BUT, multi-lingual datasets are hard to collect, so multi-lingual TTS is usually trained on a combination of mono-lingual data
  - Training a multi-lingual TTS model on mono-lingual data causes speaker-language entanglement, where the accent of the source-language speaker carries over into the target language and produces an unnatural accent
- Speaker identity and linguistic information therefore need to be disentangled effectively
  - For example, SANE-TTS uses domain adversarial training, and CrossSpeech splits the acoustic representation into speaker-dependent and speaker-independent parts
- Beyond cross-lingual TTS, code-mixed text, where a single sentence contains two or more languages, can also be considered
  1. Code-mixed TTS typically modifies the encoder structure, uses additional features from a pre-trained model, or enriches the text input through transliteration
  2. BUT, these approaches depend heavily on the quality of the external pre-trained model or of the transliteration

-> ClariTTS is proposed to improve the naturalness of code-mixed TTS by addressing this disentanglement problem

ClariTTS
- First, builds on VITS, a flow-based TTS model, and applies a normalization-based conditioning method to the affine coupling layer
  1. In the training phase, each input is normalized with parameters predicted separately from the speaker and language embeddings, and the speaker-normalized and language-normalized results are then added
    - The ratio at which the two normalized results are added is determined adaptively
  2. As a result, the normalizing flow transforms the speaker/language-dependent data distribution into a speaker/language-independent latent prior distribution
    - Because speaker and language normalization are applied separately, speaker and language features can be explicitly disentangled during training
    - At inference, the affine coupling layer injects the appropriate speaker/language information through denormalization
- Additionally introduces a duration stabilization training objective to resolve speaker-language entanglement and build a robust duration predictor
  1. Specifically, the intra-speaker duration is predicted from the paired input text, speaker embedding, and language embedding in a mini-batch
  2. The speaker embeddings are then randomly shuffled along the batch dimension, and the cross-speaker duration is predicted with the shuffled speaker embeddings
    - The shuffled speaker embedding carries speaker identity/language information different from the paired speaker embedding in the mini-batch

< Overall of ClariTTS >

A code-mixed TTS model that uses feature-ratio normalization $FRN$, feature-ratio denormalization $FRDN$, and a duration stabilization training objective
As a result, it achieves better synthesis quality than existing methods

2. Related Work

Flow-based TTS models such as Glow-TTS and VITS learn a bijective mapping between a simple prior distribution and a complex data distribution through invertible transformations such as the affine coupling layer
- Here, the Speaker-Normalized Affine Coupling Layer (SNAC) can be considered
  1. During training, the input is explicitly normalized with the mean/standard deviation parameters predicted from the speaker embedding $e_s$; at inference, the input is denormalized with the desired speaker embedding
  2. The normalization $g$ and denormalization $g^{-1}$ are:
    (Eq. 1) $g(x; c) = \frac{x - m_\theta(c)}{\exp(v_\theta(c))}$
    (Eq. 2) $g^{-1}(x; c) = x \odot \exp(v_\theta(c)) + m_\theta(c)$
    - $x \in \mathbb{R}^D$ : input, $c$ : condition, $\odot$ : element-wise product
    - $m_\theta, v_\theta$ : simple linear projections that produce the mean/standard deviation from $c$
  3. Following the above, speaker normalization $SN$ and speaker denormalization $SDN$ are obtained
    - e.g.) $SN(x; e_s) = g(x; e_s),\ SDN(x; e_s) = g^{-1}(x; e_s)$
  4. The forward transformation of the SNAC layer is then obtained by applying $SN$ to the affine coupling layer:
    (Eq. 3) $y_{1:d} = x_{1:d}$
    (Eq. 4) $y_{d+1:D} = SN(x_{d+1:D}; e_s) \odot \exp(s_\theta(SN(x_{1:d}; e_s))) + b_\theta(SN(x_{1:d}; e_s))$
  5. The inverse transformation is:
    (Eq. 5) $x_{1:d} = y_{1:d}$
    (Eq. 6) $x_{d+1:D} = SDN\left(\frac{y_{d+1:D} - b_\theta(SN(y_{1:d}; e_s))}{\exp(s_\theta(SN(y_{1:d}; e_s)))}; e_s\right)$
- As a result, this process can be seen as removing speaker information in the forward process and supplying it in the inverse process
  - In particular, SNAC transforms the speaker-dependent data distribution into a speaker-independent prior distribution, so the model can obtain the desired speaker-dependent data distribution through the inverse process
- The paper extends this into a speaker-language conditioning method that adds the inputs normalized by the speaker and language embeddings during training and adds the denormalized inputs at inference
  - That is, $SN, SDN$ are replaced with $FRN, FRDN$ respectively, so that the input is selectively normalized/denormalized by the speaker/language embeddings
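
The SNAC equations (Eq. 1) to (Eq. 6) can be sketched in plain Python as follows. This is a minimal illustration and not the paper's implementation: lists stand in for tensors, `mean` and `log_std` are assumed to be the already-projected $m_\theta(e_s)$ and $v_\theta(e_s)$, `toy_scale_bias` is a hypothetical stand-in for the learned $s_\theta, b_\theta$, and the split point is fixed at $d = D/2$ so the two halves have matching sizes.

```python
import math

def normalize(x, mean, log_std):
    """Eq. 1: g(x; c) = (x - m(c)) / exp(v(c)), element-wise."""
    return [(xi - m) / math.exp(v) for xi, m, v in zip(x, mean, log_std)]

def denormalize(x, mean, log_std):
    """Eq. 2: g^{-1}(x; c) = x * exp(v(c)) + m(c), element-wise."""
    return [xi * math.exp(v) + m for xi, m, v in zip(x, mean, log_std)]

def toy_scale_bias(h):
    """Hypothetical s_theta and b_theta; a real model uses a neural network."""
    s = [0.1 * math.tanh(v) for v in h]
    b = [0.5 * v for v in h]
    return s, b

def snac_forward(x, mean, log_std):
    """Eq. 3-4: keep the first half, transform the speaker-normalized second half."""
    d = len(x) // 2
    x1, x2 = x[:d], x[d:]
    h = normalize(x1, mean[:d], log_std[:d])        # SN(x_{1:d}; e_s)
    s, b = toy_scale_bias(h)
    x2n = normalize(x2, mean[d:], log_std[d:])      # SN(x_{d+1:D}; e_s)
    y2 = [xi * math.exp(si) + bi for xi, si, bi in zip(x2n, s, b)]
    return x1 + y2

def snac_inverse(y, mean, log_std):
    """Eq. 5-6: undo the affine step, then denormalize with the speaker statistics."""
    d = len(y) // 2
    y1, y2 = y[:d], y[d:]
    h = normalize(y1, mean[:d], log_std[:d])        # SN(y_{1:d}; e_s)
    s, b = toy_scale_bias(h)
    z = [(yi - bi) / math.exp(si) for yi, si, bi in zip(y2, s, b)]
    return y1 + denormalize(z, mean[d:], log_std[d:])

x = [0.3, -1.2, 0.8, 2.0]
mean = [0.1, -0.4, 0.2, 0.5]       # assumed m_theta(e_s)
log_std = [0.0, 0.3, -0.2, 0.1]    # assumed v_theta(e_s)

y = snac_forward(x, mean, log_std)
x_rec = snac_inverse(y, mean, log_std)
max_err = max(abs(a - b) for a, b in zip(x, x_rec))
```

With the same statistics the inverse recovers the input up to floating-point error. Passing a different speaker's statistics to `snac_inverse` corresponds to denormalizing with the desired speaker embedding at inference.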

3. Method

VITS, which consists of a text encoder, duration predictor, normalizing flow, and posterior encoder/decoder, is used as the backbone
- First, native characters and a one-hot language ID are used for each language, and a reference encoder extracts speaker-related features from the linear-scale spectrogram
  - The reference encoder output is used as the speaker embedding
- Additionally, the VITS decoder is replaced with the multi-stream iSTFT-VITS decoder for faster inference, and the stochastic duration predictor is replaced with a deterministic duration predictor for more stable inference
- On top of this, ClariTTS introduces the following components:
  1. $FRN, FRDN$ applied in the affine coupling layers of the normalizing flow
  2. A duration stabilization training objective for the duration predictor
- Since ClariTTS aims to disentangle speaker and language features, the speaker/language embeddings are provided only to the normalizing flow and the duration predictor

- Feature-ratio Normalized Affine Coupling Layer

The ClariTTS architecture is built by replacing $SN, SDN$ in SNAC with $FRN, FRDN$
- $FRN, FRDN$ respectively normalize and denormalize the input with the mean/standard deviation parameters obtained from the speaker embedding $e_s$ and the language embedding $e_l$
- Substituting $e_l$ for $c$ in (Eq. 1) and (Eq. 2) gives language normalization $LN$ and language denormalization $LDN$
  1. A shared convolutional neural network $W_r$ then takes the parameters derived from $e_s, e_l$ and produces the feature-ratio $\rho$:
    (Eq. 7) $\rho = \sigma\left(W_r(m_\theta(e_s), v_\theta(e_s)) + W_r(m_\theta(e_l), v_\theta(e_l))\right)$
    - $\sigma$ : sigmoid function
  2. $FRN$ is then:
    (Eq. 8) $FRN(x; e_{s,l}) = \rho\,(SN(x; e_s)) + (1 - \rho)\,(LN(x; e_l))$
    - $FRDN$ is the inverse transformation of $FRN$
  3. As a result, the forward transformation of the affine coupling layer is derived as:
    (Eq. 9) $y_{1:d} = x_{1:d}$
    $y_{d+1:D} = FRN(x_{d+1:D}; e_{s,l}) \odot \exp(s_\theta(FRN(x_{1:d}; e_{s,l}))) + b_\theta(FRN(x_{1:d}; e_{s,l}))$
    - $s_\theta, b_\theta$ : scale and bias functions, respectively, $d < D$
  4. The inverse transformation is:
    (Eq. 10) $x_{1:d} = y_{1:d}$
    $x_{d+1:D} = FRDN\left(\frac{y_{d+1:D} - b_\theta(FRN(y_{1:d}; e_{s,l}))}{\exp(s_\theta(FRN(y_{1:d}; e_{s,l})))}; e_{s,l}\right)$
- Meanwhile, the coupling structure makes the Jacobian a lower triangular matrix
  1. The Jacobian is:
    (Eq. 11) $\frac{\partial y_{d+1:D}}{\partial x_{d+1:D}} = \text{diag}\left(\exp(s_\theta(FRN(x_{1:d}; e_{s,l}))) \odot \frac{\rho \exp(v_\theta(e_l)) + (1 - \rho) \exp(v_\theta(e_s))}{\exp(v_\theta(e_s)) \exp(v_\theta(e_l))}\right)$
    (Eq. 12) $\frac{\partial y}{\partial x} = \begin{bmatrix} I_{d \times d} & 0 \\ \frac{\partial y_{d+1:D}}{\partial x_{1:d}} & \frac{\partial y_{d+1:D}}{\partial x_{d+1:D}} \end{bmatrix}$
    - $I_{d \times d}$ : identity matrix
  2. For simplicity, the paper designs the normalizing flow as a volume-preserving transformation
    - That is, the scale function becomes $\exp(s_\theta(FRN(x_{1:d}; e_{s,l}))) = 1$
  3. The Jacobian log-determinant of the ClariTTS normalizing flow is therefore:
    (Eq. 13) $\log\left|\det\frac{\partial f_\theta(x)}{\partial x}\right| = \sum_j \log \frac{\rho \exp(v_\theta(e_l)_j) + (1 - \rho) \exp(v_\theta(e_s)_j)}{\exp(v_\theta(e_s)_j) \exp(v_\theta(e_l)_j)}$
    - The determinant of the triangular Jacobian in (Eq. 12) is the product of its diagonal entries, so the log-determinant is a sum of per-channel log terms
- Through $FRN$, the affine coupling layer removes speaker and language information in the forward transformation
  1. $FRN$ uses $\rho$ for each hidden channel to adaptively eliminate speaker and language information
    - This lets the normalizing flow transform the data distribution into a speaker/language-independent latent prior distribution
  2. $FRDN$ supplies information through the target speaker/language embeddings during the inverse transformation
    - This transforms the prior distribution into the speaker/language-dependent data distribution
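
A small numeric sketch of the feature-ratio (Eq. 7), $FRN$ (Eq. 8), its inverse, and the per-channel Jacobian term of (Eq. 11) may help here. Several parts are assumptions rather than the paper's code: `toy_w_r` is a hypothetical stand-in for the shared network $W_r$, the statistics `m_s, v_s, m_l, v_l` are assumed to be already projected from $e_s$ and $e_l$, and the closed form in `frdn` is derived here by solving the affine map of (Eq. 8) for $x$, since the text only states that $FRDN$ is the inverse of $FRN$.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def toy_w_r(m, v):
    """Hypothetical stand-in for the shared network W_r (one value per channel)."""
    return [0.7 * mi - 0.3 * vi for mi, vi in zip(m, v)]

def feature_ratio(m_s, v_s, m_l, v_l):
    """Eq. 7: rho = sigmoid(W_r(speaker stats) + W_r(language stats))."""
    return [sigmoid(a + b) for a, b in zip(toy_w_r(m_s, v_s), toy_w_r(m_l, v_l))]

def frn(x, rho, m_s, v_s, m_l, v_l):
    """Eq. 8: rho * SN(x; e_s) + (1 - rho) * LN(x; e_l), element-wise."""
    return [r * (xi - ms) / math.exp(vs) + (1.0 - r) * (xi - ml) / math.exp(vl)
            for xi, r, ms, vs, ml, vl in zip(x, rho, m_s, v_s, m_l, v_l)]

def frn_slope(rho, v_s, v_l):
    """Per-channel d FRN / dx, i.e. the fraction that appears in Eq. 11."""
    return [(r * math.exp(vl) + (1.0 - r) * math.exp(vs)) / (math.exp(vs) * math.exp(vl))
            for r, vs, vl in zip(rho, v_s, v_l)]

def frdn(y, rho, m_s, v_s, m_l, v_l):
    """Inverse of frn: FRN is affine in x, so solve y = slope * x - offset for x."""
    slopes = frn_slope(rho, v_s, v_l)
    return [(yi + r * ms / math.exp(vs) + (1.0 - r) * ml / math.exp(vl)) / a
            for yi, a, r, ms, vs, ml, vl in zip(y, slopes, rho, m_s, v_s, m_l, v_l)]

# Toy per-channel statistics (illustrative values only).
m_s, v_s = [0.1, -0.4, 0.2], [0.0, 0.3, -0.2]   # speaker mean / log-std
m_l, v_l = [0.5, 0.0, -0.1], [0.2, -0.1, 0.4]   # language mean / log-std
x = [0.3, -1.2, 0.8]

rho = feature_ratio(m_s, v_s, m_l, v_l)
y = frn(x, rho, m_s, v_s, m_l, v_l)
x_rec = frdn(y, rho, m_s, v_s, m_l, v_l)
round_trip_err = max(abs(a - b) for a, b in zip(x, x_rec))

# With exp(s_theta) = 1, the Jacobian block is diagonal with these slopes,
# so its log-determinant is the sum of the per-channel log slopes.
slopes = frn_slope(rho, v_s, v_l)
log_det = sum(math.log(a) for a in slopes)
```

At inference, `frdn` would be called with the statistics and ratio computed from the target speaker and language embeddings, which is how the desired identity and language are injected back.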

- Duration Stabilization Training Objectives

Improving cross-lingual TTS performance requires resolving the speaker-language entanglement problem
- In addition to $FRN, FRDN$, ClariTTS therefore introduces a training objective that stabilizes the duration predictor and mitigates speaker-language entanglement
- Specifically, suppose a mini-batch contains paired inputs of the form $(\text{text}, \text{audio}, e_s, e_l)$
  1. The duration predictor then produces the intra-speaker duration $d_{intra}$:
    (Eq. 14) $d_{intra} = W_d(x; e_{s,l})$
    - $W_d$ : duration predictor, $x$ : input text embedding
  2. 이때 mini-batch에서 $e s <math xmlns="http://www.w3.org/1998/Math/MathML"><msub><mi>e</mi><mrow data-mjx-texclass="ORD"><mi>s</mi></mrow></msub></math>$ 를 randomly shuffle 하여 shuffled speaker embedding $ˉ e s = shuffle (e s) <math xmlns="http://www.w3.org/1998/Math/MathML"><msub><mrow data-mjx-texclass="ORD"><mover><mi>e</mi><mo stretchy="false">¯</mo></mover></mrow><mrow data-mjx-texclass="ORD"><mi>s</mi></mrow></msub><mo>=</mo><mtext>shuffle</mtext><mo stretchy="false">(</mo><msub><mi>e</mi><mrow data-mjx-texclass="ORD"><mi>s</mi></mrow></msub><mo stretchy="false">)</mo></math>$ 를 얻을 수 있음
    - $ˉ e s <math xmlns="http://www.w3.org/1998/Math/MathML"><msub><mrow data-mjx-texclass="ORD"><mover><mi>e</mi><mo stretchy="false">¯</mo></mover></mrow><mrow data-mjx-texclass="ORD"><mi>s</mi></mrow></msub></math>$ 를 통해 duration을 생성하는 경우, $ˉ e s <math xmlns="http://www.w3.org/1998/Math/MathML"><msub><mrow data-mjx-texclass="ORD"><mover><mi>e</mi><mo stretchy="false">¯</mo></mover></mrow><mrow data-mjx-texclass="ORD"><mi>s</mi></mrow></msub></math>$ 는 $e s <math xmlns="http://www.w3.org/1998/Math/MathML"><msub><mi>e</mi><mrow data-mjx-texclass="ORD"><mi>s</mi></mrow></msub></math>$ 와 비교하여 other speaker나 language에 대한 information을 포함하고 있으므로 duration predictor는 cross-speaker duration $d c r o s s <math xmlns="http://www.w3.org/1998/Math/MathML"><msub><mi>d</mi><mrow data-mjx-texclass="ORD"><mi>c</mi><mi>r</mi><mi>o</mi><mi>s</mi><mi>s</mi></mrow></msub></math>$ 를 생성한다고 볼 수 있음
  3. The duration stabilization loss is therefore:
    (Eq. 15) $\mathcal{L}_{dur} = \mathcal{L}_{d_{intra}} + \mathcal{L}_{d_{cross}} = \text{MSE}(d_{mas}, d_{intra}) + \text{MSE}(d_{mas}, d_{cross})$
    - $\text{MSE}$ : Mean Squared Error, $d_{mas}$ : duration obtained from monotonic alignment search
- With $\mathcal{L}_{dur}$, the duration predictor learns to extract speaker-related features from the speaker embedding without extracting linguistic features from it
  - As a result, the duration predictor uses the speaker and language embeddings separately, which enables robust cross-lingual TTS
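
The duration stabilization objective in Eq. 14 and Eq. 15 can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the authors' implementation: `toy_duration_predictor` is a hypothetical stand-in for $W_{d}$, embeddings are reduced to scalars, the speaker and language conditions are passed as two separate arguments, and the loss is averaged over the batch. All of these are assumptions made for the example.

```python
import random


def toy_duration_predictor(x, e_s, e_l):
    """Hypothetical stand-in for W_d: returns one duration per input token.

    x   : list of scalar token embeddings (assumption: scalars, not vectors)
    e_s : scalar speaker embedding (assumption)
    e_l : scalar language embedding (assumption)
    """
    return [abs(tok) + 0.5 * e_s + 0.1 * e_l for tok in x]


def mse(target, pred):
    """Mean squared error between two equal-length duration sequences."""
    return sum((t - p) ** 2 for t, p in zip(target, pred)) / len(target)


def shuffle_speakers(spk_embs, rng):
    """Return a randomly permuted copy of the speaker embeddings in a batch."""
    shuffled = list(spk_embs)
    rng.shuffle(shuffled)
    return shuffled


def duration_stabilization_loss(text_embs, spk_embs, lang_embs, d_mas, rng):
    """L_dur = MSE(d_mas, d_intra) + MSE(d_mas, d_cross), averaged over the batch."""
    shuffled = shuffle_speakers(spk_embs, rng)  # \bar{e}_s = shuffle(e_s)
    l_intra, l_cross = 0.0, 0.0
    for x, e_s, e_bar, e_l, target in zip(
        text_embs, spk_embs, shuffled, lang_embs, d_mas
    ):
        d_intra = toy_duration_predictor(x, e_s, e_l)    # original speaker
        d_cross = toy_duration_predictor(x, e_bar, e_l)  # shuffled speaker
        l_intra += mse(target, d_intra)
        l_cross += mse(target, d_cross)
    n = len(text_embs)
    return l_intra / n + l_cross / n


if __name__ == "__main__":
    rng = random.Random(0)
    text_embs = [[1.0, 2.0], [0.5, 1.5]]   # two utterances, two tokens each
    spk_embs = [0.2, 0.8]                   # one speaker embedding per utterance
    lang_embs = [0.0, 1.0]                  # one language embedding per utterance
    d_mas = [[1.0, 2.0], [1.0, 2.0]]        # alignment-derived target durations
    print(duration_stabilization_loss(text_embs, spk_embs, lang_embs, d_mas, rng))
```

Both terms regress toward the same target $d_{mas}$, so the predictor is penalized whenever swapping in another utterance's speaker embedding changes the predicted durations. In a real model the shuffle would be a batch-level permutation of embedding tensors rather than a Python list.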

4. Experiments

- Settings

Dataset : AIHub Multi-Speaker Dataset, LibriTTS
Comparisons : MS-iSTFT-VITS, YourTTS, SANE-TTS

- Results

ClariTTS achieves the best performance in all three settings: intra-lingual, cross-lingual, and code-mixed

In terms of parameter count, ClariTTS is the most efficient of the compared models
