[Paper Review] CrossSpeech: Speaker-Independent Acoustic Representation for Cross-Lingual Speech Synthesis

feVeRin 2024. 5. 27. 10:14

CrossSpeech: Speaker-Independent Acoustic Representation for Cross-Lingual Speech Synthesis

Cross-lingual Text-to-Speech quality still lags behind intra-lingual quality
CrossSpeech
- Improves cross-lingual text-to-speech quality by disentangling speaker and language information at the level of the acoustic feature space
- To this end, it introduces a Speaker-Independent Generator and a Speaker-Dependent Generator and processes each kind of information separately, yielding disentangled speaker and language representations
  - The Speaker-Independent Generator produces a speaker-independent acoustic representation that is not biased toward any specific speaker distribution
  - The Speaker-Dependent Generator models the speaker-dependent variation that characterizes speaker attributes
Paper (ICASSP 2023) : Paper Link

1. Introduction

Text-to-Speech (TTS) is generally designed for intra-lingual applications
- As a result, the quality of cross-lingual TTS, which synthesizes natural speech in a target language for a source speaker who speaks a different language, still lags behind that of intra-lingual TTS
- In particular, the quality degradation in cross-lingual TTS stems from speaker-language entanglement
  1. In the training set, each source speaker speaks only one source language, so speaker identity becomes dependent on linguistic information
  2. Consequently, when the source language representation is replaced with a target language representation, speaker identity is hard to preserve
- Two broad approaches can mitigate speaker-language entanglement and improve cross-lingual TTS
  1. Using a language-agnostic text representation that can be shared across multiple languages
  2. Learning disentangled speaker and language information
    - BUT, the speaker/linguistic decomposition used in these approaches is limited to the input token level
    - That is, the representations are separated in the input token space, but they are recombined at the decoder input to generate the acoustic representation, so speaker-language entanglement reappears

-> The paper therefore proposes CrossSpeech, which disentangles speaker and language information at the decoder output frame level

CrossSpeech
- Introduces a Speaker-Independent Generator (SIG) and a Speaker-Dependent Generator (SDG) to improve cross-lingual TTS
- SIG produces the speaker-independent acoustic representation using mix-dynamic speaker layer normalization, a speaker generalization loss, and a speaker-independent pitch predictor
- SDG models the speaker-dependent acoustic representation using dynamic speaker layer normalization and a speaker-dependent pitch predictor

< Overview of CrossSpeech >

Introduces SIG and SDG to disentangle speaker and language information in cross-lingual TTS
As a result, achieves strong results in terms of speaker similarity and improves the quality of cross-lingual TTS

2. Model Architecture

CrossSpeech is built on FastPitch with an online aligner
- The online aligner enables efficient training and removes the dependency on a pre-calculated aligner for each language, which makes it useful for extending cross-lingual TTS to new languages
- Meanwhile, to avoid speaker-language entanglement in cross-lingual TTS, the generation pipeline can be split into a speaker-independent module and a speaker-dependent module
- Accordingly, CrossSpeech introduces SIG and SDG, which model the speaker-independent and speaker-dependent representations, respectively
  1. SIG uses mix-dynamic speaker layer normalization (M-DSLN), a speaker-independent pitch (SIP) predictor, and a speaker generalization loss ($\mathcal{L}_{sgr}$)
    - This produces a speaker-independent acoustic representation $\mathbf{h}_{si}$ that is not biased toward any particular speaker distribution
  2. SDG models the speaker-dependent representation $\mathbf{h}_{sd}$ through dynamic speaker layer normalization (DSLN) and a speaker-dependent pitch (SDP) predictor
- Additionally, CrossSpeech provides $\mathbf{h}_{si}$ to the predicted mel-spectrogram through a single projection layer

3. Speaker-Independent Generator

The paper introduces SIG, which comprises M-DSLN and the speaker generalization loss, to produce a generalizable speaker-independent representation
- Additionally, to learn speaker-independent prosodic variation, it adds an SIP predictor that improves the naturalness and pitch accuracy of the synthesized cross-lingual speech

- Speaker Generalization

DSLN adaptively modulates hidden features based on the speaker embedding, instead of using simple summation or concatenation
- First, given a hidden representation $\mathbf{h}$ and a speaker embedding $\mathbf{e}_s$, the speaker-conditioned representation is:
  (Eq. 1) $\text{DSLN}(\mathbf{h}, \mathbf{e}_s) = \mathbf{W}(\mathbf{e}_s) \otimes \text{LN}(\mathbf{h}) + \mathbf{b}(\mathbf{e}_s)$
  - $\otimes$ : 1D convolution, $\text{LN}$ : layer normalization
  - The filter weight $\mathbf{W}(\mathbf{e}_s)$ and bias $\mathbf{b}(\mathbf{e}_s)$ are predicted by a single linear layer that takes $\mathbf{e}_s$ as input
- Next, following GenerSpeech, DSLN is extended to mix-DSLN (M-DSLN)
  1. M-DSLN prevents the text encoding from being biased toward specific speaker attributes and helps ensure generalization capability
  2. To do so, the filter weight and bias are mixed as follows:
    (Eq. 2) $\mathbf{W}_{mix}(\mathbf{e}_s) = \gamma \mathbf{W}(\mathbf{e}_s) + (1 - \gamma) \mathbf{W}(\tilde{\mathbf{e}}_s)$
    (Eq. 3) $\mathbf{b}_{mix}(\mathbf{e}_s) = \gamma \mathbf{b}(\mathbf{e}_s) + (1 - \gamma) \mathbf{b}(\tilde{\mathbf{e}}_s)$
    - $\tilde{\mathbf{e}}_s$ : obtained by randomly shuffling $\mathbf{e}_s$ along the batch dimension
    - $\gamma$ : sampled from the Beta distribution $\gamma \sim \text{Beta}(\alpha, \alpha)$ (the paper sets $\alpha = 2$)
  3. Based on this mixed speaker information, M-DSLN is:
    (Eq. 4) $\text{M-DSLN}(\mathbf{h}_t, \mathbf{e}_s) = \mathbf{W}_{mix}(\mathbf{e}_s) \otimes \text{LN}(\mathbf{h}_t) + \mathbf{b}_{mix}(\mathbf{e}_s)$
    - $\mathbf{h}_t$ : hidden text representation produced by the text encoder
- Meanwhile, to further improve generalization, a speaker generalization loss $\mathcal{L}_{sgr}$ based on the Kullback-Leibler (KL) divergence is introduced
  1. This loss enforces consistency between the text encoding conditioned on the mixed speaker information and the one conditioned on the original speaker information:
    (Eq. 5) $\mathcal{L}_{sgr}^{o2m} = \text{KL}(\text{DSLN}(\mathbf{h}_t, \mathbf{e}_s) \, || \, \text{M-DSLN}(\mathbf{h}_t, \mathbf{e}_s))$
    (Eq. 6) $\mathcal{L}_{sgr}^{m2o} = \text{KL}(\text{M-DSLN}(\mathbf{h}_t, \mathbf{e}_s) \, || \, \text{DSLN}(\mathbf{h}_t, \mathbf{e}_s))$
  2. As a result, $\mathcal{L}_{sgr} = \mathcal{L}_{sgr}^{o2m} + \mathcal{L}_{sgr}^{m2o}$
- By adopting M-DSLN and the speaker generalization loss, CrossSpeech can detach speaker-dependent information from the linguistic representation
  - Additionally, the SIP and duration predictors described next are used to predict speaker-independent variation
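
Eq. 1 to Eq. 6 can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the 1D convolution is taken to be depthwise with kernel size 3, the hidden features are turned into distributions with a softmax over the channel axis before computing the KL terms, and all names (`predict_filter`, `proj_w`, `proj_b`, the tensor shapes) are hypothetical.

```python
import numpy as np

def layer_norm(h, eps=1e-5):
    # LN in Eq. 1: normalize each time step over the channel axis.
    mean = h.mean(axis=-1, keepdims=True)
    var = h.var(axis=-1, keepdims=True)
    return (h - mean) / np.sqrt(var + eps)

def predict_filter(e_s, proj_w, proj_b, channels, kernel):
    # Single linear layer mapping the speaker embedding to W(e_s) and b(e_s).
    out = e_s @ proj_w + proj_b                       # (B, C*K + C)
    W = out[:, :channels * kernel].reshape(-1, channels, kernel)
    b = out[:, channels * kernel:]                    # (B, C)
    return W, b

def depthwise_conv1d(x, W):
    # x: (B, T, C), W: (B, C, K). Zero "same" padding, odd K assumed.
    # Depthwise convolution is an assumption made for this sketch.
    T, K = x.shape[1], W.shape[-1]
    pad = K // 2
    xp = np.pad(x, ((0, 0), (pad, pad), (0, 0)))
    out = np.zeros_like(x)
    for k in range(K):
        out += xp[:, k:k + T, :] * W[:, None, :, k]
    return out

def dsln(h, W, b):
    # Eq. 1: W(e_s) conv LN(h) + b(e_s)
    return depthwise_conv1d(layer_norm(h), W) + b[:, None, :]

def m_dsln(h_t, W, b, perm, gamma):
    # Eq. 2-4: W[perm] equals W applied to the batch-shuffled embeddings.
    W_mix = gamma * W + (1.0 - gamma) * W[perm]
    b_mix = gamma * b + (1.0 - gamma) * b[perm]
    return dsln(h_t, W_mix, b_mix)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def kl(p, q, eps=1e-8):
    # KL(p || q), averaged over batch and time.
    return np.mean(np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1))

def speaker_generalization_loss(h_t, W, b, perm, gamma):
    # Eq. 5 + Eq. 6; the softmax normalization is an assumption.
    p = softmax(dsln(h_t, W, b))
    q = softmax(m_dsln(h_t, W, b, perm, gamma))
    return kl(p, q) + kl(q, p)

rng = np.random.default_rng(0)
B, T, C, K, D = 4, 6, 8, 3, 5                        # hypothetical sizes
h_t = rng.normal(size=(B, T, C))                     # text encoder output
e_s = rng.normal(size=(B, D))                        # speaker embeddings
proj_w = rng.normal(size=(D, C * K + C)) * 0.1
proj_b = np.zeros(C * K + C)
W, b = predict_filter(e_s, proj_w, proj_b, C, K)
perm = rng.permutation(B)                            # shuffle along the batch
gamma = rng.beta(2.0, 2.0)                           # alpha = 2
loss = speaker_generalization_loss(h_t, W, b, perm, gamma)
```

With $\gamma = 1$ the mixed filter reduces to the original one, so M-DSLN coincides with DSLN and the loss is zero; smaller $\gamma$ pulls the conditioning toward another speaker in the batch.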

- Speaker-Independent Pitch Predictor

Cross-lingual TTS struggles to predict speech variation because the speaker-language combinations are unseen during training
- To address this, CrossSpeech introduces an SIP predictor that predicts text-related pitch variation, an attribute common to multiple speakers
  - That is, taking the M-DSLN output as input, the SIP predictor predicts a binary pitch contour sequence that indicates the rise/fall of the pitch value
- To train the SIP predictor, the ground-truth pitch value is first extracted for every frame
  1. Since the ground-truth pitch is a speaker-dependent value, the ground-truth pitch sequence is denoted as the speaker-dependent pitch sequence $\mathbf{p}^{(d)}$
  2. Next, because the SIP predictor operates on pitch values at the input token level, $\mathbf{p}^{(d)}$ is averaged over every input token using the ground-truth durations
  3. Finally, the averaged $\mathbf{p}^{(d)}$ is converted into the following binary sequence to obtain the speaker-independent target pitch sequence $\mathbf{p}^{(i)}$:
    (Eq. 7) $p_n^{(i)} = \begin{cases} 1, & \bar{p}_{n-1}^{(d)} < \bar{p}_n^{(d)} \\ 0, & \text{otherwise} \end{cases}$
    - $\bar{p}_n^{(d)}$ : $n$-th value of the averaged $\mathbf{p}^{(d)}$, $p_n^{(i)}$ : $n$-th value of $\mathbf{p}^{(i)}$
    - $n \in \{1, 2, 3, ..., N\}$, where $N$ is the input token length
  4. Binary cross-entropy is used to optimize the SIP predictor:
    (Eq. 8) $\mathcal{L}_{sip} = -\sum_{n}^{N} [p_n^{(i)} \log \hat{p}_n^{(i)} + (1 - p_n^{(i)}) \log (1 - \hat{p}_n^{(i)})]$
    - $\hat{p}_n^{(i)}$ : $n$-th value of the predicted speaker-independent pitch
- The speaker-independent pitch sequence passes through a 1D convolution layer and is then added to the hidden sequence
  1. The resulting sum is upsampled according to the token durations and passed to a Feed-Forward Transformer (FFT) decoder, which converts the upsampled hidden sequence into the speaker-independent acoustic representation $\mathbf{h}_{si}$
  2. Because CrossSpeech's duration predictor takes the speaker-generalized representation as input, it can learn general duration information
    - This makes it possible to predict token durations that are independent of speaker identity and to stabilize duration prediction in cross-lingual TTS
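
The target construction of Eq. 7, the loss of Eq. 8, and the duration-based upsampling can be sketched in plain Python as below. This is only an illustration: the function names and the toy pitch values are hypothetical, and assigning 0 to the first token (which has no predecessor in Eq. 7) is an assumption.

```python
import math

def average_pitch_per_token(frame_pitch, durations):
    # Average the frame-level pitch over the frames aligned to each token.
    averaged, start = [], 0
    for d in durations:
        frames = frame_pitch[start:start + d]
        averaged.append(sum(frames) / len(frames) if frames else 0.0)
        start += d
    return averaged

def binary_pitch_contour(avg_pitch):
    # Eq. 7: 1 where the averaged pitch rises from the previous token, else 0.
    # The first token has no predecessor; using 0 there is an assumption.
    targets = [0]
    for prev, cur in zip(avg_pitch, avg_pitch[1:]):
        targets.append(1 if prev < cur else 0)
    return targets

def sip_loss(targets, probs, eps=1e-7):
    # Eq. 8: binary cross-entropy summed over the N input tokens.
    total = 0.0
    for p, p_hat in zip(targets, probs):
        p_hat = min(max(p_hat, eps), 1.0 - eps)   # avoid log(0)
        total -= p * math.log(p_hat) + (1 - p) * math.log(1.0 - p_hat)
    return total

def upsample(token_seq, durations):
    # Length regulation: repeat each token-level value for its duration.
    return [v for v, d in zip(token_seq, durations) for _ in range(d)]

# Toy example: 9 frames aligned to 4 tokens (hypothetical values, in Hz).
frame_pitch = [100.0, 110.0, 120.0, 130.0, 140.0, 90.0, 150.0, 150.0, 150.0]
durations = [2, 3, 1, 3]
avg = average_pitch_per_token(frame_pitch, durations)
targets = binary_pitch_contour(avg)
frame_level = upsample(targets, durations)
```

Because only the direction of the pitch change is kept, two speakers with very different pitch ranges but the same intonation pattern map to the same target sequence, which is what makes the target speaker-independent.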

4. Speaker-Dependent Generator

SDG, composed of DSLN and an SDP predictor, is designed to model speaker-dependent attributes
- First, DSLN takes the speaker embedding $\mathbf{e}_s$ and the speaker-independent acoustic representation $\mathbf{h}_{si}$ as input and produces speaker-adapted hidden features
  - The SDP predictor then uses these speaker-adapted hidden features to produce a frame-level speaker-dependent pitch embedding
- To this end, the speaker-dependent target pitch sequence $\mathbf{p}^{(d)}$ is extracted and the SDP predictor is optimized with an MSE loss:
  (Eq. 9) $\mathcal{L}_{sdp} = ||\mathbf{p}^{(d)} - \hat{\mathbf{p}}^{(d)}||_2$
  - $\hat{\mathbf{p}}^{(d)}$ : predicted speaker-dependent pitch sequence
  - The speaker-dependent pitch sequence passes through a 1D convolution layer and is summed with the hidden sequence, and an FFT decoder then produces the speaker-dependent acoustic representation $\mathbf{h}_{sd}$ from the adapted hidden sequence
- As a result, the overall training objective $\mathcal{L}_{tot}$ is:
  (Eq. 10) $\mathcal{L}_{tot} = \mathcal{L}_{rec} + \mathcal{L}_{align} + \lambda_{dur} \mathcal{L}_{dur} + \lambda_{sgr} \mathcal{L}_{sgr} + \lambda_{sip} \mathcal{L}_{sip} + \lambda_{sdp} \mathcal{L}_{sdp}$
  - $\mathcal{L}_{rec}$ : MSE loss between the target and predicted mel-spectrograms
  - $\mathcal{L}_{align}$ : alignment loss of the online aligner
  - $\mathcal{L}_{dur}$ : MSE loss between the target and predicted durations
  - The paper sets $\lambda_{dur} = \lambda_{sgr} = \lambda_{sip} = \lambda_{sdp} = 0.1$
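
As a small worked example, the weighted sum in Eq. 10 can be written as below. The individual loss values are made-up numbers chosen for illustration, the function and key names are hypothetical, and computing the SDP term as the mean of squared frame-level differences is an assumption about how the MSE is reduced.

```python
def sdp_loss(target_pitch, predicted_pitch):
    # MSE between target and predicted frame-level pitch (mean reduction assumed).
    diffs = [(t - p) ** 2 for t, p in zip(target_pitch, predicted_pitch)]
    return sum(diffs) / len(diffs)

def total_loss(losses, weights=None):
    # Eq. 10 with the paper's setting: every lambda equals 0.1 by default.
    w = {"dur": 0.1, "sgr": 0.1, "sip": 0.1, "sdp": 0.1}
    w.update(weights or {})
    return (losses["rec"] + losses["align"]
            + w["dur"] * losses["dur"] + w["sgr"] * losses["sgr"]
            + w["sip"] * losses["sip"] + w["sdp"] * losses["sdp"])

# Made-up loss values, purely for illustration.
example_losses = {
    "rec": 1.0,
    "align": 2.0,
    "dur": 3.0,
    "sgr": 4.0,
    "sip": 5.0,
    "sdp": sdp_loss([100.0, 120.0], [98.0, 124.0]),
}
example_total = total_loss(example_losses)
```

The reconstruction and alignment terms enter unweighted, so with these weights the four auxiliary terms act as regularizers rather than dominating the objective.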

5. Experiments

- Settings

Dataset : LJSpeech, English/Chinese/Korean dataset (internal)
Comparisons : FastPitch, SANE-TTS, Cross-Lingual TTS, Multi-Lingual TTS

- Results

Quality Comparison
- CrossSpeech achieves the best performance in the cross-lingual setting
- In the intra-lingual setting, CrossSpeech also reaches synthesis quality comparable to existing models without a significant drop

Ablation Study
- In the ablation study, removing each component of CrossSpeech degrades performance
- That is, the proposed components are effective for improving CrossSpeech's performance

Acoustic Feature Space
- To examine speaker generalization capability, the $t$-SNE results are inspected for (a) the projected speaker-independent acoustic representation and (b) the final mel-spectrogram
- In (a), the embeddings are not clustered by speaker but are spread randomly
  - This indicates that the speaker-independent representation is not biased toward speaker-related information and contains only text-related variation
- In (b), the embeddings are well clustered by speaker
  - This indicates that CrossSpeech can effectively learn speaker-dependent attributes through SDG

$t$-SNE results for the Acoustic Feature Space
