[Paper Review] Multi-SpectroGAN: High-Diversity and High-Fidelity Spectrogram Generation with Adversarial Style Combination for Speech Synthesis


feVeRin 2024. 7. 28. 11:12

In text-to-speech, adversarial feedback alone is generally not sufficient to train the generator, so an additional reconstruction loss is required
Multi-SpectroGAN
- Conditions the conditional discriminator on the generator's self-supervised hidden representation, so the model can be trained with adversarial feedback alone
- Additionally introduces Adversarial Style Combination (ASC), which learns combined style embeddings from multiple mel-spectrograms, to generalize to unseen styles
Paper (AAAI 2021) : Paper Link

1. Introduction

Text-to-Speech (TTS) struggles with style control/transfer when a large amount of high-quality text-audio data is not available
- Knowledge distillation, as in FastSpeech, can be used for this, but it makes the training pipeline complex
- Meanwhile, Generative Adversarial Network (GAN)-based TTS can substantially improve speech quality through adversarial feedback
  - For example, EATS generates raw waveforms from input phonemes and is trained end-to-end with adversarial feedback and a prediction loss
- BUT, adversarial training still requires the TTS model to compute an additional prediction loss

-> The paper therefore proposes Multi-SpectroGAN, which is trained with adversarial loss alone

Multi-SpectroGAN
- Introduces an end-to-end learned frame-level condition and a conditional discriminator to remove the dependence on the prediction loss
  - The discriminator uses the frame-level condition to learn to distinguish the features that are converted into the mel-spectrogram, which pushes the generator toward high-fidelity mel-spectrograms
- Applies Adversarial Style Combination, which learns latent representations of mel-spectrograms synthesized with mixed speaker embeddings
  - With adversarial feedback on mixed-style mel-spectrograms, Multi-SpectroGAN can interpolate multiple styles and synthesize natural audio for unseen speakers

< Overview of Multi-SpectroGAN >

Synthesizes mel-spectrograms without a prediction loss through the end-to-end learned frame-level condition and the conditional discriminator
Learns mixed styles of mel-spectrograms through Adversarial Style Combination
As a result, achieves better synthesis quality than prior work

2. Method

Multi-SpectroGAN aims to build a generator that can synthesize high-diversity mel-spectrograms by mixing and controlling speaking styles
- To this end, it introduces Adversarial Style Combination, which learns latent representations of combined speaker embeddings from multiple mel-spectrograms
- To learn randomly mixed styles that have no ground truth, it uses an end-to-end learned frame-level conditional discriminator

- Generator

Multi-SpectroGAN is based on FastSpeech2 and uses a variance adaptor $f(\cdot, \cdot)$ and a decoder $g(\cdot)$
- Structurally, the phoneme encoder and the decoder each consist of 4 Feed-Forward Transformer (FFT) blocks
- It is then extended to a multi-speaker model with a style encoder that produces a fixed-dimensional style vector from a mel-spectrogram
Style Encoder
- The style encoder is a 2D convolutional network with $3 \times 1$ filters and $2 \times 2$ stride, dropout, ReLU activation, and Layer Normalization
  - A Gated Recurrent Unit layer additionally compresses the final output into a single style vector
- Before conditioning the length regulator and the variance adaptor, the output is projected to the same dimension as the phoneme encoder output and passed through a tanh activation, so that style information can be added
- Writing the style encoder as $E_{s}(\cdot)$, the resulting style embedding is:
  (Eq. 1) $\mathbf{s} = E_{s}(\mathbf{y})$
  - $\mathbf{s}$ : style embedding extracted from the mel-spectrogram $\mathbf{y}$ by the style encoder $E_{s}$
Style-Conditioned Variance Adaptor
- Multi-SpectroGAN uses the variance adaptor of FastSpeech2 to add variance information
- By adding the style embedding predicted from the mel-spectrogram to the phoneme hidden sequence $\mathcal{H}_{pho}$, the variance adaptor can predict variance information in each speaker's unique style
  1. First, let the phoneme-side FFT network be the phoneme encoder $E_{p}(\cdot)$ that produces the phoneme hidden representation:
    (Eq. 2) $\mathcal{H}_{pho} = E_{p}(\mathbf{x} + \text{PE}(\cdot))$
    - $\mathbf{x}$ : phoneme embedding sequence, $\text{PE}(\cdot)$ : triangle positional embedding
  2. The target duration sequence $\mathcal{D}$ is extracted from Tacotron2 and used to map the length of the phoneme hidden sequence to the length of the mel-spectrogram:
    (Eq. 3) $\mathcal{H}_{mel} = \mathcal{LR}(\mathcal{H}_{pho}, \mathcal{D})$
  3. The duration predictor predicts the log-scale length and is trained with Mean-Square Error (MSE):
    (Eq. 4) $\mathcal{L}_{duration} = \mathbb{E}[\| \log(\mathcal{D} + 1) - \hat{\mathcal{D}} \|_{2}]$
    (Eq. 5) $\hat{\mathcal{D}} = \text{DurationPredictor}(\mathcal{H}_{pho}, \mathbf{s})$
- Additionally, a target pitch sequence $\mathcal{P}$ and a target energy sequence $\mathcal{E}$ are used for each mel-spectrogram frame
  1. Outliers of each are removed and normalized values are used
  2. The embeddings $\mathbf{p}, \mathbf{e}$ of the $F0$ and energy sequences, each quantized into 256 values, are then added:
    (Eq. 6) $\mathbf{p} = \text{PitchEmbedding}(\mathcal{P}), \,\, \mathbf{e} = \text{EnergyEmbedding}(\mathcal{E})$
  3. The pitch/energy predictors predict the normalized $F0$/energy values and are trained with the MSE between the ground-truth $\mathcal{P}, \mathcal{E}$ and the predicted $\hat{\mathcal{P}}, \hat{\mathcal{E}}$:
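    (Eq. 7) $\mathcal{L}_{pitch} = \mathbb{E}[\| \mathcal{P} - \hat{\mathcal{P}} \|_{2}], \,\, \mathcal{L}_{energy} = \mathbb{E}[\| \mathcal{E} - \hat{\mathcal{E}} \|_{2}]$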
    (Eq. 8) $\hat{\mathcal{P}} = \text{PitchPredictor}(\mathcal{H}_{mel}, \mathbf{s}), \,\, \hat{\mathcal{E}} = \text{EnergyPredictor}(\mathcal{H}_{mel}, \mathbf{s})$
- The encoder $f(\cdot, \cdot)$ consists of the phoneme encoder and the style-conditional variance adaptor, and is trained with the variance prediction loss:
  (Eq. 9) $\min_{f} \mathcal{L}_{var} = \mathcal{L}_{duration} + \mathcal{L}_{pitch} + \mathcal{L}_{energy}$
- During training, the paper uses not only the ground-truth value of each variance but also the values predicted through adversarial style combination, in order to learn diverse mel-spectrograms
  1. The sum of the informational hidden sequences, $\mathcal{H}_{total}$, is passed to the decoder $g(\cdot)$ to generate the mel-spectrogram:
    (Eq. 10) $\mathcal{H}_{total} = \mathcal{H}_{mel} + \mathbf{s} + \mathbf{p} + \mathbf{e} + \text{PE}(\cdot)$
    (Eq. 11) $\hat{\mathbf{y}} = g(\mathcal{H}_{total})$
    - $\hat{\mathbf{y}}$ : predicted mel-spectrogram
  2. The baseline model, in contrast, uses the following reconstruction loss based on Mean-Absolute Error (MAE):
    (Eq. 12) $\mathcal{L}_{rec} = \mathbb{E}[\| \mathbf{y} - \hat{\mathbf{y}} \|_{1}]$
    - $\mathbf{y}$ : ground-truth mel-spectrogram
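
The variance adaptor steps above (Eq. 3, 4, 6, 10) can be sketched in plain Python. This is only an illustrative sketch, not the paper's implementation: hidden vectors are plain lists, the function names are hypothetical, Eq. 4 is read as a mean squared error, and the quantization range `[-3, 3]` with linearly spaced bins is an assumption that the text does not specify.

```python
import math

def length_regulate(h_pho, durations):
    """Eq. 3: repeat each phoneme vector d times so the sequence matches mel length."""
    h_mel = []
    for vec, d in zip(h_pho, durations):
        h_mel.extend(list(vec) for _ in range(d))
    return h_mel

def duration_loss(durations, predicted_log):
    """Eq. 4, read here as a mean squared error against log(D + 1)."""
    errors = [(math.log(d + 1) - p) ** 2 for d, p in zip(durations, predicted_log)]
    return sum(errors) / len(errors)

def quantize(value, low=-3.0, high=3.0, n_bins=256):
    """Bucket a normalized pitch/energy value into one of n_bins indices.
    The [low, high] range and the linear spacing are assumptions of this sketch."""
    clipped = min(max(value, low), high)
    index = int((clipped - low) / (high - low) * n_bins)
    return min(index, n_bins - 1)

def total_hidden(h_mel, s, p, e, pe):
    """Eq. 10: frame-wise H_mel + s + p + e + PE (s is shared by all frames)."""
    return [
        [h + s_k + p_k + e_k + pe_k
         for h, s_k, p_k, e_k, pe_k in zip(h_t, s, p_t, e_t, pe_t)]
        for h_t, p_t, e_t, pe_t in zip(h_mel, p, e, pe)
    ]

# Toy example: 2 phonemes, hidden size 2, durations of 2 and 3 frames.
h_pho = [[1.0, 0.0], [0.0, 1.0]]
durations = [2, 3]
h_mel = length_regulate(h_pho, durations)        # 5 frames
style = [0.5, 0.5]                               # stand-in for s
pitch_emb = [[0.25, 0.25]] * len(h_mel)          # stand-in for p
energy_emb = [[0.0, 0.0]] * len(h_mel)           # stand-in for e
pos_emb = [[0.0, 0.0]] * len(h_mel)              # stand-in for PE
h_total = total_hidden(h_mel, style, pitch_emb, energy_emb, pos_emb)
```

In a real model the pitch/energy embeddings would be looked up from the quantized indices in learned embedding tables; here they are passed in directly to keep the example short.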

- Discriminator

Unlike previous GAN-based TTS, Multi-SpectroGAN is trained to synthesize mel-spectrograms from the text sequence without computing a loss directly against the ground-truth spectrogram
- To train Multi-SpectroGAN without $\mathcal{L}_{rec}$, the paper uses an end-to-end learned frame-level condition and a frame-level conditional discriminator
End-to-End Learned Frame Level Condition
- To distinguish real from generated mel-spectrograms at the frame level, the discriminator uses the encoder output learned in the generator during training as the frame-level condition
- Here $\mathbf{c}$ is the sum of the linguistic, style, pitch, and energy information learned in the generator:
  (Eq. 13) $\mathbf{c} = \underbrace{\mathcal{H}_{mel}}_{\text{linguistic}} + \underbrace{\mathbf{s}}_{\text{style}} + \underbrace{\mathbf{p}}_{\text{pitch}} + \underbrace{\mathbf{e}}_{\text{energy}}$
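
Putting Eq. 13 next to Eq. 10 and Eq. 11 makes the relationship explicit: under the definitions above, the condition is the decoder input without the positional embedding, so the discriminator sees the same hidden sequence that the decoder uses to produce $\hat{\mathbf{y}}$.

```latex
\mathcal{H}_{total}
  = \underbrace{\mathcal{H}_{mel} + \mathbf{s} + \mathbf{p} + \mathbf{e}}_{\mathbf{c}}
  + \text{PE}(\cdot)
\quad\Longrightarrow\quad
\hat{\mathbf{y}} = g\left(\mathbf{c} + \text{PE}(\cdot)\right)
```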
Frame-Level Conditional Discriminator
- Structurally, the paper adopts a multi-scale discriminator similar to MelGAN
  - The goal is to learn features of the linguistic, pitch, and energy information over different ranges
- Each discriminator consists of 4 Dblocks, each with a mel-spectrogram-side block and a condition-side block
  1. Each block uses a 2-layer non-strided 1D convolutional network with Leaky ReLU activation to extract adjacent-frame information
  2. The hidden representation of the condition-side block is then added to the mel-spectrogram-side hidden representation, and a residual connection and layer normalization are applied to each block output
- Training of Multi-SpectroGAN follows the Least-Squares GAN formulation
  1. That is, the discriminator $D_{k}$ distinguishes the real spectrogram $\mathbf{y}$ from the spectrogram reconstructed from $\mathbf{x}, \mathbf{y}$
  2. As a result, the encoder $f(\cdot, \cdot)$, decoder $g(\cdot)$, and discriminator $D$ are optimized with the following losses:
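    (Eq. 14) $\min_{D_{k}} \mathbb{E}[\| D_{k}(\mathbf{y}, \mathbf{c}) - 1 \|_{2} + \| D_{k}(\hat{\mathbf{y}}, \mathbf{c}) \|_{2}], \,\, \forall k = 1, 2, 3$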
    (Eq. 15) $\mathcal{L}_{adv} = \mathbb{E}\left[\sum_{k=1}^{3} \| D_{k}(\hat{\mathbf{y}}, \mathbf{c}) - 1 \|_{2}\right]$
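
A minimal sketch of the least-squares adversarial terms, assuming the usual Least-Squares GAN targets (real frames toward 1, generated frames toward 0 for the discriminator, generated frames toward 1 for the generator). The function names are hypothetical, discriminator outputs are plain lists of per-frame scores, and the norm in Eq. 15 is implemented as a mean squared error, which is an assumption of this sketch.

```python
def lsgan_d_loss(real_scores, fake_scores):
    """Discriminator side for one D_k: real scores toward 1, generated toward 0."""
    real_term = sum((r - 1.0) ** 2 for r in real_scores) / len(real_scores)
    fake_term = sum(f ** 2 for f in fake_scores) / len(fake_scores)
    return real_term + fake_term

def lsgan_g_loss(fake_scores_per_scale):
    """Eq. 15: sum over the discriminators of (D_k(y_hat, c) - 1)^2."""
    total = 0.0
    for scores in fake_scores_per_scale:
        total += sum((f - 1.0) ** 2 for f in scores) / len(scores)
    return total

# Three scales, as in the sum over k = 1..3.
perfect_fool = [[1.0, 1.0], [1.0], [1.0]]   # every D_k scores the fake as real
fully_caught = [[0.0, 0.0], [0.0], [0.0]]   # every D_k scores the fake as fake
g_best = lsgan_g_loss(perfect_fool)
g_worst = lsgan_g_loss(fully_caught)
d_best = lsgan_d_loss([1.0, 1.0], [0.0, 0.0])
```

The generator term is smallest when every scale scores the generated frames as real, which is what replaces the reconstruction loss as the training signal.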
Feature Matching
- A feature matching objective is additionally introduced to improve the representations learned by the discriminator
- Unlike MelGAN, which minimizes the MAE between discriminator feature maps of real and generated audio, the paper minimizes the MAE between the feature maps of each spectrogram-side block:
  (Eq. 16) $\mathcal{L}_{fm} = \mathbb{E}\left[\sum_{i=1}^{4} \frac{1}{N_{i}} \| D_{k}^{(i)}(\mathbf{y}, \mathbf{c}) - D_{k}^{(i)}(\hat{\mathbf{y}}, \mathbf{c}) \|_{1}\right]$
  - $D_{k}^{(i)}$ : output of the $i$-th spectrogram-side block of the $k$-th discriminator
  - $N_{i}$ : number of units in each block output
- The generator is then trained with the following objective:
  (Eq. 17) $\min_{f, g} \mathcal{L}_{msg} = \mathcal{L}_{adv} + \lambda \mathcal{L}_{fm} + \mu \mathcal{L}_{var}$
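
The feature matching term (Eq. 16) and the weighted generator objective (Eq. 17) can be sketched as follows for a single discriminator. Feature maps are flattened into plain lists, the function names are hypothetical, and the weights passed for `lam` and `mu` are placeholder values chosen only for the example, not values taken from the paper.

```python
def feature_matching_loss(real_feats, fake_feats):
    """Eq. 16 for one discriminator: sum_i (1 / N_i) * ||real_i - fake_i||_1."""
    loss = 0.0
    for real, fake in zip(real_feats, fake_feats):
        n_units = len(real)  # N_i: number of units in the i-th block output
        loss += sum(abs(r - f) for r, f in zip(real, fake)) / n_units
    return loss

def generator_objective(l_adv, l_fm, l_var, lam, mu):
    """Eq. 17: adversarial term plus weighted feature matching and variance terms."""
    return l_adv + lam * l_fm + mu * l_var

# Two toy block outputs with 2 and 4 units (the paper uses 4 blocks per discriminator).
real_feats = [[1.0, 2.0], [0.0, 0.0, 0.0, 0.0]]
fake_feats = [[0.0, 0.0], [1.0, 1.0, 1.0, 1.0]]
l_fm = feature_matching_loss(real_feats, fake_feats)   # (1 + 2) / 2 + 4 / 4
l_total = generator_objective(l_adv=1.0, l_fm=l_fm, l_var=0.5, lam=10.0, mu=1.0)
```

Dividing by $N_{i}$ keeps wide blocks from dominating the sum, so each spectrogram-side block contributes on a comparable scale.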

- Adversarial Style Combination

To generate diverse audio signals with unseen styles, the paper introduces Adversarial Style Combination (ASC), which makes mel-spectrograms with mixed styles of multiple source speakers realistic
- Two types of mixing are used: binary selection between style embeddings, and manifold mixup, a linear combination of style embeddings from different speakers:
  (Eq. 18) $\mathbf{s}_{mix} = \alpha \mathbf{s}_{i} + (1 - \alpha) \mathbf{s}_{j}$
  - $\alpha \in \{0, 1\}$ : sampled from a Bernoulli distribution for binary selection
  - $\alpha \in [0, 1]$ : sampled from the $\text{Uniform}(0, 1)$ distribution for manifold mixup
- Variance adpator는 mixed style embedding을 통해 각 information을 예측함
  1. Pitch/energy와는 달리 duration predictor는 early training step에서 wrong duration을 예측할 수 있으므로 randomly selected ground-truth $D <math xmlns="http://www.w3.org/1998/Math/MathML"><mrow data-mjx-texclass="ORD"><mi data-mjx-variant="-tex-calligraphic" mathvariant="script">D</mi></mrow></math>$ 를 사용
    - 각 variance information은 mixed style embedding의 다양한 ratio로 예측되어 style combination을 구성함
  2. 이때 final mixed hidden representation은 서로 다른 mixed style의 각 variance information을 combination 하여 얻어짐:
    (Eq. 19) $H m i x = H m e l + s m i x + p m i x + e m i x ⏟ c m i x + PE (\cdot) <math xmlns="http://www.w3.org/1998/Math/MathML"><msub><mrow data-mjx-texclass="ORD"><mi data-mjx-variant="-tex-calligraphic" mathvariant="script">H</mi></mrow><mrow data-mjx-texclass="ORD"><mi>m</mi><mi>i</mi><mi>x</mi></mrow></msub><mo>=</mo><munder><mrow data-mjx-texclass="OP"><munder><mrow><msub><mrow data-mjx-texclass="ORD"><mi data-mjx-variant="-tex-calligraphic" mathvariant="script">H</mi></mrow><mrow data-mjx-texclass="ORD"><mi>m</mi><mi>e</mi><mi>l</mi></mrow></msub><mo>+</mo><msub><mrow data-mjx-texclass="ORD"><mi mathvariant="bold">s</mi></mrow><mrow data-mjx-texclass="ORD"><mi>m</mi><mi>i</mi><mi>x</mi></mrow></msub><mo>+</mo><msub><mrow data-mjx-texclass="ORD"><mi mathvariant="bold">p</mi></mrow><mrow data-mjx-texclass="ORD"><mi>m</mi><mi>i</mi><mi>x</mi></mrow></msub><mo>+</mo><msub><mrow data-mjx-texclass="ORD"><mi mathvariant="bold">e</mi></mrow><mrow data-mjx-texclass="ORD"><mi>m</mi><mi>i</mi><mi>x</mi></mrow></msub></mrow><mo>⏟</mo></munder></mrow><msub><mrow data-mjx-texclass="ORD"><mi mathvariant="bold">c</mi></mrow><mrow data-mjx-texclass="ORD"><mi>m</mi><mi>i</mi><mi>x</mi></mrow></msub></munder><mo>+</mo><mtext>PE</mtext><mo stretchy="false">(</mo><mo>\cdot</mo><mo stretchy="false">)</mo></math>$
    (Eq. 20) $ˆ y m i x = g (H m i x) <math xmlns="http://www.w3.org/1998/Math/MathML"><msub><mrow data-mjx-texclass="ORD"><mover><mrow data-mjx-texclass="ORD"><mi mathvariant="bold">y</mi></mrow><mo stretchy="false">^</mo></mover></mrow><mrow data-mjx-texclass="ORD"><mi>m</mi><mi>i</mi><mi>x</mi></mrow></msub><mo>=</mo><mi>g</mi><mo stretchy="false">(</mo><msub><mrow data-mjx-texclass="ORD"><mi data-mjx-variant="-tex-calligraphic" mathvariant="script">H</mi></mrow><mrow data-mjx-texclass="ORD"><mi>m</mi><mi>i</mi><mi>x</mi></mrow></msub><mo stretchy="false">)</mo></math>$
    - $p m i x, e m i x <math xmlns="http://www.w3.org/1998/Math/MathML"><msub><mrow data-mjx-texclass="ORD"><mi mathvariant="bold">p</mi></mrow><mrow data-mjx-texclass="ORD"><mi>m</mi><mi>i</mi><mi>x</mi></mrow></msub><mo>,</mo><msub><mrow data-mjx-texclass="ORD"><mi mathvariant="bold">e</mi></mrow><mrow data-mjx-texclass="ORD"><mi>m</mi><mi>i</mi><mi>x</mi></mrow></msub></math>$ : mixed style에서 예측된 pitch/energy embedding
    - $c m i x <math xmlns="http://www.w3.org/1998/Math/MathML"><msub><mrow data-mjx-texclass="ORD"><mi mathvariant="bold">c</mi></mrow><mrow data-mjx-texclass="ORD"><mi>m</mi><mi>i</mi><mi>x</mi></mrow></msub></math>$ : style combination으로 생성된 mel-spectrogram $ˆ y m i x <math xmlns="http://www.w3.org/1998/Math/MathML"><msub><mrow data-mjx-texclass="ORD"><mover><mrow data-mjx-texclass="ORD"><mi mathvariant="bold">y</mi></mrow><mo stretchy="false">^</mo></mover></mrow><mrow data-mjx-texclass="ORD"><mi>m</mi><mi>i</mi><mi>x</mi></mrow></msub></math>$ 에 대한 frame-level condition
- The discriminator is then trained with the following objective:
  (Eq. 21)
- Finally, the generator's training loss is:
  (Eq. 22) $\min_{f, g} \mathcal{L}_{asc} = \mathcal{L}_{adv} + \lambda \mathcal{L}_{fm} + \mu \mathcal{L}_{var} + \nu \mathcal{L}_{mix}$
  (Eq. 23) $\mathcal{L}_{mix} = \mathbb{E}\left[\sum_{k=1}^{3} \lVert D_{k}(\hat{\mathbf{y}}_{mix}, \mathbf{c}_{mix}) - 1 \rVert_{2}\right]$
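
Eq. 19, 22 and 23 can be sketched in NumPy as below. This is a minimal illustration, not the authors' implementation. The function names, the array shapes (time × channels), and the broadcasting of one style vector per utterance over the time axis are all assumptions. The positional encoding PE(·) is left out, and the expectation in Eq. 23 is replaced by a single sample.

```python
import numpy as np

def mix_condition(h_mel, s_mix, p_mix, e_mix):
    """Frame-level condition c_mix from Eq. 19 (positional encoding omitted).

    Assumed shapes: h_mel, p_mix, e_mix are (T, d); s_mix is (d,).
    """
    # Broadcast the utterance-level style vector over the T frames.
    return h_mel + s_mix[None, :] + p_mix + e_mix

def mix_loss(disc_outputs):
    """L_mix from Eq. 23 for one sample: sum over k of ||D_k(y_mix, c_mix) - 1||_2."""
    return float(sum(np.linalg.norm(d - 1.0) for d in disc_outputs))

def asc_generator_loss(l_adv, l_fm, l_var, l_mix, lam, mu, nu):
    """Weighted sum from Eq. 22."""
    return l_adv + lam * l_fm + mu * l_var + nu * l_mix

# Toy usage with random tensors standing in for real model outputs.
rng = np.random.default_rng(0)
T, d = 5, 4
c_mix = mix_condition(
    rng.normal(size=(T, d)),  # hidden representation H_mel
    rng.normal(size=d),       # mixed style embedding s_mix
    rng.normal(size=(T, d)),  # pitch embedding p_mix
    rng.normal(size=(T, d)),  # energy embedding e_mix
)
# Hypothetical outputs of the three sub-discriminators at different resolutions.
outs = [rng.uniform(size=8), rng.uniform(size=4), rng.uniform(size=2)]
total = asc_generator_loss(1.0, 2.0, 3.0, mix_loss(outs), lam=0.5, mu=0.5, nu=0.5)
print(c_mix.shape, total)
```

The term is zero only when every sub-discriminator scores the mixed-style sample as real, so minimizing it pushes the generator toward mixed-style outputs that the discriminators accept.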

3. Experiments

- Settings

Dataset : LJSpeech, VCTK
Comparisons : TransformerTTS, FastSpeech, FastSpeech2

- Results

Single-Speaker Speech Synthesis
- On the single-speaker dataset, Multi-SpectroGAN (MSG) achieves the best performance among the compared models

With a smaller downsampling size $\tau$, MSG shows a lower CMOS but converges faster

In terms of the loss function, MSG achieves the best MOS without the reconstruction loss $\mathcal{L}_{rec}$

Multi-Speaker Speech Synthesis
- For seen speakers, MSG+ASC achieves the best performance

For unseen speakers as well, the paper's MSG+ASC performs best

Ablation Study
- Comparing performance across different discriminator conditions:
- The model without $\mathcal{H}_{mel}$ does not train at all, and pitch $\mathbf{p}$ and energy $\mathbf{e}$ also have a large effect on naturalness

Style Combination
- Comparing mel-spectrograms synthesized with interpolated style embeddings:
- Unlike attention-based autoregressive models, MSG can robustly synthesize mel-spectrograms from mixed-style embeddings
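
One way to read "interpolated style embedding" is as a weighted average of two speakers' style vectors. The sketch below assumes a linear mixing rule with weight `alpha`. The function name, the weight, and the toy 2-dimensional embeddings are illustrative and are not taken from the paper's code.

```python
import numpy as np

def interpolate_style(s_a, s_b, alpha):
    """Linearly interpolate two style embeddings (assumed mixing rule)."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    # alpha = 1 returns speaker A's style, alpha = 0 returns speaker B's style.
    return alpha * s_a + (1.0 - alpha) * s_b

# Toy embeddings standing in for the style encoder outputs of two speakers.
s_a = np.array([1.0, 0.0])
s_b = np.array([0.0, 1.0])
for alpha in (0.0, 0.5, 1.0):
    print(alpha, interpolate_style(s_a, s_b, alpha))
```

Sweeping `alpha` from 0 to 1 and synthesizing a mel-spectrogram at each step would give the kind of style interpolation compared in this experiment.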
