[Paper Review] Fast DCTTS: Efficient Deep Convolutional Text-to-Speech

feVeRin 2024. 9. 15. 18:20

Fast DCTTS: Efficient Deep Convolutional Text-to-Speech

A real-time end-to-end text-to-speech model that runs on a single CPU is needed
Fast DCTTS
- A lightweight network built by applying a range of network reduction and fidelity improvement techniques
- Introduces group highway activation, which balances the efficiency and the regularization effect of the gating mechanism
- Additionally designs the Elastic Mel-Cepstral Distortion metric to measure the fidelity of the output mel-spectrogram
Paper (ICASSP 2021) : Paper Link

1. Introduction

Conventional text-to-speech (TTS) models such as Tacotron offer high fidelity, but their recurrent neural network (RNN)-based architecture limits synthesis speed
- Non-autoregressive architectures such as FastSpeech and FastSpeech2, on the other hand, can greatly improve TTS synthesis speed
- BUT, these approaches focus on parallelization rather than on fundamentally reducing computation

-> The paper therefore proposes Fast DCTTS, which achieves real-time synthesis through computation reduction

Fast DCTTS
- Applies several computation reduction and fidelity improvement techniques to build a lightweight network
- Introduces group highway activation, which retains the regularization effect of the gating mechanism while remaining computationally efficient
- Additionally proposes Elastic Mel-Cepstral Distortion (EMCD), a metric for measuring speech quality

< Overview of Fast DCTTS >

Group highway activation is introduced to obtain both efficiency and a regularization effect
The effect of each technique is then analyzed with EMCD to design a lightweight TTS model capable of real-time synthesis on a single CPU

2. Optimization Techniques for Neural TTS

- Computational Optimization Techniques

Depthwise Separable Convolution
- In speech synthesis, depthwise separable convolution decomposes a 2D convolution over $\text{channel} \times \text{time}$ into a 1D depthwise convolution and a 1D pointwise convolution
- The time complexity of the 2D convolution is $\mathcal{O}(D_{K} M N D_{F})$
  - $D_{K}, D_{F}$ : convolution filter size and feature map size, respectively
  - $M, N$ : number of input and output channels, respectively
- The time complexities of the depthwise and pointwise convolutions are $\mathcal{O}(D_{K} M D_{F})$ and $\mathcal{O}(M N D_{F})$, respectively
  - Theoretically, depthwise separable convolution reduces computation to a fraction $\frac{1}{N} + \frac{1}{D_{K}}$ of the original
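The reduction ratio can be checked numerically. The sketch below is illustrative only: the layer sizes are hypothetical, and it assumes the pointwise cost is $M N D_{F}$, which is what the $\frac{1}{N} + \frac{1}{D_{K}}$ ratio implies.

```python
def conv_cost(d_k: int, m: int, n: int, d_f: int) -> int:
    """Operation count of a standard convolution: D_K * M * N * D_F."""
    return d_k * m * n * d_f


def separable_cost(d_k: int, m: int, n: int, d_f: int) -> int:
    """Depthwise cost (D_K * M * D_F) plus pointwise cost (M * N * D_F)."""
    return d_k * m * d_f + m * n * d_f


def reduction_ratio(d_k: int, n: int) -> float:
    """Closed-form ratio 1/N + 1/D_K."""
    return 1.0 / n + 1.0 / d_k


# Hypothetical layer sizes, chosen only for illustration.
d_k, m, n, d_f = 3, 256, 256, 100
ratio = separable_cost(d_k, m, n, d_f) / conv_cost(d_k, m, n, d_f)
print(f"measured ratio = {ratio:.4f}, closed form = {reduction_ratio(d_k, n):.4f}")
```

With a small filter the $\frac{1}{D_{K}}$ term dominates, so the theoretical saving is bounded by roughly the filter size.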
Group Highway Activation
- The original DCTTS uses the following highway activation layer:
  (Eq. 1) $y = T(x, W_{T}) H(x, W_{H}) + C(x, W_{C}) x$
  - $x, y$ : input and output feature maps, respectively, $H(x, W_{H})$ : convolution operator
  - $T(x, W_{T}), C(x, W_{C})$ : transformation gate and carry gate, respectively, where $C(x, W_{C})$ can be replaced with $1 - T(x, W_{T})$
  - $W_{T}, W_{C}, W_{H}$ : gate and operation parameters
- The gates of the highway activation help neural network training, but since each gate costs as much as $H(x, W_{H})$, the computational burden grows to $2\times$ or $3\times$
- Two approaches to highway activation can therefore be considered for computation reduction
  1. Residual Connection
    - A special variant of highway activation with $T(x, W_{T}) = C(x, W_{C}) = 1$
    - BUT, residual DCTTS degrades speech quality, and skipping and repeating increase sharply when it is applied to the audio encoder/decoder in particular
  2. Group Highway Activation
    - Feature elements are combined into groups, and the elements within a group share a gate value
    - With group size $g$, the gate vector shrinks to $\frac{1}{g}$ of its size, and replacing highway activation with group highway activation reduces the computation of the convolution layer to $(1 + \frac{1}{g}) / 2$ of the original
    - $g$ can be adjusted to trade off computational efficiency against the regularizing effect
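A minimal NumPy sketch of the gate-sharing idea follows. It is not the paper's implementation: it assumes adjacent channels form a group, uses a pointwise (kernel size 1) projection in place of a temporal convolution, and takes $C = 1 - T$. All names (`group_highway`, `w_h`, `w_t`) are hypothetical.

```python
import numpy as np


def sigmoid(z: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-z))


def group_highway(x: np.ndarray, w_h: np.ndarray, w_t: np.ndarray, g: int) -> np.ndarray:
    """x: (time, channels); w_h: (channels, channels); w_t: (channels, channels // g)."""
    h = x @ w_h                         # H(x, W_H): one candidate feature per channel
    t_small = sigmoid(x @ w_t)          # T(x, W_T): one gate value per group of g channels
    t = np.repeat(t_small, g, axis=1)   # share each gate value across its group
    return t * h + (1.0 - t) * x        # carry gate taken as C = 1 - T


def relative_cost(g: int) -> float:
    """Layer cost relative to plain highway activation: (1 + 1/g) / 2."""
    return (1.0 + 1.0 / g) / 2.0


rng = np.random.default_rng(0)
time_steps, channels, g = 5, 8, 2
x = rng.standard_normal((time_steps, channels))
w_h = rng.standard_normal((channels, channels)) * 0.1
w_t = rng.standard_normal((channels, channels // g)) * 0.1
y = group_highway(x, w_h, w_t, g)
print(y.shape, relative_cost(g))
```

Setting `g = 1` recovers an ordinary highway layer, while larger `g` shrinks the gate projection at the price of coarser gating.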
Network Size Reduction
- Reducing the network size improves synthesis speed
- The text encoder, audio encoder, and audio decoder of the original DCTTS have 14, 13, and 11 convolution layers, respectively
  - Each layer of the audio encoder and decoder has 256 channels, and each layer of the text encoder has 512 channels
- The numbers of layers and channels can thus be reduced appropriately to find a network whose output speech retains acceptable fidelity
Network Pruning
- Network pruning likewise reduces computation and size
  1. In general, a pruning algorithm estimates the importance value of each unit after training and then removes the units with low importance values
  2. For CNN models, it is applied to convolution filters or feature maps
- The paper applies a pruning method to group highway activation
  - That is, the less important feature maps are removed first, and then the gates whose features have been removed are pruned
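The two-stage pruning described above can be sketched as follows. The importance criterion is not stated here, so the L1 norm of each feature map's weights is used purely as an assumption, as is the rule that a gate is dropped once every feature map in its group is gone. `prune_masks` is a hypothetical helper.

```python
import numpy as np


def prune_masks(w_h: np.ndarray, g: int, keep_ratio: float):
    """Return (channel_mask, gate_mask) for a group highway layer.

    w_h: (in_channels, out_channels) weights producing the feature maps.
    """
    importance = np.abs(w_h).sum(axis=0)            # assumed L1 importance per feature map
    n_keep = max(1, int(round(keep_ratio * importance.size)))
    keep_idx = np.argsort(importance)[-n_keep:]     # indices of the most important maps
    channel_mask = np.zeros(importance.size, dtype=bool)
    channel_mask[keep_idx] = True
    # A gate survives only if at least one feature map in its group survives.
    gate_mask = channel_mask.reshape(-1, g).any(axis=1)
    return channel_mask, gate_mask


# Toy weights whose column norms are easy to read off.
w_h = np.diag([5.0, 4.0, 0.1, 0.2, 3.0, 0.3, 0.05, 0.01])
channel_mask, gate_mask = prune_masks(w_h, g=2, keep_ratio=0.5)
print(channel_mask.astype(int), gate_mask.astype(int))
```

In this toy case channels 0, 1, 4, and 5 are kept, so the gates of the second and fourth groups no longer drive any feature map and can be removed.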

- Fidelity Improvement Techniques

Positional encoding injects relative/absolute position information into the feature vectors
- This lets the model learn the temporal relations between feature vectors, which improves attention stability
- The paper uses the scaled positional encoding of TransformerTTS
  1. The positional information is added to the feature vector as $x'_{i} = x_{i} + \alpha \text{PE}(pos, i)$
    - $\alpha$ : trainable weight
  2. $\text{PE}(pos, i)$ is defined as:
    - For $i = 2k$, $\text{PE}(pos, 2k) = \sin\left(\frac{pos}{base^{2k/dim}}\right)$
    - For $i = 2k + 1$, $\text{PE}(pos, 2k+1) = \cos\left(\frac{pos}{base^{2k/dim}}\right)$
- Scheduled sampling can additionally be considered to improve performance
  1. In general, an autoregressive decoder depends on the previous step and is therefore difficult to parallelize
    - With teacher forcing, the autoregressive decoder can be trained on the ground-truth mel-spectrogram instead of the previous output
    - BUT, teacher forcing can suffer from a discrepancy between the ground truth and the previous output
  2. Scheduled sampling mitigates this discrepancy by starting training from the ground truth and gradually increasing the proportion of previous outputs
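Both fidelity techniques can be sketched in a few lines of NumPy. Everything here is an illustration under assumptions: `base = 10000` is the common Transformer default rather than a value given above, `alpha` is a plain float instead of a trained parameter, and the linear decay in `truth_probability` is a hypothetical schedule.

```python
import numpy as np


def positional_encoding(length: int, dim: int, base: float = 10000.0) -> np.ndarray:
    """Sinusoidal PE: sin on even indices 2k, cos on odd indices 2k+1 (dim must be even)."""
    pos = np.arange(length)[:, None]        # positions, shape (length, 1)
    k = np.arange(dim // 2)[None, :]        # pair index k, shape (1, dim // 2)
    angle = pos / base ** (2 * k / dim)     # pos / base^(2k/dim)
    pe = np.zeros((length, dim))
    pe[:, 0::2] = np.sin(angle)
    pe[:, 1::2] = np.cos(angle)
    return pe


def add_scaled_pe(x: np.ndarray, alpha: float, base: float = 10000.0) -> np.ndarray:
    """x'_i = x_i + alpha * PE(pos, i); alpha would be trainable in a real model."""
    return x + alpha * positional_encoding(x.shape[0], x.shape[1], base)


def truth_probability(step: int, total_steps: int) -> float:
    """Assumed linear schedule: 1.0 at the start (pure teacher forcing), decaying to 0.0."""
    return max(0.0, 1.0 - step / total_steps)


def mix_decoder_inputs(ground_truth, previous_output, p_truth, rng):
    """Per frame, feed the ground truth with probability p_truth, else the previous output."""
    use_truth = rng.random(ground_truth.shape[0]) < p_truth
    return np.where(use_truth[:, None], ground_truth, previous_output)


rng = np.random.default_rng(0)
frames = np.zeros((4, 6))
encoded = add_scaled_pe(frames, alpha=0.5)
mixed = mix_decoder_inputs(np.ones((4, 6)), np.zeros((4, 6)), truth_probability(0, 100), rng)
print(encoded.shape, mixed.mean())
```

At step 0 the decoder sees only ground-truth frames; as `step` approaches `total_steps`, more of its own previous outputs are mixed in.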

3. Elastic Mel Cepstral Distortion (EMCD)

Obtaining a high-fidelity lightweight TTS model requires accurate evaluation of speech quality
- The quality of an output mel-spectrogram is usually evaluated by its distance to the ground truth
  - BUT, the Euclidean metric is unsuitable when the mel-spectrogram contains skipping or repeating
- The paper therefore designs Elastic Mel-Cepstral Distortion (EMCD), a metric that accounts for alignment when computing the difference between mel-spectrograms:
  (Eq. 2) $D(i, j) = w_{m} \times \text{MCD}(x_{i}, y_{j}) + \min\{D(i, j-1), D(i-1, j), D(i-1, j-1)\}$
- Here, Mel-Cepstral Distortion (MCD) is a metric that computes the perceptual distance between speech signals:
  $\text{MCD}(i, j) = \sqrt{2 \sum_{d=1}^{D} (x_{d}[i] - y_{d}[j])^{2}}$
  - $i = \{1, ..., T_{syn}\}, j = \{1, ..., T_{gt}\}$
  - $x, y$ : synthesized and ground-truth Mel-Frequency Cepstral Coefficient (MFCC) sequences, respectively
  - $T_{syn}, T_{gt}$ : lengths of the synthesized and ground-truth MFCC sequences
- MCD can be extended by combining it with Dynamic Time Warping (DTW), so that the distance is computed while the optimal alignment is found through Dynamic Programming
  1. Each step of the alignment process is then formulated as (Eq. 2), with $w = [w_{hor}, w_{ver}, w_{diag}]$ and $m = \arg\min\{D(i, j-1), D(i-1, j), D(i-1, j-1)\}$
  2. That is, the weight vector $w$ consists of hyperparameters that control the penalties for horizontal, vertical, and diagonal transitions
    - These correspond to repeating, skipping, and matching, respectively
- The paper sets $w = (1, 1, \sqrt{2})$
  1. The matching algorithm thus assigns a penalty weight of $w_{diag} = \sqrt{2}$ to the diagonal matching path from $(i-1, j-1)$ to $(i, j)$
  2. The rectangular matching trajectory between the same points (one horizontal step plus one vertical step) receives a total penalty weight of $w_{hor} + w_{ver} = 2$
  3. After this matching, $\text{EMCD}(T_{syn}, T_{gt})$ gives the EMCD value between $x$ and $y$
  4. The EMCD value is additionally normalized by $T_{gt}$ to remove the effect of speech length from the evaluation
- By assigning penalty weights to horizontal, vertical, and diagonal transitions, EMCD can effectively measure the quality differences caused by repeating and skipping
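The recursion in (Eq. 2) maps directly onto a DTW-style dynamic program. The sketch below is a plain NumPy reading of it, not the authors' code: the boundary conditions ($D(0,0) = 0$, all other borders infinite) and the function names are assumptions, and the inputs are taken to be MFCC arrays of shape `(frames, D)`.

```python
import numpy as np


def mcd(x_i: np.ndarray, y_j: np.ndarray) -> float:
    """Frame-level distance: sqrt(2 * sum_d (x_d[i] - y_d[j])^2)."""
    return float(np.sqrt(2.0 * np.sum((x_i - y_j) ** 2)))


def emcd(x: np.ndarray, y: np.ndarray, w=(1.0, 1.0, np.sqrt(2.0))) -> float:
    """x: (T_syn, D) synthesized MFCCs, y: (T_gt, D) ground-truth MFCCs."""
    t_syn, t_gt = len(x), len(y)
    d = np.full((t_syn + 1, t_gt + 1), np.inf)   # assumed boundary: borders are unreachable
    d[0, 0] = 0.0
    for i in range(1, t_syn + 1):
        for j in range(1, t_gt + 1):
            # Predecessors in the order of Eq. 2: horizontal, vertical, diagonal.
            prev = (d[i, j - 1], d[i - 1, j], d[i - 1, j - 1])
            m = int(np.argmin(prev))
            d[i, j] = w[m] * mcd(x[i - 1], y[j - 1]) + prev[m]
    return d[t_syn, t_gt] / t_gt                 # normalize by ground-truth length


reference = np.array([[0.0], [1.0]])
repeated = np.array([[0.0], [0.0], [1.0]])       # first frame repeated once
print(emcd(reference, reference), emcd(repeated, reference))
```

Because the weight is selected by the same `argmin` that picks the predecessor, the penalty applied at each cell depends on which transition the optimal path took to reach it.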

4. Experiments

- Settings

Dataset : LJSpeech, KSS
Comparisons : DCTTS (2018)

- Results

Depthwise Separable Convolution
- With depthwise separable convolution, the number of operations drops sharply from 275B to 100B
- With regular convolution, synthesis time increases from 6.85 s to 18.16 s
Group Highway Activation
- Group highway activation (GH) uses only 75% of the computation of regular highway convolution (HC)
- In terms of synthesis time, GH DCTTS is 7% faster than Residual DCTTS

Network Size Reduction
- On the KSS dataset, $\text{GH\_L6\_C64}$ achieves the fastest synthesis speed
  - BUT, its EMCD is poor, so additional quality improvement techniques are applied
- $\text{GH}$ : group highway, $\text{L}x$ : number of layers, $\text{C}y$ : number of channels

Network Pruning and Weight Normalization Trick
- Applying network pruning to $\text{GH\_L6\_C64}$ shortens synthesis time by 18.09%, from 1.05 s to 0.86 s
  - BUT, the aggressive size reduction leads to unrecognizable output
- Applying the weight normalization trick, which performs weight normalization early in training and fine-tunes the reduced network after pruning, improves EMCD

Positional Encoding and Scheduled Sampling
- Positional encoding improves EMCD by 41.19% on LJSpeech and by 12.16% on KSS
- Scheduled sampling, in contrast, yields no clear performance improvement
Fast DCTTS
- Overall, Fast DCTTS uses only 1.6% of the computation and 2.75% of the parameters of the original DCTTS
  - On a single CPU in particular, it synthesizes 7.45 times faster than the original
- In terms of EMCD and MOS, Fast DCTTS also shows no large difference from DCTTS
