[Paper Review] Personalized Lightweight Text-to-Speech: Voice Cloning with Adaptive Structured Pruning

feVeRin 2024. 1. 10. 13:22

Personalized Lightweight Text-to-Speech: Voice Cloning with Adaptive Structured Pruning

Personalized Text-to-Speech normally needs a large amount of recordings and a large model, which makes it unsuitable for deployment on mobile devices
Voice cloning, which fine-tunes a pre-trained Text-to-Speech model, is the usual way to address this
- It is still limited because it relies on a large pre-trained model
Adaptive Structured Pruning
- Applies trainable structured pruning to voice cloning
- Learns the structured pruning mask from voice-cloning data, yielding a unique pruned model for each target speaker
Paper (ICASSP 2023) : Paper Link

1. Introduction

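End-to-end Text-to-Speech (TTS) has been studied steadily, but its customization remains under-explored
- In particular, a high-quality single-speaker end-to-end TTS system needs a large amount of data and a large model for training
  - It requires several hours of speech recordings and long training time, so it is impractical
- Personalized TTS therefore ultimately targets use on mobile devices
  - It has to satisfy three requirements at once: limited training data, fast training, and a small model size
- BUT, training a model from scratch on limited training data is difficult
  - Transfer learning is commonly used to obtain high-quality synthesis
  - Transferring a trained TTS model to an unseen speaker in this way is called Voice Cloning
Besides training the TTS model, reducing the model size is also important for personalized TTS in terms of computational cost and speed
- LightSpeech uses Neural Architecture Search to find the optimal model size within a constrained environment
- The model size can also be reduced with unstructured pruning methods
  - In that case, the resulting sparse matrix computation is hard to handle efficiently

-> A learnable structured pruning method is therefore introduced to make voice cloning for personalized TTS lightweight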

Adaptive Structured Pruning
- Unlike unstructured pruning, it removes channels of each weight matrix, producing a smaller weight matrix instead of a sparse one
  - As in the figure below, structured pruning can join the unpruned (blue) parts into a smaller matrix
  - As a result, matrix computation is accelerated and computational cost is reduced
- Pruning usually depends on a pruning criterion such as weight magnitude
  - A learnable mask is introduced to select the channels to prune
  - This enables personalized pruning for each target speaker

< Overview of This Paper >

Introduces structured pruning for voice cloning
Learns the learnable pruning mask using only few-shot data

Difference between structured pruning and unstructured pruning
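
The contrast between the two pruning styles can be sketched on a toy matrix. This is only an illustration: the weight values, the magnitude threshold, and the choice of kept rows/columns are assumptions, and the helper names `prune_unstructured` and `prune_structured` are hypothetical.

```python
# Toy comparison of unstructured vs. structured pruning on a 4x4 weight matrix.
# All values and the choice of pruned channels are illustrative assumptions.

W = [
    [0.5, -1.2, 0.3, 0.8],
    [1.1, 0.4, -0.7, 0.2],
    [-0.6, 0.9, 1.5, -0.1],
    [0.2, -0.3, 0.6, 1.0],
]

def prune_unstructured(weights, threshold):
    """Zero out individual entries whose magnitude is below the threshold.
    The matrix keeps its shape, so it stays a sparse 4x4 matrix."""
    return [[w if abs(w) >= threshold else 0.0 for w in row] for row in weights]

def prune_structured(weights, keep_rows, keep_cols):
    """Drop whole input rows / output columns (channels).
    The result is a smaller dense matrix."""
    return [[weights[r][c] for c in keep_cols] for r in keep_rows]

sparse_W = prune_unstructured(W, threshold=0.5)
dense_W = prune_structured(W, keep_rows=[0, 2], keep_cols=[1, 2, 3])

print(len(sparse_W), len(sparse_W[0]))  # shape unchanged: 4 4
print(len(dense_W), len(dense_W[0]))    # smaller dense matrix: 2 3
```

The unstructured result still needs a 4x4 multiply (or a sparse kernel), while the structured result is an ordinary 2x3 dense matrix.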

2. Background

This paper uses FastSpeech2 as the TTS model to which pruning is applied

- Voice Cloning

Voice cloning builds a TTS model for a target speaker's voice from a few-shot dataset
- Training from scratch on such limited data causes overfitting and degrades synthesis quality
- Fine-tuning a pre-trained multi-speaker TTS model is therefore preferred
  - Methods such as meta-learning can be used to speed up fine-tuning

- Transformer Block

FastSpeech2 consists of an encoder, a decoder, a variance adaptor, etc.
- The encoder and decoder are stacks of transformer blocks, and the variance adaptor uses CNN layers
  - Each transformer block has a Multi-Head Self-Attention (MHA) layer and a Feed-Forward (FFN) layer
- With input $X \in \mathbb{R}^{L \times d}$, the self-attention layer is formulated as:
  $\text{SelfAtt}(W_Q, W_K, W_V, X) = \text{softmax}\left(\frac{X W_Q W_K^T X^T}{\sqrt{d_k}}\right) X W_V$
  - $L$ : length of input $X$, $d$ : hidden dimension of input $X$
  - $d_k$ : hidden dimension of the self-attention layer
  - $W_Q, W_K, W_V \in \mathbb{R}^{d \times d_k}$ : query, key, and value matrices, respectively
- For input $X$, the output of an MHA layer with $N_h$ heads is:
  $\text{MHA}(X) = \sum_{i=1}^{N_h} \text{SelfAtt}(W_Q^{(i)}, W_K^{(i)}, W_V^{(i)}, X) W_O^{(i)}$
  - $W_O \in \mathbb{R}^{d_k \times d}$ : output matrix
- The FFN layer, which contains an up-projection and a down-projection layer, is:
  $\text{FFN}(X) = \text{ReLU}(X W_U) W_D$
  - $W_U \in \mathbb{R}^{d \times d_f}, W_D \in \mathbb{R}^{d_f \times d}$ : up/down projection, respectively
- The output of the transformer block is then:
  $X' = \text{LN}(\text{MHA}(X) + X)$
  $\text{TransformerBlock}(X) = \text{LN}(\text{FFN}(X') + X')$
  - $\text{LN}$ : layer norm
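
The formulas above can be traced with a small dependency-free sketch. It is not FastSpeech2 code: the sizes, the random weights, and all helper names are assumptions, and layer norm is written without its learnable scale and shift to keep the example short.

```python
import math
import random

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def add(A, B):
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def softmax_rows(A):
    out = []
    for row in A:
        m = max(row)  # subtract the row max for numerical stability
        exps = [math.exp(v - m) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def layer_norm(A, eps=1e-5):
    # Assumption: scale = 1 and shift = 0 (learnable parameters omitted).
    out = []
    for row in A:
        mean = sum(row) / len(row)
        var = sum((v - mean) ** 2 for v in row) / len(row)
        out.append([(v - mean) / math.sqrt(var + eps) for v in row])
    return out

def self_att(W_q, W_k, W_v, X):
    d_k = len(W_q[0])
    Q, K, V = matmul(X, W_q), matmul(X, W_k), matmul(X, W_v)
    scores = matmul(Q, transpose(K))  # X W_Q W_K^T X^T
    scores = [[s / math.sqrt(d_k) for s in row] for row in scores]
    return matmul(softmax_rows(scores), V)

def mha(heads, X):
    # heads: one (W_q, W_k, W_v, W_o) tuple per head; outputs are summed.
    out = [[0.0] * len(X[0]) for _ in X]
    for W_q, W_k, W_v, W_o in heads:
        out = add(out, matmul(self_att(W_q, W_k, W_v, X), W_o))
    return out

def ffn(W_u, W_d, X):
    hidden = [[max(0.0, v) for v in row] for row in matmul(X, W_u)]  # ReLU
    return matmul(hidden, W_d)

def transformer_block(heads, W_u, W_d, X):
    X_prime = layer_norm(add(mha(heads, X), X))
    return layer_norm(add(ffn(W_u, W_d, X_prime), X_prime))

def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
L, d, d_k, d_f, n_heads = 3, 4, 2, 8, 2  # illustrative sizes
X = rand_matrix(L, d, rng)
heads = [tuple(rand_matrix(d, d_k, rng) for _ in range(3)) + (rand_matrix(d_k, d, rng),)
         for _ in range(n_heads)]
Y = transformer_block(heads, rand_matrix(d, d_f, rng), rand_matrix(d_f, d, rng), X)
print(len(Y), len(Y[0]))  # output keeps the input shape (L, d)
```

Every weight here has $d$, $d_k$, or $d_f$ as one of its two dimensions, which is why masking those dimensions is enough to shrink the whole block.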

- Structured Pruning

Unstructured pruning selects and removes individual model parameters, such as single elements of a weight matrix
Structured pruning selects specific neurons to remove from the output of each layer

- Pruning with $L_0$ Regularization

Most pruning methods decide the binary pruning mask from criteria such as parameter magnitude
- BUT, such criteria are not suited to personalized pruning for a target speaker
- Instead, an $L_0$ norm is imposed on the binary pruning mask
  - A discrete binary mask is non-differentiable, so the hard-concrete distribution is used to relax it into a continuous, trainable mask
  - The regularization term then equals the $L_1$ norm of all masks
- A learnable mask $\mathbf{z}$ is sampled from the hard-concrete distribution as follows:
  (Eq.1)
  $\mathbf{u} \sim U(0, 1)$
  $\mathbf{s} = \text{Sigmoid}\left(\frac{\log \mathbf{u} - \log(1 - \mathbf{u}) + \log \alpha}{\beta}\right)$
  $\mathbf{z} = \min(1, \max(0, \gamma + \mathbf{s}(\eta - \gamma)))$
  - $\mathbf{u}$ : random variable sampled from the uniform distribution $U(0, 1)$
  - $\gamma \leq 0, \eta \geq 1$ : rescale the output interval of the sigmoid function from $(0, 1)$ to $(\gamma, \eta)$
  - $\beta$ : controls the steepness of the function
- The main learnable masking parameter is $\log \alpha$, the logit of the Bernoulli distribution from which $\mathbf{z}$ is sampled
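
A minimal sketch of the sampling step in (Eq.1), for a single scalar gate. The default values of `beta`, `gamma`, and `eta` are values often used with the hard-concrete distribution and are assumed here, not taken from the paper; the function name and the clamping of `u` are also illustrative choices.

```python
import math
import random

def sample_hard_concrete(log_alpha, beta=2.0 / 3.0, gamma=-0.1, eta=1.1, rng=random):
    """Draw one mask value z in [0, 1] following (Eq.1).
    beta, gamma, eta defaults are assumptions, not the paper's settings."""
    u = rng.random()
    u = min(max(u, 1e-6), 1.0 - 1e-6)  # keep log(u) and log(1 - u) finite
    logit = (math.log(u) - math.log(1.0 - u) + log_alpha) / beta
    s = 1.0 / (1.0 + math.exp(-logit))
    # Stretch s from (0, 1) to (gamma, eta), then clip back to [0, 1].
    return min(1.0, max(0.0, gamma + s * (eta - gamma)))

rng = random.Random(0)
open_gate = [sample_hard_concrete(5.0, rng=rng) for _ in range(1000)]
closed_gate = [sample_hard_concrete(-5.0, rng=rng) for _ in range(1000)]
print(sum(open_gate) / 1000, sum(closed_gate) / 1000)
```

Because the stretched interval extends past 0 and 1, the clipping puts real probability mass on exactly 0 and exactly 1, while the value remains differentiable with respect to `log_alpha` in between.
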
To prune a weight matrix $W \in \mathbb{R}^{d_1 \times d_2}$,
- a corresponding learnable mask $\mathbf{z} \in \mathbb{R}^{d_1 \times d_2}$ has to be created
  1. For structured pruning, this mask needs two learnable masking parameters $\alpha_1 \in \mathbb{R}^{d_1}$, $\alpha_2 \in \mathbb{R}^{d_2}$
    - These parameters generate the input dimension mask $\mathbf{z}_1$ and the output dimension mask $\mathbf{z}_2$ through (Eq.1)
  2. The final mask $\mathbf{z} = \mathbf{z}_1 \mathbf{z}_2^T$ is then obtained, and
  3. the pruned weight matrix $W' = W \odot \mathbf{z}$ follows from it
    - $\odot$ : element-wise product
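
The three steps above can be sketched with already-binarized masks. The matrix and mask values are made up for illustration, and `outer` and `hadamard` are hypothetical helper names.

```python
def outer(z1, z2):
    """z = z1 z2^T for two 1-D masks."""
    return [[a * b for b in z2] for a in z1]

def hadamard(W, z):
    """Element-wise product W ⊙ z."""
    return [[w * m for w, m in zip(rw, rz)] for rw, rz in zip(W, z)]

W = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]  # d1 = 2, d2 = 3
z1 = [1.0, 0.0]        # input dimension mask: drop row 1
z2 = [1.0, 0.0, 1.0]   # output dimension mask: drop column 1
z = outer(z1, z2)
W_pruned = hadamard(W, z)
print(W_pruned)  # [[1.0, 0.0, 3.0], [0.0, 0.0, 0.0]]
```

Only $d_1 + d_2$ gate values are learned instead of $d_1 \times d_2$, and a zero gate removes an entire row or column.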

3. Method

- Structured Pruning FastSpeech2

All dimensions of FastSpeech2 are prunable, except the input/output dimensions determined by the data
- The prunable dimensions are:
  1. Hidden dimensions tied to the model dimension $d$
    - Positional encoding of the encoder/decoder
    - All embedding dimensions
    - $W_Q^{(i)}, W_K^{(i)}, W_V^{(i)}, W_O^{(i)}$ of the MHA layers
    - $W_U, W_D$ of the FFN layers
    - Scale and shift of layer normalization
    - Input channels of the variance adaptor and the output linear layer
  2. $N_h, d_k^{(i)}$ of the MHA layers, which affect $W_Q^{(i)}, W_K^{(i)}, W_V^{(i)}, W_O^{(i)}$
  3. $d_f^{(i)}$ of the FFN layers, which affects $W_U, W_D$
  4. Hidden dimensions of the variance adaptor and the post-net
- For pruning, a learnable masking parameter is created for each dimension in the list above
  - The model dimension $d$ is masked with $\alpha_d$
  - The MHA dimension $d_k^{(i)}$ is masked with $\alpha_k^{(i)}$
  - The FFN dimension $d_f^{(i)}$ is masked with $\alpha_f^{(i)}$
  - The MHA heads $N_h$ are masked with $\alpha_h$
- During training, the mask $\mathbf{z}$ of each TTS parameter is generated from its input/output connections
  - Since $d$ affects many parameters through the residual connections, those parameters are masked with a $\mathbf{z}$ built from $\mathbf{z}_d$, which is generated by the masking parameter $\alpha_d$
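
One way to picture how per-dimension masks combine into per-parameter masks is sketched below. The sizes and gate values are hypothetical, and multiplying a head gate into that head's $d_k$ mask is an assumption made for this illustration, not a detail confirmed by the review.

```python
def outer(z_in, z_out):
    return [[a * b for b in z_out] for a in z_in]

# Hypothetical gate values for a tiny model: d = 3, d_f = 4, d_k = 2, N_h = 2.
z_d = [1.0, 0.0, 1.0]            # shared model-dimension mask (from alpha_d)
z_f = [1.0, 1.0, 0.0, 0.0]       # FFN hidden mask of one layer (from alpha_f)
z_k = [[1.0, 1.0], [1.0, 0.0]]   # per-head attention masks (from alpha_k)
z_h = [1.0, 0.0]                 # head mask (from alpha_h); head 1 is pruned

masks = {
    "W_U": outer(z_d, z_f),      # d x d_f: input side follows z_d
    "W_D": outer(z_f, z_d),      # d_f x d: output side follows z_d
}
for i in range(len(z_h)):
    # Assumption: a head gate scales every channel of that head.
    head_out = [z_h[i] * g for g in z_k[i]]
    masks[f"W_Q[{i}]"] = outer(z_d, head_out)   # d x d_k
    masks[f"W_O[{i}]"] = outer(head_out, z_d)   # d_k x d

for name, m in masks.items():
    print(name, m)
```

The same `z_d` appears in every mask, which mirrors the note that $d$ is shared across many parameters through the residual connections.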

- Optimizing Adaptive Structured Pruning Mask

To generate the pruning mask $\mathbf{z}$ for every FastSpeech2 parameter, learnable masking parameters $\alpha$ are introduced
- The $L_1$ norm of all masks is then computed and used as the regularization term:
  $L_{reg} = \sum_{\mathbf{z}} ||\mathbf{z}||_1$
- At the start of training, all $\alpha$ are initialized to large values so that the sampled $\mathbf{z}$ is close to 1
- The final loss for pruning the voice-cloning model is:
  $L_{total} = L_{TTS} + \frac{1}{\lambda} L_{reg} = L_{TTS} + \frac{1}{\lambda} \sum_{\mathbf{z}} ||\mathbf{z}||_1$
  - $\lambda$ : weighting factor for the regularization
  - In the paper, $\lambda$ is set to the total number of TTS parameters, so the regularization term becomes the density of the model
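
A small numeric sketch of the loss above, assuming $\lambda$ equals the parameter count as described. The masks, the TTS loss value, and the function names are made up for illustration.

```python
def mask_l1(z):
    """L1 norm of one 2-D mask."""
    return sum(abs(v) for row in z for v in row)

def total_loss(tts_loss, masks, num_params):
    """L_total = L_TTS + (1 / lambda) * sum of mask L1 norms, with lambda = num_params."""
    reg = sum(mask_l1(z) for z in masks)
    return tts_loss + reg / num_params

masks = [
    [[1.0, 0.0], [1.0, 0.0]],   # mask of a 2x2 weight, half of it kept
    [[1.0, 1.0, 1.0, 1.0]],     # mask of a 1x4 weight, fully kept
]
num_params = 8                  # total number of masked parameters
density = sum(mask_l1(z) for z in masks) / num_params
print(density)                              # 0.75
print(total_loss(0.5, masks, num_params))   # 1.25
```

With this choice of $\lambda$ the regularization term stays in $[0, 1]$ regardless of model size, since it is the fraction of parameters still active.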

- Inference

At inference, the step of generating the continuous pruning mask $\mathbf{z}$ with (Eq.1) is skipped
- Instead, the binary pruning mask is decided directly from each $\log \alpha$
  - Here $\text{Sigmoid}((\log \alpha) / \beta)$ represents the Bernoulli distribution
  - Empirically, most of the probabilities are close to 0 or 1, and only 2% of them fall in the range $(0.05, 0.95)$
  -> Therefore, $\text{Sigmoid}((\log \alpha) / \beta) = 0.5$ is used as the threshold
- For each element $z_i$ of $\mathbf{z}$ and its corresponding element $\alpha_i$, the following condition is evaluated:
  $z_i = \begin{cases} 0, & \text{Sigmoid}((\log \alpha_i) / \beta) < 0.5 \\ 1, & \text{Sigmoid}((\log \alpha_i) / \beta) \geq 0.5 \end{cases}$
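
The thresholding rule can be sketched as follows, together with one possible way of turning the binary masks into a physically smaller matrix. The `log_alpha` values, the `beta` default, and the helper names are assumptions for illustration.

```python
import math

def binary_mask(log_alphas, beta=2.0 / 3.0):
    """z_i = 1 if Sigmoid(log_alpha_i / beta) >= 0.5, else 0.
    The beta default is an assumed value."""
    return [1 if 1.0 / (1.0 + math.exp(-la / beta)) >= 0.5 else 0 for la in log_alphas]

def slice_weight(W, row_mask, col_mask):
    """Keep only the rows and columns whose gate is 1."""
    rows = [r for r, keep in enumerate(row_mask) if keep]
    cols = [c for c, keep in enumerate(col_mask) if keep]
    return [[W[r][c] for c in cols] for r in rows]

log_alpha_in = [4.2, -3.1, 0.7]   # hypothetical learned logits, input side
log_alpha_out = [-5.0, 2.5]       # hypothetical learned logits, output side
W = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

z_in, z_out = binary_mask(log_alpha_in), binary_mask(log_alpha_out)
W_small = slice_weight(W, z_in, z_out)
print(z_in, z_out)   # [1, 0, 1] [0, 1]
print(W_small)       # [[2.0], [6.0]]
```

For any positive $\beta$, the condition is equivalent to checking whether $\log \alpha_i \geq 0$, so the sigmoid does not actually need to be evaluated.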

4. Experiments

- Settings

Dataset : LibriTTS, VCTK
Vocoder : MelGAN
Fine-tuning and pruning are applied to FastSpeech2, targeting 8-shot voice cloning

- Results

GT : ground-truth, GT + Vocoder : ground-truth with Vocoder
FT : fine-tuning, Prune : pruning with voice-cloning data, Prune' : pruning with pre-training data
- Prune', which is pruned using only the pre-training data, performs poorly
- The models with voice cloning applied show high naturalness

Comparing speaker identification accuracy on the synthesized samples with a speaker classifier,
- The fine-tuned models show high speaker and accent accuracy
  - The uncompressed voice cloning model does not achieve the best performance
- Fine-tuning after pruning achieves the highest accuracy and the second-highest compression ratio (sparsity)
  - Pruning followed by fine-tuning is therefore the most stable pipeline for voice cloning

- Other Pruning Advantages

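The pruned model can double the inference speed and halve the GPU usage
- Compared with distillation, pruning does not require training from scratch, so training time can be shortened
- In addition, pruning starts from a pre-trained high-quality TTS model, so quality can be maintained throughout the pruning process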
