[Paper Review] FastPitchFormant: Source-Filter based Decomposed Modeling for Speech Synthesis

feVeRin 2024. 12. 21. 09:55

FastPitchFormant: Source-Filter based Decomposed Modeling for Speech Synthesis

In Text-to-Speech, a large pitch-shift scale causes quality degradation and speaker characteristic deformation
FastPitchFormant
- A Feed-Forward Transformer model designed on the basis of source-filter theory
- Models text and acoustic features separately, which prevents the model from learning the relationship between the two features
Paper (INTERSPEECH 2021) : Paper Link

1. Introduction

Text-to-Speech (TTS) aims to synthesize a natural voice for a given sentence
- In particular, the Feed-Forward Transformer (FFT) block plays an important role in improving the synthesis quality of mel-spectrograms
  - For example, non-autoregressive FFT-based TTS models such as FastSpeech and FastSpeech2 improve synthesis quality by feeding acoustic features such as duration/pitch/energy to the acoustic decoder
- Meanwhile, FastPitch enables fine-grained prosody control by modifying pitch values at the character level
  1. In particular, it can support pitch shift, which generates speech with a manipulated pitch while preserving the speaker characteristic
  2. BUT, the acoustic decoder of FastPitch processes text and pitch information together and generates speech from pitch-conditioned text information
    - Therefore, the decoder learns the relationship between text and pitch
  3. As a result, the pitch expressiveness and speaker similarity of FastPitch degrade when the pitch deviates from the average pitch
- To process text and prosodic information separately, the following approaches can be considered:
  1. Extracting latent variables of the acoustic features with an additional neural network trained in an unsupervised manner
    - BUT, the desired prosodic information may not be contained in the latent variables
  2. Using source-filter theory
    - The sound source and the formant frequencies shaped by the vocal tract filter affect the fundamental frequency and phonation, respectively
    - BUT, speech has a short duration per character and frequent pitch changes, which makes it difficult to model

-> Therefore, the paper proposes FastPitchFormant, which applies source-filter theory to neural TTS

FastPitchFormant
- Generates mel-spectrograms from formant-/excitation-related representations that are modeled separately through a decomposed structure
- Designs a learning objective suited to source-filter theory and the non-autoregressive FFT model

< Overview of FastPitchFormant >

A non-autoregressive FFT-based TTS model built on source-filter theory
As a result, it achieves better synthesis quality and pitch controllability than the FastPitch baseline

2. Method

FastPitchFormant consists of a text encoder, temporal predictors, formant/excitation generators, and a spectrogram decoder
- All components except the temporal predictors are built as stacks of Feed-Forward Transformer (FFT) blocks
- Each temporal predictor consists of two 1D convolutional layers and is trained to predict the ground-truth duration or pitch
- For multi-speaker TTS, the speaker embedding is obtained from a speaker embedding lookup table

- Text Encoder and Temporal Predictor

The phoneme sequence is converted into phoneme embedding vectors through a lookup embedding table with positional embedding
- The phoneme embedding vectors are then passed to the text encoder, which produces the hidden embedding
  - The hidden embedding is used as the input to the two temporal predictors, one for duration and one for pitch
- The pitch embedding is obtained by passing the predicted pitch values through a 1D convolutional layer
  1. The hidden embedding and the pitch embedding are each combined with the speaker embedding
  2. The two representations are then discretely upsampled so that they align with the predicted durations
- Let the upsampled phoneme representation be $h \in \mathbb{R}^{D \times T}$ and the upsampled pitch representation be $p \in \mathbb{R}^{D \times T}$; $h$ and $p$ are then passed to the formant/excitation generators
  - $D$ : vector dimension, $T$ : total number of frames
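The discrete upsampling step can be illustrated with a small sketch. It assumes that upsampling means repeating each phoneme-level vector for its predicted number of frames, as in FastSpeech-style length regulation; the function name `upsample_by_duration` and all toy values are hypothetical, and the result is stored frame-major ($T \times D$) for readability rather than $D \times T$.

```python
from typing import List, Sequence


def upsample_by_duration(
    vectors: Sequence[Sequence[float]], durations: Sequence[int]
) -> List[List[float]]:
    """Repeat the i-th phoneme-level vector durations[i] times."""
    if len(vectors) != len(durations):
        raise ValueError("need exactly one duration per phoneme")
    frames: List[List[float]] = []
    for vector, n_frames in zip(vectors, durations):
        if n_frames < 0:
            raise ValueError("durations must be non-negative")
        # Copy the vector so frames do not share one underlying list.
        frames.extend(list(vector) for _ in range(n_frames))
    return frames


# Toy input: 3 phonemes with D = 2 (all values are made up).
hidden = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
pitch = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
durations = [2, 1, 3]  # predicted frames per phoneme, so T = 6

h = upsample_by_duration(hidden, durations)
p = upsample_by_duration(pitch, durations)
print(len(h), len(p))  # both equal sum(durations)
```

Because both representations are expanded with the same durations, $h$ and $p$ end up with the same number of frames and can be added element-wise later.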

- Formant and Excitation Generator

The paper introduces the formant and excitation generators on the basis of source-filter theory
- The formant generator uses only $h$ to predict a formant representation that carries formant-related information such as linguistic information
- The excitation generator uses both $h$ and $p$ to predict an excitation representation that carries excitation-related information such as prosody
  1. In particular, pitch control accuracy degrades when the excitation generator uses only $p$
    - Therefore, a self-attention mechanism whose query combines $h$ and $p$ is used to improve pitch control accuracy
  2. As a result, the attention and the query $Q$ in the first self-attention layer of the excitation generator are:
    (Eq. 1) $\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{Q K^{T}}{\sqrt{d}}\right) V$
    (Eq. 2) $Q = W_{Q}(h + p) + b_{Q}$
    - $K, V$ : key and value of the self-attention mechanism
    - $W_{Q}, b_{Q}$ : weight matrix and bias for the query
    - $d$ : key dimension used for scaling
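Eq. 1 and Eq. 2 can be traced on toy numbers with a dependency-free sketch. This is only an illustration under stated assumptions: frames are stored as rows ($T \times D$), so Eq. 2 becomes $(h + p) W_{Q}^{T} + b_{Q}$; the key and value are simply taken from $p$ here because the notes above do not say how they are formed; and every helper name and number is hypothetical.

```python
import math
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a: Matrix) -> Matrix:
    return [list(col) for col in zip(*a)]


def softmax_rows(a: Matrix) -> Matrix:
    out = []
    for row in a:
        peak = max(row)  # subtract the row maximum for numerical stability
        exps = [math.exp(x - peak) for x in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def make_query(h: Matrix, p: Matrix, w_q: Matrix, b_q: List[float]) -> Matrix:
    """Eq. 2 with frames as rows: Q = (h + p) W_Q^T + b_Q."""
    summed = [[x + y for x, y in zip(h_row, p_row)] for h_row, p_row in zip(h, p)]
    projected = matmul(summed, transpose(w_q))
    return [[value + bias for value, bias in zip(row, b_q)] for row in projected]


def attention(q: Matrix, k: Matrix, v: Matrix) -> Matrix:
    """Eq. 1: softmax(Q K^T / sqrt(d)) V, with d the key dimension."""
    d = len(k[0])
    scores = matmul(q, transpose(k))
    scaled = [[s / math.sqrt(d) for s in row] for row in scores]
    return matmul(softmax_rows(scaled), v)


# Toy frame-level inputs with T = 3 and D = 2 (made-up values).
h = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
p = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w_q = [[1.0, 0.0], [0.0, 1.0]]  # identity weights, for illustration only
b_q = [0.0, 0.0]

q = make_query(h, p, w_q, b_q)
out = attention(q, k=p, v=p)  # assumption: key and value taken from p
print(out)
```

Each output row is a weighted average of the value rows, so feeding $h$ into the query changes which frames are attended to without changing what is being averaged.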

- Spectrogram Decoder

The spectrogram decoder consists of two stacked FFT blocks and three Fully-Connected (FC) layers
- Each FC layer produces a prediction of the target mel-spectrogram
  1. The first mel-spectrogram is obtained from the summation of the formant and excitation representations, projected through the first FC layer
  2. To generate the second and third mel-spectrograms, the summation of the formant and excitation representations is passed through the stacked FFT blocks and projected to mel-spectrograms by the second and third FC layers
- In source-filter theory, the source spectrum is generally multiplied by the vocal tract filter, but FastPitchFormant uses log-scale mel-spectrograms, so the multiplication is replaced with a summation
- The output of each FC layer is then used in a learning objective that includes the $L_{2}$ loss as an iterative loss
  - This iterative loss trains the spectrogram decoder to generate the final mel-spectrogram from the summation of the formant and excitation representations
- At inference, the mel-spectrogram from the third FC layer is used as the final output of FastPitchFormant
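The replacement of multiplication by summation follows from $\log(a \cdot b) = \log a + \log b$. The short check below uses made-up source and filter magnitudes (not values from the paper) to show that summing in the log domain and multiplying in the linear domain describe the same spectrum.

```python
import math


def combine_linear(source, vocal_tract):
    # Source-filter view in the linear domain: per-bin multiplication.
    return [s * f for s, f in zip(source, vocal_tract)]


def combine_log(log_source, log_vocal_tract):
    # The same combination in the log domain: per-bin summation.
    return [s + f for s, f in zip(log_source, log_vocal_tract)]


# Hypothetical positive magnitudes for three frequency bins.
source = [0.5, 2.0, 4.0]
vocal_tract = [1.5, 0.25, 3.0]

linear = combine_linear(source, vocal_tract)
log_domain = combine_log(
    [math.log(s) for s in source],
    [math.log(f) for f in vocal_tract],
)
recovered = [math.exp(x) for x in log_domain]

print(all(math.isclose(a, b) for a, b in zip(linear, recovered)))
```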

- Learning Objective

The learning objective of FastPitchFormant is:
(Eq. 3) $\mathcal{L}_{final} = \frac{1}{TM} \sum_{i=1}^{3} \mathcal{L}_{spec_{i}} + \alpha \mathcal{L}_{p} + \beta \mathcal{L}_{d}$
- $M$ : number of mel-spectrogram bins
- $\mathcal{L}_{spec_{i}}$ : $L_{2}$ loss between the target and the $i$-th mel-spectrogram predicted by the $i$-th FC layer
- $\mathcal{L}_{p}, \mathcal{L}_{d}$ : $L_{2}$ losses between the target and predicted pitch/duration, respectively
- $\alpha, \beta$ : weights of the pitch and duration losses
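A minimal sketch of how Eq. 3 could be evaluated is given below. Several details are assumptions rather than facts from the paper: each $\mathcal{L}_{spec_{i}}$ is taken as a sum of squared errors over all $T \times M$ entries (which the $\frac{1}{TM}$ factor then averages), $\mathcal{L}_{p}$ and $\mathcal{L}_{d}$ are taken as mean squared errors, and the default values of `alpha` and `beta` are placeholders.

```python
from typing import Sequence

Mel = Sequence[Sequence[float]]  # T frames, each with M mel bins


def squared_error_sum(pred: Mel, target: Mel) -> float:
    return sum(
        (a - b) ** 2
        for pred_row, target_row in zip(pred, target)
        for a, b in zip(pred_row, target_row)
    )


def mean_squared_error(pred: Sequence[float], target: Sequence[float]) -> float:
    return sum((a - b) ** 2 for a, b in zip(pred, target)) / len(target)


def final_loss(
    mel_preds: Sequence[Mel],  # outputs of the three FC layers
    mel_target: Mel,
    pitch_pred: Sequence[float],
    pitch_target: Sequence[float],
    dur_pred: Sequence[float],
    dur_target: Sequence[float],
    alpha: float = 0.1,  # placeholder weight, not from the paper
    beta: float = 0.1,  # placeholder weight, not from the paper
) -> float:
    n_frames = len(mel_target)
    n_bins = len(mel_target[0])
    # Iterative loss: every FC output is compared with the same target.
    spec_term = sum(squared_error_sum(pred, mel_target) for pred in mel_preds)
    spec_term /= n_frames * n_bins
    pitch_term = mean_squared_error(pitch_pred, pitch_target)
    dur_term = mean_squared_error(dur_pred, dur_target)
    return spec_term + alpha * pitch_term + beta * dur_term


# Toy case with T = 2 and M = 2: every prediction is off by exactly 1.
target = [[0.0, 0.0], [0.0, 0.0]]
preds = [[[1.0, 1.0], [1.0, 1.0]] for _ in range(3)]
loss = final_loss(preds, target, [1.0, 1.0], [0.0, 0.0], [1.0, 1.0], [0.0, 0.0])
print(loss)
```

In the toy case the spectrogram term is $(4 + 4 + 4) / 4 = 3$, and the pitch and duration terms each contribute their weight times 1.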

3. Experiments

- Settings

Dataset : Korean Speaker Dataset
Comparisons : FastPitch

- Results

Objective Evaluation
- First, the pitch value shifted by $\lambda$ semitones, $f_{\lambda}$, is computed as:
  (Eq. 4) $f_{\lambda} = 2^{\frac{\lambda}{12}} \times f_{0}$
  - $f_{0}$ : original pitch value
  - $\lambda \in \{-8, -6, -4, 0, 4, 6, 8\}$
- In terms of pitch control accuracy, FastPitchFormant (FPF) supports pitch control over a wider range than FastPitch (FP)
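Eq. 4 is easy to sanity-check numerically. The sketch below applies it to a hypothetical original pitch of 220 Hz (an illustrative value, not taken from the experiments) for each $\lambda$ listed above; the function name `shift_pitch` is also an assumption.

```python
def shift_pitch(f0_hz: float, semitones: float) -> float:
    """Eq. 4: shift a pitch value by the given number of semitones."""
    return 2.0 ** (semitones / 12.0) * f0_hz


SHIFTS = (-8, -6, -4, 0, 4, 6, 8)  # semitone shifts used in the evaluation
f0 = 220.0  # hypothetical original pitch in Hz

for lam in SHIFTS:
    print(f"lambda={lam:+d}: {shift_pitch(f0, lam):.2f} Hz")
```

A shift of 12 semitones doubles the pitch and a shift of -12 halves it, so the evaluated range of $\pm 8$ semitones stays within one octave of the original pitch in either direction.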

$F_{0}$ Frame Error (FFE) comparison

Comparing the mel-spectrograms of the excitation and formant representations shows that the excitation and formant generators of FPF model the actions of the vocal cords and the vocal tract, respectively

(a) Excitation Representation (b) Formant Representation (c) Final Output

FPF also outperforms FP in terms of Mel-Cepstral Distortion (MCD)

Comparing the spectral envelopes across $\lambda$ shows that FPF preserves the original shape, whereas the envelope of FP is distorted depending on $\lambda$

Subjective Evaluation
- FPF also achieves high synthesis quality in terms of MOS

It also achieves a higher MOS than FP on pitch-shifted speech

Speaker Preservation
- When the pitch-shift scale is small ( $|\lambda| \leq 4$ ), the difference in speaker similarity is not large
- When $|\lambda| > 4$, FPF preserves the speaker characteristic more effectively than FP
