[Paper Review] Flow-TTS: A Non-Autoregressive Network for Text to Speech Based on Flow

feVeRin 2024. 2. 6. 11:29

Flow-TTS: A Non-Autoregressive Network for Text to Speech Based on Flow

Generative flow can be used for non-autoregressive Text-to-Speech
Flow-TTS
- Synthesizes high-quality speech using only a single feed-forward network
- Uses flow for spectrum generation and jointly learns alignment and spectrogram generation through a single network
Paper (ICASSP 2020) : Paper Link

1. Introduction

Text-to-Speech (TTS) generates an output acoustic sequence $\{y_1, y_2, \ldots, y_T\}$ from an input text sequence $\{x_1, x_2, \ldots, x_N\}$
- Concatenative TTS and statistical TTS suffer from complex pipelines and produce unnatural-sounding speech
- End-to-End TTS can overcome these limitations and consists of two parts
  1. A spectrogram generation network that converts normalized text symbols into time-aligned features such as mel-spectrograms
  2. A vocoder that converts the time-aligned features into audio
- The paper focuses on the spectrogram generation network and adopts WaveGlow as the vocoder

Spectrogram generation networks fall into two categories
1. Autoregressive model
  - Achieves high speech quality but decodes slowly
  - Teacher forcing can ease training, but it can introduce a mismatch between the predicted distribution and the real data distribution
2. Non-autoregressive model
  - Greatly improves inference speed over autoregressive models, but learning the alignment between the text sequence and the spectrogram sequence is difficult
  - As a result, guidance from a well-trained autoregressive teacher is required, which complicates training

-> To overcome these limitations of non-autoregressive TTS, the paper proposes Flow-TTS, which builds on generative flow

Flow-TTS
- Uses generative flow (Glow) for efficient density estimation and sampling
- Relies on a single feed-forward network, without parameter distillation from a teacher model
- Jointly learns alignment and spectrogram generation through that single feed-forward network

< Overview of Flow-TTS >

According to the paper, the first attempt to apply generative flow to TTS
Jointly learns alignment and spectrogram generation through a single feed-forward network
As a result, achieves better synthesis quality than existing autoregressive models

2. Flow-Based Generative Model

A flow-based generative model applies a sequence of invertible transforms to turn a simple probability density, such as a Gaussian distribution, into a complex density
- Given a random variable $\mathbf{z}$ with a known probability density function $\mathbf{z} \sim \pi(\mathbf{z})$,
  - A flow-based generative model maps $\mathbf{z}$ to a new random variable $\mathbf{y}$ using a sequence of transformation functions $f = f_1 \circ f_2 \circ \ldots \circ f_L$
  - Here $\mathbf{y}$ has the same dimension as $\mathbf{z}$, and each $f_i$ is invertible
- The probability density function of $\mathbf{y}$ is then computed through the change of variables:
  (Eq. 1) $\log p_Y(\mathbf{y}) = \log \pi(\mathbf{z}) + \sum_{i=1}^{L} \log \left| \det \frac{\partial f_i}{\partial f_{i-1}} \right|$
  - $\det \frac{\partial f_i}{\partial f_{i-1}}$ : Jacobian determinant of $f_i$
- For efficient computation, flow-based generative models use transformations whose Jacobian is a triangular matrix, and use an isotropic Gaussian as $\pi(\mathbf{z})$
  - The model then fits high-dimensional data by maximizing (Eq. 1)
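
The change-of-variables computation in (Eq. 1) can be sketched in NumPy. This is a toy illustration, not the paper's model: it assumes a stack of elementwise affine maps that send data $\mathbf{y}$ to the latent $\mathbf{z}$ (the direction in which the log-determinants are added), and the names `log_prob_flow`, `scales`, and `shifts` are hypothetical.

```python
import numpy as np

def standard_normal_logpdf(z):
    """Log-density of an isotropic unit Gaussian, summed over dimensions."""
    return -0.5 * np.sum(z ** 2 + np.log(2.0 * np.pi), axis=-1)

def log_prob_flow(y, scales, shifts):
    """Log-likelihood of y under a stack of elementwise affine maps.

    Each layer maps h -> scale * h + shift (data-to-latent direction), so its
    Jacobian is diagonal and log|det| is the sum of log|scale|.
    """
    h = y
    total_logdet = 0.0
    for scale, shift in zip(scales, shifts):
        h = scale * h + shift
        total_logdet += np.sum(np.log(np.abs(scale)))
    # Eq. 1: prior log-density of the latent plus the summed log-determinants.
    return standard_normal_logpdf(h) + total_logdet

# Toy example (hypothetical values): two layers on 3-dimensional data.
scales = [np.array([2.0, 0.5, 1.5]), np.array([0.5, 4.0, 1.0])]
shifts = [np.array([0.1, -0.2, 0.0]), np.array([0.0, 0.3, -0.1])]
y = np.array([0.3, -1.2, 0.8])
log_p = log_prob_flow(y, scales, shifts)
```

Because the Jacobian of each layer is diagonal (a special case of triangular), its log-determinant is just a sum, which is the efficiency argument made above.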

3. Flow-TTS

Flow-TTS is based on generative flow (Glow)
- The overall architecture consists of an Encoder, a Decoder, a Length predictor, and a Positional attention layer

- Encoder

The encoder converts text symbols into trainable embeddings and then applies convolution blocks
- Each convolution block consists of a 1D convolution layer, ReLU activation, Batch Normalization, and Dropout
- In addition, an LSTM layer is placed at the end of the encoder to extract long-range textual information
  - Because the text is much shorter than the output spectrogram, the LSTM can greatly improve model convergence without hurting inference speed

- Length Predictor

The length predictor predicts the length of the output spectrogram sequence
- An autoregressive model can determine the length with a special stop token, but Flow-TTS predicts all output frames in parallel and therefore has to predict the sequence length in advance
- Structurally, it is a 2-layer 1D convolution network, where each layer includes Layer Normalization and Dropout
- In addition, an accumulated layer at the end of the length predictor accumulates all symbol durations into the final length
  - The length predictor sits after the encoder and predicts the length in the logarithmic domain for stable training
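
The accumulation step can be sketched as follows. This is a sketch under assumptions: it supposes the predictor emits one log-duration per symbol, and the exponentiation, rounding, and one-frame minimum are illustrative choices rather than details stated in the review. `accumulate_length` is a hypothetical name.

```python
import numpy as np

def accumulate_length(log_durations):
    """Convert per-symbol log-durations into a total frame count.

    Assumption: durations are mapped back to the linear domain, summed,
    and rounded to at least one frame.
    """
    durations = np.exp(log_durations)            # back to the linear domain
    return max(1, int(np.round(durations.sum())))

# Hypothetical predictions for a 4-symbol input (3, 5, 2 and 7 frames).
log_durations = np.log(np.array([3.0, 5.0, 2.0, 7.0]))
total_frames = accumulate_length(log_durations)
```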

- Positional Attention

Positional attention learns the alignment between the input text sequence and the output spectrogram sequence
- To do so, it applies a multi-head dot-product attention mechanism
  - The encoder's output hidden states serve as key and value, and the positional encoding of the spectrogram length serves as query
- During training the spectrogram length is taken from the ground-truth spectrogram, and at inference it is predicted by the length predictor
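
The query/key/value roles described above can be sketched with a simplified attention in NumPy. Several things here are assumptions for illustration: a single head instead of multiple heads, no learned projections, and a sinusoidal positional encoding. `positional_attention` and `sinusoidal_positions` are hypothetical names.

```python
import numpy as np

def sinusoidal_positions(length, dim):
    """Sinusoidal positional encoding of shape (length, dim); dim must be even."""
    pos = np.arange(length)[:, None]
    idx = np.arange(0, dim, 2)[None, :]
    angles = pos / np.power(10000.0, idx / dim)
    enc = np.zeros((length, dim))
    enc[:, 0::2] = np.sin(angles)
    enc[:, 1::2] = np.cos(angles)
    return enc

def positional_attention(encoder_states, out_length):
    """Single-head scaled dot-product attention with positional queries.

    encoder_states: (N, d) text hidden states used as keys and values.
    Returns the (out_length, d) context and the (out_length, N) alignment.
    """
    _, d = encoder_states.shape
    query = sinusoidal_positions(out_length, d)
    scores = query @ encoder_states.T / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ encoder_states, weights

rng = np.random.default_rng(0)
encoder_states = rng.normal(size=(6, 16))   # N=6 symbols, d=16 (toy sizes)
context, alignment = positional_attention(encoder_states, out_length=40)
```

Each of the 40 output positions receives a distribution over the 6 text symbols, so the output already has spectrogram length regardless of the text length.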

- Decoder

The decoder uses the Glow architecture, which consists of a multi-scale architecture and a series of flow steps
- Each flow step consists of two invertible transformation layers: an invertible $1 \times 1$ convolution and an affine coupling layer
- A coupling block feeds the condition generated by the positional attention layer into the flow
  - For the squeeze operation, the spectrogram is grouped with a group size of 8
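
One common way to implement such a squeeze is to stack consecutive frames along the channel axis. The sketch below assumes that layout and an 80-bin mel-spectrogram; neither the memory layout nor the handling of leftover frames is specified in the review, and `squeeze` / `unsqueeze` are hypothetical names.

```python
import numpy as np

def squeeze(mel, group_size=8):
    """Group consecutive frames: (T, C) -> (T // group_size, C * group_size)."""
    t, c = mel.shape
    t_trim = (t // group_size) * group_size     # drop leftover frames (assumption)
    return mel[:t_trim].reshape(t_trim // group_size, group_size * c)

def unsqueeze(grouped, group_size=8):
    """Inverse of squeeze: (T', C * group_size) -> (T' * group_size, C)."""
    t, gc = grouped.shape
    return grouped.reshape(t * group_size, gc // group_size)

mel = np.arange(16 * 80, dtype=float).reshape(16, 80)   # 16 frames, 80 mel bins (toy)
grouped = squeeze(mel)
restored = unsqueeze(grouped)
```

The operation only rearranges values, so it is trivially invertible and contributes nothing to the log-determinant.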
Affine Coupling Layer
- The affine coupling layer is an invertible transformation whose forward and reverse passes are both computationally efficient and whose log-determinant is tractable:
  (Eq. 2) $\mathbf{z}_a, \mathbf{z}_b = split(\mathbf{z})$
  (Eq. 3) $(\log \mathbf{s}, \mathbf{t}) = NN(\mathbf{z}_b)$
  (Eq. 4) $\mathbf{s} = \exp(\log \mathbf{s})$
  (Eq. 5) $\mathbf{y}_a = \mathbf{s} \cdot \mathbf{z}_a + \mathbf{t}$
  (Eq. 6) $\mathbf{y}_b = \mathbf{z}_b$
  (Eq. 7) $\mathbf{y} = concat(\mathbf{y}_a, \mathbf{y}_b)$
  - $split()$ : splits the input tensor in half, $concat()$ : concatenates the output tensors
  - $NN()$ : non-linear transformation
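
(Eq. 2) to (Eq. 7) and their inverse can be sketched in NumPy. The stand-in `toy_nn` (one linear layer followed by tanh) is a placeholder assumption for $NN()$, not the paper's coupling block, and all function names are hypothetical.

```python
import numpy as np

def toy_nn(z_b, weight, bias):
    """Stand-in for NN(): a linear layer plus tanh producing (log_s, t)."""
    out = np.tanh(z_b @ weight + bias)
    log_s, t = np.split(out, 2, axis=-1)
    return log_s, t

def coupling_forward(z, weight, bias):
    z_a, z_b = np.split(z, 2, axis=-1)            # Eq. 2
    log_s, t = toy_nn(z_b, weight, bias)          # Eq. 3
    s = np.exp(log_s)                             # Eq. 4
    y_a = s * z_a + t                             # Eq. 5
    y = np.concatenate([y_a, z_b], axis=-1)       # Eq. 6 and 7 (y_b = z_b)
    logdet = np.sum(log_s, axis=-1)               # log|det Jacobian| per sample
    return y, logdet

def coupling_inverse(y, weight, bias):
    y_a, y_b = np.split(y, 2, axis=-1)
    log_s, t = toy_nn(y_b, weight, bias)          # y_b equals z_b, so NN() is reusable
    z_a = (y_a - t) * np.exp(-log_s)
    return np.concatenate([z_a, y_b], axis=-1)

rng = np.random.default_rng(0)
dim = 8
weight = rng.normal(scale=0.5, size=(dim // 2, dim))
bias = rng.normal(scale=0.1, size=dim)
z = rng.normal(size=(5, dim))
y, logdet = coupling_forward(z, weight, bias)
z_rec = coupling_inverse(y, weight, bias)
```

Note that `toy_nn` itself is never inverted: the inverse only needs to re-evaluate it on the unchanged half, which is why $NN()$ can be an arbitrary non-linear function.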
Invertible $1 \times 1$ Convolutional Layer
- An invertible $1 \times 1$ convolution layer is applied before the affine coupling layer to permute the channel ordering
- Its weight matrix is initialized as a random orthogonal matrix, whose log-determinant is 0
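
A minimal NumPy sketch of this layer, assuming a (frames, channels) layout in which a $1 \times 1$ convolution is a per-frame matrix product. Drawing the orthogonal matrix via QR decomposition is one standard option and is an assumption here; the function names are hypothetical.

```python
import numpy as np

def random_orthogonal(channels, rng):
    """Random orthogonal matrix from the QR decomposition of a Gaussian matrix."""
    q, _ = np.linalg.qr(rng.normal(size=(channels, channels)))
    return q

def conv1x1_forward(x, weight):
    """Apply a 1x1 convolution to x of shape (T, C): per-frame channel mixing."""
    _, logabsdet = np.linalg.slogdet(weight)
    # Each of the T frames contributes log|det W| to the log-determinant.
    return x @ weight.T, x.shape[0] * logabsdet

def conv1x1_inverse(y, weight):
    """Undo the channel mixing with the inverse weight matrix."""
    return y @ np.linalg.inv(weight).T

rng = np.random.default_rng(0)
weight = random_orthogonal(8, rng)
x = rng.normal(size=(10, 8))
y, logdet = conv1x1_forward(x, weight)
x_rec = conv1x1_inverse(y, weight)
```

An orthogonal matrix has determinant $\pm 1$, so the log-determinant starts at 0 and the layer initially behaves like a generalized channel permutation.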
Multi-Scale Architecture
- The multi-scale architecture is useful for training deep flow steps
  - Flow-TTS uses a 4-step flow
- After each scale, some channels of the tensor are dropped from the flow steps, and they are concatenated all at once after all flow steps have been applied
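
The split-then-concatenate pattern can be sketched as below. The flow steps are placeholders (a channel reversal), and the number of channels split off per scale (`n_early`) is a hypothetical value; the review does not state how many channels Flow-TTS drops at each scale.

```python
import numpy as np

def multi_scale_forward(x, flow_steps, n_early):
    """Run flow steps, splitting off n_early channels after each scale.

    flow_steps: list of callables, each an invertible map on (T, C) arrays.
    Channels split off early skip the remaining steps and are concatenated
    with the final output at the end.
    """
    early_outputs = []
    h = x
    for i, step in enumerate(flow_steps):
        h = step(h)
        if i < len(flow_steps) - 1:               # keep all channels at the last scale
            early_outputs.append(h[:, :n_early])
            h = h[:, n_early:]
    return np.concatenate(early_outputs + [h], axis=-1)

# Placeholder flow steps (channel reversal); real steps would be conv + coupling.
flow_steps = [lambda h: h[:, ::-1] for _ in range(4)]
x = np.random.default_rng(0).normal(size=(10, 16))
z = multi_scale_forward(x, flow_steps, n_early=2)
```

Later steps operate on fewer channels, which reduces computation while the overall mapping keeps the same dimensionality.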
Coupling Block
- The coupling block plays the role of the $NN()$ transformation
- It consists of a 1D convolution layer with kernel size 1 and Gated Tanh Unit (GTU) layers:
  (Eq. 8) $\mathbf{z} = \tanh(\mathbf{W}_{f,k} * \mathbf{y}) \odot \sigma(\mathbf{W}_{g,k} * \mathbf{c})$
  - $k$ : layer index, $f$ : filter, $g$ : gate, $\mathbf{c}$ : attention context vector
  - $\mathbf{W}$ : 1D convolution layer
- Residual connections are used in the GTU layers to build a deep network
  - A 1D convolution layer with kernel size 1 is then added at the end to match the channel size
  - The weights of this last convolution layer are initialized to 0 so that each affine coupling layer initially acts as an identity function
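
The effect of the zero-initialized last layer can be checked with a small NumPy sketch of the coupling block. Kernel-size-1 convolutions are written as per-frame matrix products; the hidden size, number of GTU layers, omitted biases, and the `params` layout are all assumptions for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def coupling_block(y, c, params):
    """NN() sketch built from kernel-size-1 convolutions and GTU layers.

    y: (T, C_in) input half, c: (T, C_cond) attention context.
    Returns (log_s, t), each of shape (T, C_in).
    """
    h = y @ params["w_in"]                                   # input projection
    for w_f, w_g in params["gtu"]:
        gated = np.tanh(h @ w_f) * sigmoid(c @ w_g)          # Eq. 8
        h = h + gated                                        # residual connection
    out = h @ params["w_out"]                                # zero-initialized layer
    log_s, t = np.split(out, 2, axis=-1)
    return log_s, t

rng = np.random.default_rng(0)
c_in, c_cond, hidden, n_layers = 4, 6, 16, 2                 # toy sizes
params = {
    "w_in": rng.normal(scale=0.1, size=(c_in, hidden)),
    "gtu": [(rng.normal(scale=0.1, size=(hidden, hidden)),
             rng.normal(scale=0.1, size=(c_cond, hidden)))
            for _ in range(n_layers)],
    "w_out": np.zeros((hidden, 2 * c_in)),
}
y = rng.normal(size=(10, c_in))
c = rng.normal(size=(10, c_cond))
log_s, t = coupling_block(y, c, params)
```

With `w_out` at zero, `log_s` and `t` are both zero, so the affine coupling computes $1 \cdot \mathbf{z}_a + 0$ and starts training as the identity.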

4. Experiments

- Settings

Dataset : LJSpeech
Comparisons : FastSpeech, Tacotron2

- Results

In terms of MOS, Flow-TTS shows the best synthesis quality among the compared models

Quantitative evaluation with Mel-Cepstral Distortion (MCD) likewise shows Flow-TTS achieving the best result
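
For reference, MCD is commonly computed per frame as $\frac{10}{\ln 10} \sqrt{2 \sum_d (c_d - \hat{c}_d)^2}$ and averaged over frames. The sketch below uses that standard definition; the paper's exact setup (number of coefficients, frame alignment method) is not given in this review, so the inputs are assumed to be already time-aligned with the 0th coefficient removed.

```python
import numpy as np

def mel_cepstral_distortion(ref, syn):
    """Mean MCD in dB between two time-aligned (T, D) mel-cepstra.

    Assumes frames are already aligned and the 0th (energy) coefficient
    has been removed by the caller.
    """
    diff = ref - syn
    per_frame = (10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum(diff ** 2, axis=-1))
    return float(np.mean(per_frame))

# Toy inputs (hypothetical): 3 frames, 13 coefficients, constant offset of 0.1.
ref = np.zeros((3, 13))
syn = np.full((3, 13), 0.1)
mcd = mel_cepstral_distortion(ref, syn)
```

Lower values mean the synthesized cepstra are closer to the reference.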

Comparing the synthesized mel-spectrograms with the ground-truth, the prosody of speech synthesized by Flow-TTS is similar to that of the ground-truth

In terms of the F0 trajectory, Flow-TTS is also the closest to the ground-truth

In terms of inference speed, Flow-TTS has a latency of 0.021 s versus 0.483 s for Tacotron2, a 23x speedup
- Flow-TTS is also slightly faster than FastSpeech, whose latency is 0.025 s
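
The speedup figures follow directly from the reported latencies; a quick check, using only the three numbers quoted above (the helper `speedup` is a hypothetical name):

```python
# Latencies in seconds, as reported in the review.
latency_s = {"Flow-TTS": 0.021, "FastSpeech": 0.025, "Tacotron2": 0.483}

def speedup(baseline, model):
    """Ratio of baseline latency to model latency."""
    return latency_s[baseline] / latency_s[model]

tacotron2_speedup = speedup("Tacotron2", "Flow-TTS")     # about 23x
fastspeech_speedup = speedup("FastSpeech", "Flow-TTS")   # about 1.19x
```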
