[Paper Review] LEF-TTS: Lightweight and Efficient End-to-End Text-to-Speech Synthesis with Multi-Stream Generator

feVeRin 2025. 4. 18. 17:41

LEF-TTS: Lightweight and Efficient End-to-End Text-to-Speech Synthesis with Multi-Stream Generator

Demand for lightweight, efficient Text-to-Speech models has been growing recently
LEF-TTS
- Builds on EfficientTTS2 and applies Fast Linear Attention with Single Head
- Introduces ConvWaveNet and a multi-stream iSTFT generator to improve inference speed
Paper (ICASSP 2025) : Paper Link

1. Introduction

Compared to two-stage TTS models such as FastSpeech and FastSpeech2, end-to-end Text-to-Speech (TTS) models such as VITS generate high-quality speech by directly modeling the relationship between text and speech
- BUT, end-to-end TTS models have limited inference speed because of their large parameter counts
- Lightweight TTS therefore aims to improve inference efficiency while preserving speech quality
  1. Representative examples such as Nix-TTS, Light-TTS, and SpeedySpeech use Knowledge Distillation to reduce parameters
  2. EfficientSpeech uses a lightweight U-Net, while FLY-TTS uses parameter sharing and a Vocos-like vocoder

-> The paper therefore proposes LEF-TTS, a lightweight TTS model with fewer parameters and faster inference

LEF-TTS
- Builds on the EfficientTTS2 framework and replaces the standard transformer with Fast Linear Attention with Single Head (FLASH)
- Applies ConvWaveNet, based on separable convolution and Global Response Normalization (GRN), to reduce model parameters
- Improves inference speed with a Multi-Stream Decoder that uses ConvNeXt-V2 and iSTFT

< Overview of LEF-TTS >

A lightweight TTS model that builds on EfficientTTS2 and adds FLASH, ConvWaveNet, and a Multi-Stream Decoder
As a result, it achieves faster inference and a smaller parameter count while maintaining the original synthesis quality

2. Method

- Fast Linear Attention

FLASH simplifies self-attention through the Gated Attention Unit (GAU)
- To obtain a layer variant with linear complexity in the context size, the paper approximates the quadratic attention in GAU
- To do so, tokens are grouped into chunks, then exact quadratic attention is applied within each chunk and fast linear attention across chunks
  1. First, GAU unifies attention and the Gated Linear Unit (GLU) into a single layer so that the two share computation as a gated attention mechanism:
    (Eq. 1) $U=\phi_{u}(XW_{u}) \in\mathbb{R}^{T\times e}$
    $\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\, V=\phi_{v}(XW_{v}) \in\mathbb{R}^{T\times e}$
    $\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\, O=(U\odot AV)W_{o} \in\mathbb{R}^{T\times d}$
    - $X\in\mathbb{R}^{T\times d}$ : representation of $T$ tokens
    - $W_{u}\in \mathbb{R}^{d\times e},W_{v}\in\mathbb{R}^{d\times e}, W_{o}\in\mathbb{R}^{e\times d}$
    - $e$ : expanded intermediate size, $d$ : model size, $\phi$ : activation function, $\odot$ : element-wise multiplication
  2. The token-token attention matrix $A$ is then:
    (Eq. 2) $Z=\phi_{z}(XW_{z})\in\mathbb{R}^{T\times s}$
    $\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\, A=\text{relu}^{2}(\mathcal{Q}(Z)\mathcal{K}(Z)^{\top}+b)\in\mathbb{R}^{T\times T}$
    - $Z$ : intermediate shared representation, $W_{z}\in\mathbb{R}^{d\times s}$
    - $\mathcal{Q},\mathcal{K}$ : transformations that apply per-dim scalars and offsets to $Z$
    - $b$ : relative position bias
- Consequently, when $A$ is the identity matrix, GAU reduces to a GLU
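
Eq. 1 and Eq. 2 can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `silu` and `gau`, the choice of SiLU for every $\phi$, the scalar bias `b`, and the toy sizes are all hypothetical, and the chunked linear-attention approximation is left out.

```python
import numpy as np

def silu(x):
    # Assumed activation for phi_u, phi_v, phi_z (the notes only say "activation function").
    return x / (1.0 + np.exp(-x))

def gau(X, W_u, W_v, W_z, W_o, gamma, beta, b=0.0, A=None):
    """Gated Attention Unit for one sequence, following Eq. 1 and Eq. 2.

    X: (T, d), W_u / W_v: (d, e), W_z: (d, s), W_o: (e, d),
    gamma / beta: (2, s) per-dim scales and offsets for Q and K.
    Passing A explicitly overrides the attention matrix of Eq. 2.
    """
    U = silu(X @ W_u)                          # (T, e) gate branch
    V = silu(X @ W_v)                          # (T, e) value branch
    if A is None:
        Z = silu(X @ W_z)                      # (T, s) shared representation
        Q = Z * gamma[0] + beta[0]             # per-dim scale and offset -> queries
        K = Z * gamma[1] + beta[1]             # per-dim scale and offset -> keys
        A = np.maximum(Q @ K.T + b, 0.0) ** 2  # (T, T) squared-ReLU attention
    return (U * (A @ V)) @ W_o                 # (T, d)

# Toy usage with hypothetical sizes.
rng = np.random.default_rng(0)
T, d, e, s = 6, 8, 16, 4
X = rng.standard_normal((T, d))
W_u, W_v = rng.standard_normal((d, e)), rng.standard_normal((d, e))
W_z, W_o = rng.standard_normal((d, s)), rng.standard_normal((e, d))
gamma, beta = np.ones((2, s)), np.zeros((2, s))
O = gau(X, W_u, W_v, W_z, W_o, gamma, beta)
```

Passing `A=np.eye(T)` makes `A @ V` equal to `V`, so the output collapses to `(U * V) @ W_o`, which is the GLU case noted above.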

- ConvWaveNet

The paper replaces the original WaveNet with ConvWaveNet to reduce model parameters
- A ConvWaveNet block consists of a depthwise-separable dilated convolution layer, a tanh activation layer, a sigmoid activation layer, and a GRN layer
  1. A residual connection is added at the final point-wise convolutional layer to produce the block output
  2. The GRN layer performs global feature aggregation and feature calibration, which improves channel contrast and selectivity
- The resulting ConvWaveNet block computes:
  (Eq. 3) $z=\text{GRN}(\tanh (W_{f,k}* \mathbf{x})\odot \sigma(W_{g,k}*\mathbf{x}))$
  - $*$ : convolution operator, $\odot$ : element-wise multiplication, $\sigma$ : sigmoid function
  - $k$ : layer index, $f$ : filter, $g$ : gate, $W$ : learnable convolution filter
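
A rough NumPy sketch of Eq. 3 is given below. The exact layer layout is an assumption made for illustration: each of $W_{f,k}$ and $W_{g,k}$ is modeled as a depthwise dilated convolution followed by a point-wise one, GRN follows the ConvNeXt-V2 formulation applied over the time axis, and the parameter dictionary keys (`dw_f`, `pw_f`, and so on) are hypothetical names.

```python
import numpy as np

def depthwise_dilated_conv1d(x, w, dilation):
    """x: (C, T), w: (C, K) with odd K; one filter per channel, zero 'same' padding."""
    C, T = x.shape
    K = w.shape[1]
    pad = dilation * (K - 1) // 2
    xp = np.pad(x, ((0, 0), (pad, pad)))
    out = np.zeros((C, T))
    for k in range(K):
        out += w[:, k:k + 1] * xp[:, k * dilation:k * dilation + T]
    return out

def pointwise_conv1d(x, w):
    """1x1 convolution across channels: w has shape (C_out, C_in)."""
    return w @ x

def grn(x, gamma, beta, eps=1e-6):
    """Global Response Normalization over time, ConvNeXt-V2 style."""
    gx = np.linalg.norm(x, axis=1, keepdims=True)     # (C, 1) global aggregation
    nx = gx / (gx.mean(axis=0, keepdims=True) + eps)  # normalize across channels
    return gamma * (x * nx) + beta + x                # calibration plus identity path

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def conv_wavenet_block(x, p, dilation):
    """Gated separable convolution with GRN (Eq. 3), then point-wise conv and residual."""
    f = pointwise_conv1d(depthwise_dilated_conv1d(x, p["dw_f"], dilation), p["pw_f"])
    g = pointwise_conv1d(depthwise_dilated_conv1d(x, p["dw_g"], dilation), p["pw_g"])
    z = grn(np.tanh(f) * sigmoid(g), p["gamma"], p["beta"])
    return x + pointwise_conv1d(z, p["pw_out"])

# Toy usage with hypothetical sizes.
rng = np.random.default_rng(0)
C, T, K = 4, 20, 3
x = rng.standard_normal((C, T))
params = {
    "dw_f": rng.standard_normal((C, K)), "pw_f": rng.standard_normal((C, C)),
    "dw_g": rng.standard_normal((C, K)), "pw_g": rng.standard_normal((C, C)),
    "gamma": np.zeros((C, 1)), "beta": np.zeros((C, 1)),
    "pw_out": rng.standard_normal((C, C)),
}
y = conv_wavenet_block(x, params, dilation=2)
```

The parameter saving comes from the separable form: a depthwise filter bank of size `C * K` plus a `C * C` point-wise matrix replaces a dense `C * C * K` kernel.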

- Multi-Stream Decoder with ConvNeXt-V2

EfficientTTS2 uses HiFi-GAN as its decoder, which limits inference efficiency
- Models such as Vocos, on the other hand, achieve fast inference through iSTFT and ConvNeXt-V2 blocks
- The paper therefore introduces a multi-stream decoder based on ConvNeXt-V2
  1. ConvNeXt-V2 serves as the base framework: it generates Fourier time-frequency coefficients at the same temporal resolution, and iSTFT synthesis is performed for each decomposed waveform
  2. The original waveform is then reconstructed through zero-padded upsampling and a trainable convolutional network
- Structurally, a ConvNeXt-V2 block consists of a $7\times 7$ depthwise convolutional layer and two $1\times 1$ point-wise convolutions
  1. The GRN layer is applied after the first point-wise convolution, the bottleneck features are activated with GELU, and the final output is obtained through a residual connection
  2. Because the Fourier transform of a real-valued signal is conjugate symmetric, the number of coefficients per frame is set to $n_{fft}/2+1$
- The paper therefore projects the hidden layer output onto an $n_{fft}+2$ dimensional space and splits the output:
  (Eq. 4) $m,p=h\left[1:(n_{fft}/2+1)\right],h\left[(n_{fft}/2+2):n\right]$
  - $h$ : transformed hidden vector, $n=n_{fft}+2$ : its dimension, $m,p$ : output amplitude and phase of the frame signal, respectively
- The multi-stream process:
  1. Project the hidden vector of the ConvNeXt-V2 block onto several sub-band spaces,
  2. Compute amplitude and phase in each sub-band space and synthesize each sub-band signal through iSTFT,
  3. Merge all signal results to produce the final speech waveform
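
The split in Eq. 4 and the multi-stream merge can be sketched in NumPy as below. Several simplifications are assumed here and are not taken from the paper: the iSTFT uses a rectangular window with plain overlap-add (no synthesis window or normalization), the upsampling factor equals the number of streams, and fixed kernels stand in for the trainable convolution. All function names are hypothetical. Note that Eq. 4 uses 1-based inclusive indices while the code uses 0-based slices.

```python
import numpy as np

def split_mag_phase(h, n_fft):
    """h: (n_fft + 2, frames) -> amplitude m and phase p, each (n_fft // 2 + 1, frames)."""
    bins = n_fft // 2 + 1
    return h[:bins], h[bins:]

def frames_from_mag_phase(m, p, n_fft):
    """Inverse real FFT of each frame; returns (n_fft, frames) time-domain frames."""
    spec = m * np.exp(1j * p)
    return np.fft.irfft(spec, n=n_fft, axis=0)

def overlap_add(frames, hop):
    """Sum frames at hop-sized offsets (rectangular window, no normalization)."""
    n_fft, n_frames = frames.shape
    out = np.zeros(hop * (n_frames - 1) + n_fft)
    for i in range(n_frames):
        out[i * hop:i * hop + n_fft] += frames[:, i]
    return out

def zero_insert_upsample(x, factor):
    """Insert factor - 1 zeros after every sample."""
    up = np.zeros(len(x) * factor)
    up[::factor] = x
    return up

def merge_streams(streams, kernels):
    """Upsample each sub-band stream, filter it, and sum the results."""
    factor = len(streams)  # assumed: upsampling factor equals the stream count
    return sum(np.convolve(zero_insert_upsample(s, factor), k, mode="same")
               for s, k in zip(streams, kernels))

# Toy usage: two streams, hypothetical sizes, random decoder outputs.
rng = np.random.default_rng(0)
n_fft, hop, n_frames = 16, 4, 5
streams = []
for _ in range(2):
    h = rng.standard_normal((n_fft + 2, n_frames))
    m, p = split_mag_phase(h, n_fft)
    streams.append(overlap_add(frames_from_mag_phase(m, p, n_fft), hop))
kernels = [np.array([0.25, 0.5, 0.25]), np.array([-0.25, 0.5, -0.25])]
waveform = merge_streams(streams, kernels)
```

Because each stream is synthesized at a fraction of the output sample rate, the per-stream iSTFT works on shorter signals, which is where the speedup over a fully upsampling decoder would come from.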

3. Experiments

- Settings

Dataset : LJSpeech
Comparisons : VITS, MB-iSTFT-VITS, EfficientTTS2

- Results

LEF-TTS reduces trainable parameters by 33.84% and FLOPs by 89.74% compared to EfficientTTS2
- In terms of RTF, it also achieves a $7.50\times$ speedup on CPU and $3.16\times$ on GPU

It also maintains baseline-level synthesis quality

Ablation Study
- Removing any of the components degrades performance
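
To relate the percentage reductions to the speedup-style figures, the small helper below converts a relative reduction into a "times smaller" factor and states the usual RTF definition (synthesis time divided by audio duration). The function names are hypothetical, and only the percentages quoted above are used as inputs.

```python
def reduction_to_factor(reduction_pct):
    """A reduction of r percent leaves (100 - r) percent, i.e. a 1 / (1 - r/100) factor."""
    return 1.0 / (1.0 - reduction_pct / 100.0)

def rtf(synthesis_seconds, audio_seconds):
    """Real-time factor: values below 1 mean faster than real time."""
    return synthesis_seconds / audio_seconds

param_factor = reduction_to_factor(33.84)  # parameters, relative to EfficientTTS2
flops_factor = reduction_to_factor(89.74)  # FLOPs, relative to EfficientTTS2
```

Under this conversion, the 33.84% parameter reduction corresponds to a model about 1.5 times smaller, and the 89.74% FLOPs reduction to roughly 9.7 times fewer FLOPs.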
