[Paper Review] MB-iSTFT-VITS: Lightweight and High-Fidelity End-to-End Text-to-Speech with Multi-Band Generation and Inverse Short-Time Fourier Transform

feVeRin 2025. 5. 27. 17:49

MB-iSTFT-VITS: Lightweight and High-Fidelity End-to-End Text-to-Speech with Multi-Band Generation and Inverse Short-Time Fourier Transform

A lightweight end-to-end text-to-speech model is needed
MB-iSTFT-VITS
- Replaces the computationally expensive components with a simple inverse Short-Time Fourier Transform
- Generates the waveform through multi-band generation with fixed or trainable synthesis filters
Paper (ICASSP 2023) : Paper Link

1. Introduction

Because of their large parameter counts, Text-to-Speech (TTS) models show slow inference in real-world applications with limited computational resources
- A TTS model therefore has to achieve fast inference while preserving synthesis quality
  - BUT, LightSpeech and DeviceTTS are limited in synthesis quality because the acoustic model and the vocoder are optimized separately
- Alternatively, a lightweight TTS model can be built with end-to-end optimization, as in Nix-TTS and LiteTTS
  - BUT, this requires a complex design such as a teacher-student framework

-> The paper therefore proposes MB-iSTFT-VITS, a high-fidelity, lightweight end-to-end TTS model

MB-iSTFT-VITS
- Replaces part of the VITS decoder with inverse Short-Time Fourier Transform (iSTFT)-based computation
- Additionally combines iSTFT-based sample generation with multi-band processing

< Overview of MB-iSTFT-VITS >

An end-to-end TTS model that uses multi-band generation and iSTFT
As a result, it achieves faster inference than the existing models

2. Analysis on VITS

The paper builds on VITS, an end-to-end TTS model
- VITS is a Variational AutoEncoder (VAE) with a text-conditional prior distribution
- The model is trained to maximize the log-likelihood of the waveform $x$ given the text $c$
  1. BUT, this maximization is intractable, so the Evidence Lower Bound (ELBO) is maximized instead:
    (Eq. 1) $\log p_{\theta}(x|c)\geq \mathbb{E}_{q_{\phi}(z|x)}\left[\log p_{\theta}(x|z)-\log \frac{q_{\phi}(z|x)}{p_{\theta}(z|c)}\right]$
    - $z$ : latent variable of the VAE, $p,q$ : the true distribution and the approximate posterior distribution, respectively
    - $\theta, \phi$ : model parameters of $p,q$, respectively
  2. The loss is defined as the negative ELBO, so the negated first term of (Eq. 1) can be seen as the reconstruction loss of the waveform $x$ given $z$ sampled from the approximate posterior distribution $q_{\phi}(z|x)$
    - The second term is the Kullback-Leibler divergence between the posterior and the prior distribution
- At inference, $z$ is sampled from the prior $p_{\theta}(z|c)$ instead of $q_{\phi}(z|x)$ and then passed to the VAE decoder to generate the waveform
  - The networks that model $p_{\theta}(z|c), q_{\phi}(z|x),p_{\theta}(x|z)$ correspond to the prior encoder, posterior encoder, and decoder, respectively
- As for the inference speed of VITS, the decoder accounts for $96\%$ of the inference time, as reported in the paper's inference-time table
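The negative ELBO of (Eq. 1) can be illustrated with a small NumPy sketch. It assumes diagonal Gaussian posterior and prior distributions (so the KL term has a closed form) and uses an L1 distance as a stand-in for $-\log p_{\theta}(x|z)$. The function names `gaussian_kl` and `negative_elbo` are hypothetical, and the actual VITS objective is more involved than this toy version.

```python
import numpy as np

def gaussian_kl(mu_q, logvar_q, mu_p, logvar_p):
    """KL(q || p) between two diagonal Gaussians, summed over all dimensions."""
    var_q = np.exp(logvar_q)
    var_p = np.exp(logvar_p)
    kl = 0.5 * (logvar_p - logvar_q + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0)
    return float(np.sum(kl))

def negative_elbo(x, x_hat, mu_q, logvar_q, mu_p, logvar_p):
    """Reconstruction term plus KL term, i.e. the negated right-hand side of (Eq. 1)."""
    # Assumption: an L1 distance stands in for -log p(x|z).
    recon = float(np.sum(np.abs(x - x_hat)))
    return recon + gaussian_kl(mu_q, logvar_q, mu_p, logvar_p)

# Unit-variance posterior and prior whose means differ by 1 in a single dimension.
kl = gaussian_kl(np.array([1.0]), np.array([0.0]), np.array([0.0]), np.array([0.0]))
print(kl)  # 0.5
```

The KL term vanishes only when the posterior matches the text-conditional prior, which is what allows $z$ to be drawn from the prior at inference.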

3. Method

- Motivation and Strategy

The decoder module is the largest bottleneck of VITS
- In particular, the decoder architecture is based on the HiFi-GAN vocoder, which upsamples the input acoustic features with repeated convolution-based networks, so the goal is to reduce the redundancy of this module
- To this end, the paper follows iSTFTNet and replaces the output-side layers with a simple iSTFT to reduce the computational cost
  - That is, part of the neural vocoding process on the mel-spectrogram is replaced with an iSTFT, which performs phase reconstruction and frequency-to-time conversion simultaneously
- To further improve generation speed, the paper combines the iSTFT-based approach with a multi-band parallel strategy

- Multi-Band iSTFT VITS

The multi-band parallel strategy generates all sub-band signals with a single shared network, which reduces computational cost while maintaining synthesis quality
- The decoder performs the following process sequentially:
  1. First, the VAE latent $z$ is upsampled by a factor of $s$ through convolutional residual blocks (ResBlocks) and projected to the magnitude and phase of each of the $N$ sub-band signals
    - $s$ : upsampling scale parameter
  2. The iSTFT operation is then applied to the magnitude and phase variables to generate each sub-band signal
  3. The sub-band signals are upsampled to match the sampling rate of the original signal and integrated into a full-band waveform with a fixed synthesis filter bank
    - A pseudo-quadrature mirror filter bank (Pseudo-QMF) is used as the synthesis filter
- During training, the reconstruction loss of VITS is modified to include an additional multi-resolution STFT loss at the sub-band scale
  - Here, a pseudo-QMF-based analysis filter is applied to the input waveform to produce the ground-truth sub-band signals needed to compute the sub-band STFT loss
- The resulting Multi-Band iSTFT VITS (MB-iSTFT-VITS) is optimized in a fully end-to-end manner and can therefore achieve better audio quality
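The decoder steps above (iSTFT per sub-band, upsampling, synthesis filtering) can be sketched in NumPy as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: random magnitude and phase arrays stand in for the network outputs, the function names are hypothetical, and the filter length, cutoff, and Kaiser beta are example values for a cosine-modulated prototype filter.

```python
import numpy as np

def istft(mag, phase, n_fft, hop):
    """Overlap-add inverse STFT from magnitude and phase of shape (n_fft//2+1, frames)."""
    spec = mag * np.exp(1j * phase)
    frames = np.fft.irfft(spec, n=n_fft, axis=0)      # (n_fft, n_frames)
    window = np.hanning(n_fft)
    n_frames = frames.shape[1]
    out = np.zeros(n_fft + hop * (n_frames - 1))
    norm = np.zeros_like(out)
    for t in range(n_frames):
        start = t * hop
        out[start:start + n_fft] += frames[:, t] * window
        norm[start:start + n_fft] += window ** 2
    return out / np.maximum(norm, 1e-8)               # undo the window overlap

def pqmf_synthesis_filters(n_bands, taps=62, cutoff=0.071, beta=9.0):
    """Cosine-modulated synthesis filters built from a Kaiser-windowed sinc prototype."""
    n = np.arange(taps + 1) - taps / 2
    proto = cutoff * np.sinc(cutoff * n) * np.kaiser(taps + 1, beta)
    k = np.arange(n_bands)[:, None]
    angle = (2 * k + 1) * np.pi / (2 * n_bands) * n[None, :] - (-1.0) ** k * np.pi / 4
    return 2 * proto[None, :] * np.cos(angle)         # (n_bands, taps + 1)

def merge_subbands(subbands, filters):
    """Zero-insert upsample each sub-band by N, filter it, and sum into one waveform."""
    n_bands, length = subbands.shape
    full = np.zeros(length * n_bands)
    for k in range(n_bands):
        up = np.zeros(length * n_bands)
        up[::n_bands] = subbands[k] * n_bands          # gain compensates zero insertion
        full += np.convolve(up, filters[k], mode="same")
    return full

# Illustrative sizes only: N = 4 sub-bands, 16-point FFT, hop 4, 8 frames.
rng = np.random.default_rng(0)
n_bands, n_fft, hop, n_frames = 4, 16, 4, 8
mag = np.abs(rng.normal(size=(n_bands, n_fft // 2 + 1, n_frames)))
phase = rng.uniform(-np.pi, np.pi, size=mag.shape)
subbands = np.stack([istft(mag[k], phase[k], n_fft, hop) for k in range(n_bands)])
waveform = merge_subbands(subbands, pqmf_synthesis_filters(n_bands))
print(subbands.shape, waveform.shape)
```

Each sub-band runs at $1/N$ of the output sampling rate, so the network and the iSTFT only have to produce $1/N$ as many samples per band; the final filtering step is a cheap fixed convolution.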

- Multi-Stream iSTFT VITS

The multi-band structure enables fast inference, but the fixed decomposition into sub-band signals is an inflexible constraint that can adversely affect waveform generation
- The paper therefore considers a trainable synthesis filter within the multi-band structure
  - This lets the model decompose the speech waveform in a data-driven manner, which can improve synthesis quality
- As a result, in Multi-Stream iSTFT-VITS (MS-iSTFT-VITS) the decomposed waveforms are not restricted to fixed sub-band signals and are fully trainable, so unlike MB-iSTFT-VITS no sub-band STFT loss is needed
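To make the idea of a trainable synthesis filter concrete, the toy sketch below treats the per-stream filter kernels as free parameters and takes one gradient step on a least-squares objective. Everything here is an assumption made for illustration: the function names, the sizes, the random target, and the squared-error loss are hypothetical, and the actual model would learn the filter through backpropagation with its full training objective.

```python
import numpy as np

def zero_insert(streams, factor):
    """Upsample each stream by inserting factor - 1 zeros between samples."""
    n_streams, length = streams.shape
    up = np.zeros((n_streams, length * factor))
    up[:, ::factor] = streams
    return up

def synthesize(up, kernels):
    """Filter every upsampled stream with its own kernel and sum the results."""
    return sum(np.convolve(u, w, mode="full") for u, w in zip(up, kernels))

def loss_and_grad(up, kernels, target):
    """Squared error to the target and its gradient with respect to the kernels."""
    err = synthesize(up, kernels) - target
    # d y[n] / d w_k[j] = up_k[n - j], so the gradient is a cross-correlation.
    grad = np.stack([np.correlate(err, u, mode="valid") for u in up])
    return 0.5 * float(np.sum(err ** 2)), grad

rng = np.random.default_rng(0)
n_streams, length, taps = 4, 32, 15
up = zero_insert(rng.normal(size=(n_streams, length)), n_streams)
target = rng.normal(size=up.shape[1] + taps - 1)     # toy "ground-truth" waveform
kernels = 0.01 * rng.normal(size=(n_streams, taps))  # trainable, randomly initialised

loss_before, grad = loss_and_grad(up, kernels, target)
step = 1.0 / (taps * np.sum(up ** 2))                # conservative step size
kernels = kernels - step * grad
loss_after, _ = loss_and_grad(up, kernels, target)
print(loss_after < loss_before)
```

Because the kernels are learned from the waveform-level loss alone, nothing forces each stream to be a specific frequency band, which is consistent with dropping the sub-band STFT loss.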

4. Experiments

- Settings

Dataset : LJSpeech
Comparisons : VITS, Nix-TTS

- Results

MB-iSTFT-VITS achieves a $1.8\times$ speed-up while maintaining a MOS on par with the original VITS
