[Paper Review] SlimSpeech: Lightweight and Efficient Text-to-Speech with Slim Rectified Flow

feVeRin 2025. 3. 25. 20:49

SlimSpeech: Lightweight and Efficient Text-to-Speech with Slim Rectified Flow

Flow matching-based speech synthesis models can improve speech quality while reducing the number of inference steps
SlimSpeech
- Reduces the parameter count of a rectified flow model and uses it as the teacher model
- Refines the reflow operation to directly derive a smaller model with a straight sampling trajectory, and improves performance further through a distillation method
Paper (ICASSP 2025) : Paper Link

1. Introduction

Most Text-to-Speech (TTS) systems follow a 2-stage generation approach: an acoustic model converts text into acoustic features, and a vocoder then generates the speech waveform
- In such TTS models, the quality of the synthesized speech is largely determined by the acoustic model
  1. In particular, a Diffusion Probabilistic Model (DPM) can generate high-quality acoustic features
  2. BUT, a DPM needs a large number of sampling steps to generate high-quality samples
    - Prior work such as ProDiff, LightGrad, DiffGAN-TTS, and CoMoSpeech therefore focuses on reducing the number of sampling steps
- Alternatively, one can consider Flow Matching, which directly learns an Ordinary Differential Equation (ODE) from the standard Gaussian distribution to the real data distribution
  1. Notably, VoiceBox, Matcha-TTS, and ReFlow-TTS use flow matching to generate high-quality speech in fewer steps than DPMs
  2. BUT, these flow matching models remain limited in terms of model parameter size

-> The paper therefore proposes SlimSpeech, which uses the rectified flow framework to reduce both the parameter size and the number of inference steps of a TTS model

SlimSpeech
- Introduces Annealing Reflow, which straightens the sampling trajectory under a changed parameter count and thereby improves sampling efficiency
- Additionally integrates a Flow-Guided Distillation method to improve sample quality, and applies depthwise separable convolution to the encoder to minimize the parameter count

< Overview of SlimSpeech >

A TTS model that uses rectified flow and flow-guided distillation
As a result, it achieves better synthesis performance than prior models with fewer sampling steps

2. Background

Generative modeling aims to discover a mapping from a prior distribution to the data distribution
- A rectified flow model uses an Ordinary Differential Equation (ODE) to build a continuous dynamical system that generates the desired data distribution along a straight path
  - As a result, when the path is straight, a single computation step is in principle enough to obtain a high-quality result
- Given an initial prior distribution $\pi_{1}$ and a target data distribution $\pi_{0}$, the following ODE is obtained:
  (Eq. 1) $d\mathbf{x}_{t}=v_{\theta}(\mathbf{x}_{t},t)dt$
  - $t\in[0,1]$, $v_{\theta}$ : vector field
- Rectified flow trains the vector field, parameterized by a neural network $\theta$, with the following objective:
  (Eq. 2) $\mathcal{L}_{rf}(\theta)=\mathbb{E}_{\mathbf{x}_{1}\sim\pi_{1},\mathbf{x}_{0}\sim\pi_{0}}\left[\int_{0}^{1}||v_{\theta}(\mathbf{x}_{t},t)-(\mathbf{x}_{1}-\mathbf{x}_{0})||^{2}dt\right]$
  - $\mathbf{x}_{t}=t\mathbf{x}_{1}+(1-t)\mathbf{x}_{0}$
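The objective in (Eq. 2) can be estimated by Monte Carlo sampling. The following is only a minimal sketch on plain Python lists, not the paper's training code: the function names (`interpolate`, `rf_loss`, `mc_rf_loss`) and the uniform sampling of $t$ are assumptions made for illustration, and `v` stands in for the neural vector field $v_{\theta}$.

```python
import random

def interpolate(x0, x1, t):
    """Linear interpolation x_t = t * x1 + (1 - t) * x0 (x1 is noise, x0 is data)."""
    return [t * a + (1.0 - t) * b for a, b in zip(x1, x0)]

def rf_loss(v, x0, x1, t):
    """Single-sample squared error || v(x_t, t) - (x1 - x0) ||^2 from Eq. 2."""
    xt = interpolate(x0, x1, t)
    target = [a - b for a, b in zip(x1, x0)]  # straight-line velocity x1 - x0
    pred = v(xt, t)
    return sum((p - g) ** 2 for p, g in zip(pred, target))

def mc_rf_loss(v, data, n_samples=1000, seed=0):
    """Monte Carlo estimate of Eq. 2 with x0 ~ data, x1 ~ N(0, I), t ~ U(0, 1).

    Uniform sampling of t is an assumption of this sketch.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        x0 = rng.choice(data)
        x1 = [rng.gauss(0.0, 1.0) for _ in x0]
        t = rng.random()
        total += rf_loss(v, x0, x1, t)
    return total / n_samples
```

When the dataset contains a single point, the field `v(xt, t) = (xt - x0) / t` reproduces the target $\mathbf{x}_{1}-\mathbf{x}_{0}$ exactly, so the estimated loss is close to zero. With more than one data point the trajectories of different pairs cross and no field can reach zero loss.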

- Reflow

The trajectory of an ODE model can be curved, so it has to be straightened to obtain a direct probabilistic flow and one-step generation
- Rectified flow therefore introduces the reflow method:
  (Eq. 3) $\mathcal{L}_{Reflow}(\phi)=\mathbb{E}_{\mathbf{x}_{1}\sim\pi_{1}}\left[\int_{0}^{1}||v_{\phi}(\mathbf{x}_{t},t)-(\mathbf{x}_{1}-\hat{\mathbf{x}}_{0})||^{2}dt\right]$
  - $\hat{\mathbf{x}}_{0}$ : data generated from the initial noise $\mathbf{x}_{1}$ by a pre-trained probabilistic flow model $v_{\theta}$ using the ODE of (Eq. 1)
  - $\mathbf{x}_{t}=t\mathbf{x}_{1}+(1-t)\hat{\mathbf{x}}_{0}$
- Training on data obtained from the ODE trajectory of $v_{\theta}$ (1-rectified flow) yields $v_{\phi}$ (2-rectified flow) with a straighter ODE trajectory
  - This improves sampling efficiency
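The data pairs $(\mathbf{x}_{1},\hat{\mathbf{x}}_{0})$ used by reflow can be produced by numerically integrating (Eq. 1) from $t=1$ down to $t=0$. The sketch below assumes a plain Euler solver with a fixed number of steps; the solver choice, the step count, and the names `euler_sample` and `make_reflow_pairs` are illustrative and not taken from the paper.

```python
import random

def euler_sample(v, x1, n_steps=10):
    """Integrate dx_t = v(x_t, t) dt from t = 1 (noise) down to t = 0 (data).

    Time runs backwards, so each Euler step subtracts dt * v(x_t, t).
    """
    x = list(x1)
    dt = 1.0 / n_steps
    for i in range(n_steps):
        t = 1.0 - i * dt
        vel = v(x, t)
        x = [xi - dt * vi for xi, vi in zip(x, vel)]
    return x

def make_reflow_pairs(v_teacher, dim, n_pairs, n_steps=10, seed=0):
    """Build (x1, x0_hat) pairs by pushing Gaussian noise through the teacher ODE."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n_pairs):
        x1 = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        pairs.append((x1, euler_sample(v_teacher, x1, n_steps)))
    return pairs
```

If the field is constant along a trajectory, a single Euler step (`n_steps=1`) already lands on the same point as many small steps, which is the property reflow tries to encourage.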

- Distillation

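To improve the performance of one-step generation,
- Distillation can be used within the rectified flow framework:
  (Eq. 4) $\mathcal{L}_{Distill}(\phi')=\mathbb{E}_{\mathbf{x}_{1}\sim\pi_{1}}\left[\mathbb{D}(\text{ODE}[v_{\phi}](\mathbf{x}_{1}),\mathbf{x}_{1}-v_{\phi'}(\mathbf{x}_{1},1))\right]$
  - $\mathbb{D}(\cdot, \cdot)$ : function measuring the difference between its two arguments
  - $\mathbf{x}_{1}-v_{\phi'}(\mathbf{x}_{1},1)$ : one-step output of the distilled model $v_{\phi'}$
- In particular, the two can be combined: reflow first yields a direct probabilistic flow model that generates better data pairs, which are then used for distillation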
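As a short worked step, assuming the convention used in this post (noise $\mathbf{x}_{1}$ at $t=1$, data at $t=0$), integrating (Eq. 1) backwards in time shows why a straight trajectory makes the one-step output $\mathbf{x}_{1}-v(\mathbf{x}_{1},1)$ exact:

```latex
\begin{aligned}
\hat{\mathbf{x}}_{0}
  &= \mathbf{x}_{1}+\int_{1}^{0}v(\mathbf{x}_{t},t)\,dt
   = \mathbf{x}_{1}-\int_{0}^{1}v(\mathbf{x}_{t},t)\,dt \\
  &= \mathbf{x}_{1}-v(\mathbf{x}_{1},1)
   \qquad \text{if } v(\mathbf{x}_{t},t)=v(\mathbf{x}_{1},1) \text{ along the whole trajectory}
\end{aligned}
```

When the trajectory is only approximately straight, the last equality holds only approximately, and distillation trains the one-step model to close the remaining gap.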

3. Method

- Rectified Flow based Teacher Model

The paper first trains a large teacher model based on a rectified flow model, i.e. a 1-rectified flow
- Structurally, it uses a parameter-reduced version of ReFlow-TTS, consisting of a text encoder, duration predictor, length regulator, and rectified flow decoder
  1. The duration predictor and length regulator follow FastSpeech2
  2. The text encoder uses depthwise-separable convolution with a channel dimension of 224 to keep it lightweight
  3. The rectified flow decoder follows a DiffWave-like architecture of 20 stacked residual blocks with a channel dimension of 256
    - Sinusoidal positional embedding is used to obtain the step embedding
- With $\pi_{1}$ the standard Gaussian distribution and $\pi_{0}$ the true distribution of mel-spectrograms, the training loss of the teacher model is:
  (Eq. 5) $\mathcal{L}_{rf}(\theta)=\mathbb{E}_{\mathbf{x}_{1}\sim\pi_{1},\mathbf{x}_{0}\sim\pi_{0}}\left[\int_{0}^{1}||v_{\theta}(\mathbf{x}_{t},t,c)-(\mathbf{x}_{1}-\mathbf{x}_{0})||^{2}dt\right]$
  (Eq. 6) $\mathcal{L}_{all}(\theta)=\mathcal{L}_{rf}(\theta)+\mathcal{L}_{dur}(\theta)$
  - $c$ : text embedding, $\mathcal{L}_{dur}$ : duration loss
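To give a rough sense of why depthwise-separable convolution reduces the encoder size, the following sketch counts the parameters of a 1-D convolution layer in both forms. The channel dimension 224 comes from the text above; the kernel size of 9 and the presence of bias terms are assumptions made only for this arithmetic, and the function names are hypothetical.

```python
def conv1d_params(c_in, c_out, kernel_size, bias=True):
    """Parameter count of a standard 1-D convolution."""
    return c_in * c_out * kernel_size + (c_out if bias else 0)

def depthwise_separable_conv1d_params(c_in, c_out, kernel_size, bias=True):
    """Parameter count of a depthwise conv (one filter per channel)
    followed by a pointwise (kernel size 1) conv."""
    depthwise = c_in * kernel_size + (c_in if bias else 0)
    pointwise = c_in * c_out + (c_out if bias else 0)
    return depthwise + pointwise

# Channel dimension from the text; kernel size 9 is an assumed value.
standard = conv1d_params(224, 224, 9)
separable = depthwise_separable_conv1d_params(224, 224, 9)
print(standard, separable, round(standard / separable, 2))
```

Under these assumed settings a single layer shrinks from 451,808 to 52,640 parameters, roughly an 8.6x reduction. The actual saving in SlimSpeech depends on the layer configuration used in the paper.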

- SlimFlow for TTS

The paper uses SlimFlow, built on annealing reflow and flow-guided distillation, to train a one-step text-to-speech student model
- Rather than training the entire model, all modules except the decoder are kept from the teacher model with their parameters frozen, and only a decoder with fewer parameters is trained directly
Annealing Reflow
- The reflow stage trains a probabilistic flow with a straighter sampling trajectory, which reduces the number of sampling steps and improves efficiency
  - BUT, it does not take the number of model parameters into account
- The paper therefore uses Annealing Reflow to directly train a smaller student model with a straighter trajectory, which resolves the parameter mismatch between teacher and student at initialization (a smaller student cannot simply be initialized from the teacher's weights)
  1. That is, training transitions smoothly from 1-rectified flow training to 2-rectified flow training, which accelerates the training process
  2. The objective of annealing reflow is then:
    (Eq. 7) $\mathcal{L}_{a\text{-reflow}}^{k}(\phi)=\mathbb{E}_{\mathbf{x}_{1},\mathbf{x}_{1}^{'}\sim\pi_{1}} \left[\int_{0}^{1}||v_{\phi}(\mathbf{x}_{t}^{\beta(k)},t,c)-(\mathbf{x}_{1}^{\beta(k)}-\hat{\mathbf{x}}_{0})||_{2}^{2}dt\right]$
    - $\mathbf{x}_{t}^{\beta(k)}=(1-t)\hat{\mathbf{x}}_{0}+t\mathbf{x}_{1}^{\beta(k)}$
    - $\mathbf{x}_{1}^{\beta(k)}=\sqrt{1-\beta^{2}(k)}\mathbf{x}_{1}+\beta(k)\mathbf{x}'_{1}$
    - $\hat{\mathbf{x}}_{0}=\text{ODE}[v_{\theta}](\mathbf{x}_{1})=\mathbf{x}_{1}+\int_{1}^{0}v_{\theta}(\mathbf{x}_{t},t,c)dt$
    - $k$ : training iteration, $(\mathbf{x}_{1},\hat{\mathbf{x}}_{0},c)$ : data pair generated by the pre-trained teacher model
  3. $\beta(k)$ is defined as:
    (Eq. 8) $\beta(k)=1-\min(1,k/K_{a\text{-step}})$
    - $K_{a\text{-step}}$ : constant; $\beta(k)$ decays linearly from 1 to 0 over the first $K_{a\text{-step}}$ iterations
- As training progresses, the training data gradually shifts from random data pairs to data pairs generated by the pre-trained 1-rectified flow model
  - This provides a proper initialization for the student model and directly yields a smaller 2-rectified flow model
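The annealing schedule of (Eq. 8) and the noise mixing used in (Eq. 7) can be sketched as follows. This is a simplified illustration, not the authors' implementation: the text conditioning $c$ is omitted, $t$ is sampled uniformly, and the names `beta`, `mix_noise`, and `annealing_reflow_loss` are hypothetical.

```python
import math
import random

def beta(k, k_a_step):
    """Eq. 8: beta(k) = 1 - min(1, k / K_a-step)."""
    return 1.0 - min(1.0, k / k_a_step)

def mix_noise(x1, x1_prime, b):
    """x_1^beta = sqrt(1 - beta^2) * x1 + beta * x1'.

    The coefficients satisfy (1 - beta^2) + beta^2 = 1, so the mixture of two
    independent standard Gaussians is again a standard Gaussian.
    """
    a = math.sqrt(1.0 - b * b)
    return [a * u + b * w for u, w in zip(x1, x1_prime)]

def annealing_reflow_loss(v_student, x1, x0_hat, k, k_a_step, rng):
    """Single-sample version of Eq. 7 (conditioning c omitted in this sketch)."""
    b = beta(k, k_a_step)
    x1_prime = [rng.gauss(0.0, 1.0) for _ in x1]   # fresh, independent noise
    x1_b = mix_noise(x1, x1_prime, b)
    t = rng.random()
    xt = [(1.0 - t) * d + t * n for d, n in zip(x0_hat, x1_b)]
    target = [n - d for n, d in zip(x1_b, x0_hat)]
    pred = v_student(xt, t)
    return sum((p - g) ** 2 for p, g in zip(pred, target))
```

At $k=0$ the mixed noise equals the fresh sample $\mathbf{x}'_{1}$, so the pairing with $\hat{\mathbf{x}}_{0}$ is random as in 1-rectified flow training. Once $k\geq K_{a\text{-step}}$ the mixed noise equals $\mathbf{x}_{1}$, which recovers the reflow pairing of (Eq. 3).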
Flow-Guided Distillation
- With naive distillation, the limited capacity of the student model can lead to suboptimal results
- Flow-Guided Distillation is therefore introduced to improve the one-step generation capability of the student model while keeping the dataset size unchanged
  1. Besides direct distillation, it uses few-step generation from the 2-rectified flow as an additional regularization term
  2. That is, the following two-step generation distillation loss is used:
    (Eq. 9) $\mathcal{L}_{2\text{-step}}(\phi')=\mathbb{E}_{\mathbf{x}_{1}\sim\pi_{1}}\left[\int_{0}^{1}\mathbb{D}(\mathbf{x}_{1}-(1-t)v_{\phi}(\mathbf{x}_{1},1,c)-tv_{\phi}(\mathbf{x}_{t},t,c),\mathbf{x}_{1}-v_{\phi'}(\mathbf{x}_{1},1,c))dt\right]$
    - $\mathbb{D}$ : $L2$ loss
    - $\mathbf{x}_{t}=\mathbf{x}_{1}-(1-t)v_{\phi}(\mathbf{x}_{1},1,c)$ : intermediate point after the first step from $1$ to $t$, so the first argument is a two-step Euler solution $1\rightarrow t\rightarrow 0$
  3. The total loss is then:
    (Eq. 10) $\mathcal{L}_{FG\text{-}Distill}=\mathcal{L}_{Distill}(\phi')+\mathcal{L}_{2\text{-step}}(\phi')$
    (Eq. 11) $\mathcal{L}_{Distill}(\phi')=\mathbb{E}_{\mathbf{x}_{1}\sim\pi_{1}}\left[||\text{ODE}[v_{\phi}](\mathbf{x}_{1},c)-(\mathbf{x}_{1}-v_{\phi'}(\mathbf{x}_{1},1,c))||^{2}\right]$
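The two terms of the flow-guided distillation loss can be sketched for a single noise sample as below. The sketch assumes that the one-step student output is $\mathbf{x}_{1}-v_{\phi'}(\mathbf{x}_{1},1,c)$, as in the second argument of (Eq. 9), that the intermediate point is $\mathbf{x}_{t}=\mathbf{x}_{1}-(1-t)v_{\phi}(\mathbf{x}_{1},1,c)$, and that `x0_hat` has been precomputed as $\text{ODE}[v_{\phi}](\mathbf{x}_{1},c)$. The conditioning $c$ is omitted and all function names are hypothetical.

```python
def sub(a, b):
    """Elementwise a - b."""
    return [x - y for x, y in zip(a, b)]

def scale(s, a):
    """Elementwise s * a."""
    return [s * x for x in a]

def sq_dist(a, b):
    """Squared L2 distance, used here as the difference function D."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def one_step(v, x1):
    """One-step generation x_1 - v(x_1, 1)."""
    return sub(x1, v(x1, 1.0))

def two_step(v, x1, t):
    """Two Euler steps 1 -> t -> 0: x_1 - (1 - t) v(x_1, 1) - t v(x_t, t)."""
    xt = sub(x1, scale(1.0 - t, v(x1, 1.0)))   # first step, from 1 to t
    return sub(xt, scale(t, v(xt, t)))          # second step, from t to 0

def fg_distill_loss(v_teacher2, v_student, x1, x0_hat, t):
    """Single-sample L_Distill + L_2-step for one noise x1 and one time t."""
    student_out = one_step(v_student, x1)
    l_distill = sq_dist(x0_hat, student_out)
    l_2step = sq_dist(two_step(v_teacher2, x1, t), student_out)
    return l_distill + l_2step
```

The two-step target is computed on the fly from the 2-rectified flow for a randomly drawn $t$, which is one way to read how the extra supervision is obtained without enlarging the stored dataset.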

4. Experiments

- Settings

Dataset : LJSpeech
Comparisons : Grad-TTS, Matcha-TTS, ReFlow-TTS, FastSpeech2

- Results

Overall, SlimSpeech achieves the best performance among the compared models

Ablation Study
- Removing annealing reflow and flow-guided distillation degrades performance
