[Paper Review] Int-MeanFlow: Few-Step Speech Generation with Integral Velocity Distillation

feVeRin 2026. 5. 15. 12:50

Int-MeanFlow: Few-Step Speech Generation with Integral Velocity Distillation

Flow-based models are limited in inference speed because of iterative sampling
Int-MeanFlow
- Approximates the average velocity over a temporal interval using the teacher's instantaneous velocity
- Additionally introduces Optimal Step Sampling Search to identify model-specific optimal sampling steps
Paper (ICASSP 2026) : Paper Link

1. Introduction

In Text-to-Speech (TTS), flow-based models are limited in inference speed because of iterative sampling
- MeanFlow can reduce the Number of Function Evaluations (NFE) while improving sampling quality
- BUT, applying MeanFlow to TTS requires considering the following:
  1. MeanFlow's training process relies on a self-bootstrap mechanism and needs instantaneous velocity guidance mixing similar to flow matching
    - In particular, the guidance strength strongly affects model performance
  2. MeanFlow uses the Jacobian-Vector Product, which consumes substantial GPU memory
    - That is, memory limits make it hard to train large-scale TTS models

-> Therefore, Int-MeanFlow is proposed to address these limitations of MeanFlow for the TTS task

Int-MeanFlow
- Built on the MeanFlow framework, it learns the averaged velocity instead of the instantaneous velocity and introduces Optimal Step Sampling Search (OS3) to identify model-specific near-optimal sampling steps
- Additionally, it provides an initialization strategy that can be applied to pre-trained flow matching models

< Overview of Int-MeanFlow >

A MeanFlow-based TTS framework built on averaged velocity learning and the OS3 algorithm
As a result, it achieves better performance than existing methods

2. Method

- Int-MeanFlow: MeanFlow Distillation via Integral Velocity

Int-MeanFlow aims to learn the averaged velocity over a time interval instead of the instantaneous velocity at individual time steps
- To retain the coarse-to-fine nature of MeanFlow while capturing fine-grained detail during training, it emphasizes smaller intervals and gradually learns broader temporal dynamics
- During distillation, the student model is guided by a flow matching teacher model
  1. The teacher model transforms the initial distribution $p_{0}$ into the target distribution $p_{1}$ through a time-dependent vector field $v(z_{t},t;\theta)$
  2. The state evolution $z_{t}$ is governed by an Ordinary Differential Equation (ODE):
    (Eq. 1) $ \frac{d}{dt}z_{t}=v(z_{t},t;\theta),\,\,z_{0}\sim p_{0}, \,\, z_{1}\sim p_{1},\,\, t\in[0,1]$
  3. The teacher's loss function is:
    (Eq. 2) $\mathcal{L}_{CFM}=\mathbb{E}_{t,p_{0}(z_{0}),q(z_{1})}\left[\left|\left| v(z_{t},t;\theta)-(z_{1}-z_{0})\right|\right|^{2}\right]$
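
As a rough illustration of Eq. 2, the sketch below evaluates the CFM regression error for a single pair $(z_{0}, z_{1})$. It assumes the linear path $z_{t}=(1-t)z_{0}+tz_{1}$, which matches the target $z_{1}-z_{0}$ but is not stated in this review. `toy_velocity` is a hypothetical stand-in for the teacher network, not the paper's model.

```python
import random

def interpolate(z0, z1, t):
    # Assumed linear path: z_t = (1 - t) * z_0 + t * z_1
    return [(1.0 - t) * a + t * b for a, b in zip(z0, z1)]

def cfm_loss(velocity_fn, z0, z1, t):
    # Squared error between the predicted velocity and the target z_1 - z_0 (Eq. 2)
    z_t = interpolate(z0, z1, t)
    pred = velocity_fn(z_t, t)
    target = [b - a for a, b in zip(z0, z1)]
    return sum((p - q) ** 2 for p, q in zip(pred, target))

def toy_velocity(z_t, t):
    # Hypothetical stand-in for the teacher: always predicts zero velocity
    return [0.0 for _ in z_t]

random.seed(0)
z0 = [random.gauss(0.0, 1.0) for _ in range(4)]  # sample from p_0
z1 = [1.0, 2.0, 3.0, 4.0]                        # toy "data" sample
loss = cfm_loss(toy_velocity, z0, z1, t=0.3)
```

In actual training the expectation in Eq. 2 would be estimated by averaging this quantity over random $t$, $z_{0}$, and $z_{1}$.
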
- While the teacher learns to model the instantaneous velocity $v(z_{t},t;\theta)$, the student model learns the averaged velocity over the time interval $[t,r]$:
  (Eq. 3) $\bar{v}(z_{t},t,r)=\frac{z_{r}-z_{t}}{r-t}$
  - $z_{t},z_{r}$ : states at times $t, r$; $z_{r}$ is computed iteratively at inference
- The student is trained to approximate the averaged velocity using the teacher's instantaneous velocity
  1. To this end, the paper performs iterative sampling during distillation
  2. First, the interval $[t,r]$ is discretized into $n$ sub-intervals with time steps $t_{0}=t,t_{1},...,t_{n}=r$
  3. At each step, the teacher evolves the state following a discrete approximation of the ODE:
    (Eq. 4) $z_{t_{k+1}}=z_{t_{k}}+(t_{k+1}-t_{k})\cdot v(z_{t_{k}},t_{k};\theta)$
    - $t_{0}=t, t_{n}=r$, $t_{1},t_{2},...,t_{n-1}$ : intermediate time steps
  4. The total displacement over the interval $[t,r]$ is:
    (Eq. 5) $\Delta z^{teacher}=\sum_{k=0}^{n-1}(z_{t_{k+1}}-z_{t_{k}}) =\sum_{k=0}^{n-1}(t_{k+1}-t_{k})\cdot v(z_{t_{k}},t_{k};\theta)$
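
The Euler rollout in Eq. 4 and the displacement in Eq. 5 can be sketched as follows. The uniform spacing of the sub-intervals and the toy field `toy_teacher` (with $dz/dt=z$, chosen only because it has a closed form) are assumptions for illustration; the review does not specify how the paper spaces $t_{k}$.

```python
import math

def teacher_displacement(velocity_fn, z_t, t, r, n):
    """Euler rollout (Eq. 4) on n uniform sub-intervals; returns (z_r, delta_z) as in Eq. 5."""
    z = list(z_t)
    delta = [0.0] * len(z)
    h = (r - t) / n                       # assumed uniform step size t_{k+1} - t_k
    for k in range(n):
        t_k = t + k * h
        v = velocity_fn(z, t_k)           # teacher's instantaneous velocity at (z_{t_k}, t_k)
        step = [h * v_i for v_i in v]
        z = [z_i + s_i for z_i, s_i in zip(z, step)]
        delta = [d_i + s_i for d_i, s_i in zip(delta, step)]
    return z, delta

def toy_teacher(z, t):
    # Hypothetical field dz/dt = z, so the exact solution is z_r = z_t * exp(r - t)
    return list(z)

z_start = [1.0, -2.0]
z_r, delta_z = teacher_displacement(toy_teacher, z_start, t=0.2, r=0.7, n=1000)
exact = [x * (math.exp(0.5) - 1.0) for x in z_start]  # exact displacement for comparison
```

With larger `n` the discrete displacement `delta_z` moves closer to `exact`, at the cost of `n` teacher evaluations per training sample.
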
- This discrete displacement approximates the integral of the instantaneous velocity $v(z_{t},t;\theta)$ over $[t,r]$, the continuous process that the student models
  1. To approximate the averaged velocity, the displacement is normalized by the interval length:
    (Eq. 6) $\bar{v}_{teacher}(z_{t},t,r)=\frac{\Delta z^{teacher}}{r-t}$
  2. The continuous form of the averaged velocity is:
    (Eq. 7) $\bar{v}(z_{t},t,r)=\frac{1}{r-t}\int_{t}^{r}v(z_{\tau},\tau;\theta)d\tau$
  3. As a result, the teacher's discrete displacement serves as a numerical approximation of this integral, and the student model is trained to minimize the distillation loss:
    (Eq. 8) $\mathcal{L}_{distill}=\mathbb{E}_{t,r}\left[\left|\left| u_{student}(z_{t},t,r)-\bar{v}_{teacher}(z_{t},t,r)\right|\right|^{2}\right]$
    - $u_{student}(z_{t},t,r)$ : velocity predicted by the student model, $\bar{v}_{teacher}(z_{t},t,r)$ : target velocity from the teacher
    - The student model predicts the averaged velocity under teacher guidance, with the integral of the instantaneous velocity approximated through iterative sampling
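
A minimal sketch of how Eq. 6 and Eq. 8 fit together for a single sample is given below. `toy_teacher` and `toy_student` are hypothetical placeholders, and uniform sub-intervals are assumed; a real setup would use neural networks and average the loss over sampled $(t, r)$ pairs.

```python
def euler_average_velocity(velocity_fn, z_t, t, r, n):
    # Eq. 4-6: n Euler steps with the teacher, then normalize the displacement by (r - t)
    z = list(z_t)
    h = (r - t) / n
    for k in range(n):
        v = velocity_fn(z, t + k * h)
        z = [z_i + h * v_i for z_i, v_i in zip(z, v)]
    return [(a - b) / (r - t) for a, b in zip(z, z_t)]

def distill_loss(student_fn, teacher_fn, z_t, t, r, n):
    # Eq. 8 for one sample: squared error between student output and teacher target
    target = euler_average_velocity(teacher_fn, z_t, t, r, n)
    pred = student_fn(z_t, t, r)
    return sum((p - q) ** 2 for p, q in zip(pred, target))

def toy_teacher(z, t):
    # Hypothetical constant field: the averaged velocity is exactly 1 everywhere
    return [1.0 for _ in z]

def toy_student(z_t, t, r):
    # Hypothetical untrained student that always predicts 0.5
    return [0.5 for _ in z_t]

loss = distill_loss(toy_student, toy_teacher, [0.0, 0.0], t=0.0, r=1.0, n=4)
```

Once trained, one student call can advance the state over the whole interval via $z_{r}=z_{t}+(r-t)\cdot u_{student}(z_{t},t,r)$, which follows directly from Eq. 3.
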

- Optimal Step Sampling Search (OS3)

Existing flow-based TTS uses a continuous function or a hard-coded discrete step schedule to meet the NFE requirement
- In contrast, the paper optimizes the sampling steps to fit the model's inference process
  - This is because speech quality as a function of the sampling step positions exhibits near-convex behavior
- As a result, the paper's Optimal Step Sampling Search (OS3) algorithm optimizes the distribution of a fixed number of sampling steps over the entire inference interval $[0,1]$
  1. OS3 uses ternary search to optimize the placement of each sampling step
    - That is, all sampling steps but one are fixed, and ternary search is applied to optimize the remaining one
  2. This process is repeated for each step and continues until no further improvement is observed on the development set
    - This allows OS3 to identify the optimal distribution of sampling steps
  3. The metric function $\mathcal{L}$ in [Algorithm 1] computes a pre-defined metric on the generated audio, given the sampling step set $T$ and the development set
    - The paper adopts speaker similarity as the metric
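
The coordinate-wise ternary search described above might look roughly like the sketch below. This is not a reproduction of [Algorithm 1]: the function names, the fixed endpoints at 0 and 1, the stopping tolerance, and `toy_metric` (a stand-in for speaker similarity on the development set, where higher is better) are all assumptions made for illustration.

```python
def ternary_search_max(f, lo, hi, iters=40):
    """Maximize a unimodal function f on [lo, hi] by ternary search."""
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if f(m1) < f(m2):
            lo = m1   # the maximum lies to the right of m1
        else:
            hi = m2   # the maximum lies to the left of m2
    return 0.5 * (lo + hi)

def os3_sketch(metric, steps, max_rounds=20, tol=1e-6):
    """Move one interior step at a time with the others fixed, until no improvement."""
    steps = list(steps)
    best = metric(steps)
    for _ in range(max_rounds):
        for i in range(1, len(steps) - 1):          # endpoints stay fixed (assumption)
            def score_at(x, i=i):
                return metric(steps[:i] + [x] + steps[i + 1:])
            # Search between the neighbouring steps so the schedule stays ordered
            steps[i] = ternary_search_max(score_at, steps[i - 1], steps[i + 1])
        new = metric(steps)
        if new - best <= tol:                       # no further improvement
            break
        best = new
    return steps

def toy_metric(steps):
    # Hypothetical stand-in for dev-set speaker similarity, peaked at a known schedule
    targets = [0.0, 0.1, 0.4, 1.0]
    return -sum((s - g) ** 2 for s, g in zip(steps, targets))

steps_opt = os3_sketch(toy_metric, [0.0, 1.0 / 3.0, 2.0 / 3.0, 1.0])
```

In practice each `metric` call would require synthesizing the development set with the candidate schedule, so the number of ternary-search iterations would likely be kept much smaller than in this toy setting.
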

- Initialization Strategy for Int-MeanFlow

To adapt a flow matching model to Int-MeanFlow, an additional parameter $r$ is introduced
- $t, r$ pass through the same embedding layer, are concatenated, and are then projected back to the feature space by a linear mapping $\mathbf{W}$
- Let the embeddings of $t, r$ be $\mathbf{e}_{t}=\mathcal{E}(t), \mathbf{e}_{r}=\mathcal{E}(r)$, respectively
  1. The concatenated and mapped embeddings $\mathbf{e}_{t,r},\mathbf{e}'_{t,r}$ are:
    (Eq. 9) $\mathbf{e}_{t,r}=[\mathbf{e}_{t},\mathbf{e}_{r}],\,\,\,\mathbf{e}'_{t,r}=\mathbf{W}\mathbf{e}_{t,r}$
  2. To preserve the original model behavior, $\mathbf{W}$ is initialized as:
    (Eq. 10) $\mathbf{W}=[D_{diag}\,\, 0]$
    - $D_{diag}$ : diagonal matrix
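
To see why Eq. 10 preserves the pre-trained behavior, the sketch below applies $\mathbf{W}=[D_{diag}\,\,0]$ to a concatenated embedding. It assumes $D_{diag}$ is the identity (the review only says "diagonal matrix"), in which case the projected embedding equals $\mathbf{e}_{t}$ and the $r$ branch has no effect at initialization. The helper names are hypothetical.

```python
def init_projection(dim):
    # W = [D_diag 0] of shape (dim, 2 * dim), with D_diag = identity (assumed here)
    return [[1.0 if j == i else 0.0 for j in range(2 * dim)] for i in range(dim)]

def project(W, e_t, e_r):
    # Eq. 9: concatenate the two embeddings, then apply the linear map W
    e_tr = e_t + e_r  # list concatenation, i.e. [e_t, e_r]
    return [sum(w * x for w, x in zip(row, e_tr)) for row in W]

W = init_projection(3)
e_t = [0.2, -1.0, 0.5]   # toy embedding of t
e_r = [9.0, 9.0, 9.0]    # toy embedding of r, ignored at initialization
out = project(W, e_t, e_r)
```

During fine-tuning the zero block of $\mathbf{W}$ can move away from zero, which lets the model gradually start using $r$.
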

3. Experiments

- Settings

Dataset : Emilia
Comparisons : F5-TTS, CosyVoice2

- Results

Overall, Int-MeanFlow achieves the best performance

Int-MeanFlow also achieves the best performance on Token-to-Mel

Int-MeanFlow performs well even at small NFE

A larger teacher NFE increases training time
