[Paper Review] FlowDec: A Flow-Based Full-Band General Audio Codec with High Perceptual Quality

feVeRin 2025. 4. 12. 12:12

FlowDec: A Flow-Based Full-Band General Audio Codec with High Perceptual Quality

A general full-band audio codec that also works at lower bitrates is needed
FlowDec
- Uses non-adversarial codec training and a stochastic postfilter based on conditional flow matching
- Reduces the number of required postfilter evaluations without fine-tuning or distillation
Paper (ICLR 2025) : Paper Link

1. Introduction

An audio codec aims to compress an audio waveform into a compact, quantized representation and to faithfully reconstruct the waveform from that representation
- BUT, conventional codecs rely on ad-hoc design and extensive manual effort, which makes end-to-end optimization for high-fidelity audio coding at lower bitrates (12 kbit/s and below) difficult
  - Meanwhile, End-to-End (E2E) neural codecs such as SoundStream, DAC, AudioDec, and EnCodec achieve good audio quality even at lower bitrates such as 8 kbit/s
- In particular, reconstruction quality can be improved by introducing score-based diffusion or flow-based generative models, as in ScoreDec
  - BUT, ScoreDec only operates at a high bitrate of 24 kbit/s, and its many DNN evaluations degrade the Real-Time Factor (RTF)

-> FlowDec, a flow-based neural codec that also works effectively at lower bitrates, is therefore proposed

FlowDec
- Extends the existing score-based method to a Conditional Flow Matching (CFM) method
- Supports lower-bitrate, general full-band audio coding by reducing the number of DNN evaluations without fine-tuning or distillation

< Overview of FlowDec >

A full-band neural audio codec based on CFM
As a result, it greatly improves the RTF and achieves high-fidelity perceptual quality at lower bitrates

2. Method

The paper considers the stochastic inference problem of reconstructing an estimate $\hat{x} \in \mathbb{R}^{L}$ of the clean audio $x^{*} \in \mathbb{R}^{L}$ given the code $c := E(x^{*})$
- The model aims to provide the clean audio estimate $\hat{x}$ by sampling from a distribution:
  (Eq. 1) $\hat{x} \sim p_{\text{data}}(\hat{x} | c), \quad c = E(x^{*}) \in \mathbb{Z}^{\ell}, \quad \ell \ll L$
  - $p_{\text{data}}(\cdot | c)$ : conditional distribution of clean audio given the code $c$
- Any encoder $E$ that maps $x^{*} \in \mathbb{R}^{L}$ to a lower-dimensional discrete representation $c$ is a many-to-one mapping, so multiple $x^{*}$ share the same code $c$
  1. Therefore, when the decoder $D$ is a one-to-one mapping (one output per code), fulfilling the ideal property $D(E(x^{*})) = x^{*}$ is formally impossible
  2. Instead, $D$ can be constructed as an optimal estimator by minimizing (Eq. 2):
    (Eq. 2) $\min_{D} \mathbb{E}_{x^{*}} \left[ \text{dist}(D(E(x^{*})), x^{*}) \right]$
    - $\text{dist}$ : a pairwise distance such as the $L^{2}$ or $L^{1}$ distance
  3. BUT, a decoder trained this way does not produce perceptually pleasing signals, even with domain-specific losses
- To address this, SoundStream, EnCodec, DAC, and others introduce adversarial training losses that push the distribution of decoded signals close to that of natural signals
  - BUT, adversarial training lacks interpretability and does not properly minimize the distance between $p(\hat{x})$ and $p(x^{*})$
- Therefore, like ScoreDec, the paper constructs $D$ as a one-to-many mapping
  1. That is, FlowDec takes the form of a stochastic decoder $D_{s}(c) = \Omega(D_{0}(c))$, which combines a deterministic pre-trained initial decoder $D_{0}$ with a stochastic postfilter $\Omega$
  2. With $y := D_{0}(c)$, $\Omega$ generates a conditional sample $\hat{x} \sim p_{\Omega}(\hat{x} | y)$ from the learned distribution $p_{\Omega}(\cdot | y)$
    - The learned distribution approximates the intractable distribution $p_{\text{data}}(\cdot | y)$ by minimizing a statistical divergence $\mathcal{D}$
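
The argument above can be illustrated with a toy scalar codec. This is only a minimal sketch, not the paper's model: the floor quantizer `encode`, the bin-center decoder `decode_deterministic`, and the uniform-noise postfilter inside `decode_stochastic` are hypothetical stand-ins, chosen so that the conditional distribution of the clean signal given the code is known exactly. It compares the output variance of an $L^{2}$-optimal deterministic decoder with that of a stochastic decoder of the form $D_{s}(c) = \Omega(D_{0}(c))$.

```python
import random
import statistics


def encode(x: float) -> int:
    """Many-to-one encoder E: every x in [k, k + 1) maps to the same code k."""
    return int(x // 1)


def decode_deterministic(c: int) -> float:
    """Initial decoder D_0: the L2-optimal point estimate for code c (bin center)."""
    return c + 0.5


def decode_stochastic(c: int, rng: random.Random) -> float:
    """Stochastic decoder D_s(c) = Omega(D_0(c)).

    Assumption: the postfilter Omega is hand-written uniform noise that matches
    the toy data distribution; in FlowDec it is a learned generative model.
    """
    y = decode_deterministic(c)
    return y + rng.uniform(-0.5, 0.5)


def run_demo(n: int = 20000, seed: int = 0):
    rng = random.Random(seed)
    clean = [rng.uniform(0.0, 4.0) for _ in range(n)]   # toy "clean audio" samples
    codes = [encode(x) for x in clean]
    det = [decode_deterministic(c) for c in codes]
    sto = [decode_stochastic(c, rng) for c in codes]
    return (
        statistics.pvariance(clean),
        statistics.pvariance(det),
        statistics.pvariance(sto),
    )


var_clean, var_det, var_sto = run_demo()
print(f"clean={var_clean:.3f} deterministic={var_det:.3f} stochastic={var_sto:.3f}")
```

In this toy setup the deterministic decoder collapses every code to a single point, so its output variance is lower than that of the data, while the stochastic decoder's output variance stays close to it. Neither decoder recovers $x^{*}$ exactly, but only the stochastic one produces outputs distributed like the clean signal.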

- Flow Matching

Flow Matching aims to learn a model that transports samples from a tractable distribution $q_{0}(x_{0})$ to the intractable data distribution $q_{1}(x_{1}) = p_{\text{data}}$
- Flow Matching starts from a sample $x_{0} \sim q_{0}$ and solves the following Ordinary Differential Equation (ODE):
  (Eq. 4) $\frac{d}{dt} \phi_{t}(x) = u_{t}(\phi_{t}(x)), \quad \phi_{0}(x) = x_{0}$
- Here $\phi_{t} : [0, 1] \times \mathbb{R}^{N} \to \mathbb{R}^{N}$ is called the flow, and the time-dependent vector field $u_{t} : [0, 1] \times \mathbb{R}^{N} \to \mathbb{R}^{N}$ generates a probability density path $p_{t} : \mathbb{R}^{N} \to \mathbb{R}_{>0}$ with $p_{t=0} = q_{0}, p_{t=1} = q_{1}$
- The model $v_{\theta}$, which approximates the vector field $u_{t}$, is then trained with the following CFM training loss:
  (Eq. 5) $\mathcal{L}_{\text{CFM}} := \mathbb{E}_{x, t, p_{t}(x | x_{1})} \left[ || v_{\theta}(x, t) - u_{t}(x | x_{1}) ||_{2}^{2} \right]$
  - $x_{1} \sim q_{1}$
  - In particular, the conditional objective (Eq. 5) has the same gradient as the intractable unconditional flow matching objective, and it marginalizes to the correct unconditional probability path $p_{t}(x)$ and flow field $u_{t}(x)$
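
The ODE in (Eq. 4) and the loss in (Eq. 5) can be sketched in a few lines of dependency-free Python. Everything here is a toy assumption rather than the paper's setup: signals are scalars, $q_{1}$ is a point mass at `TARGET`, and the conditional path is the straight line $x_{t} = (1 - t) x_{0} + t x_{1}$, whose conditional target field is $x_{1} - x_{0}$. Under these assumptions the marginal field is known in closed form (`ideal_field`), so the loss of a correct and an incorrect field can be compared directly.

```python
import random

TARGET = 2.0  # toy data distribution q1: a point mass at this value (assumption)


def euler_solve(v, x0: float, steps: int = 50) -> float:
    """Integrate dx/dt = v(x, t) from t = 0 to t = 1 with fixed-step Euler (Eq. 4)."""
    x, h = x0, 1.0 / steps
    for k in range(steps):
        x = x + h * v(x, k * h)
    return x


def cfm_loss(v, n: int = 2000, seed: int = 0) -> float:
    """Monte-Carlo estimate of the CFM loss (Eq. 5) for scalar samples.

    Assumption: straight-line conditional path x_t = (1 - t) x0 + t x1,
    so the conditional target field is u_t(x | x1) = x1 - x0.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        x0 = rng.gauss(0.0, 1.0)       # x0 ~ q0 = N(0, 1)
        x1 = TARGET                    # x1 ~ q1
        t = rng.uniform(0.0, 0.99)     # stay away from the singular point t = 1
        xt = (1.0 - t) * x0 + t * x1
        total += (v(xt, t) - (x1 - x0)) ** 2
    return total / n


def ideal_field(x: float, t: float) -> float:
    """Closed-form marginal field for the point-mass target."""
    return (TARGET - x) / (1.0 - t)


def zero_field(x: float, t: float) -> float:
    """A deliberately wrong field that never moves the sample."""
    return 0.0


loss_ideal = cfm_loss(ideal_field)
loss_zero = cfm_loss(zero_field)
x_end = euler_solve(ideal_field, x0=-1.0)
print(loss_ideal, loss_zero, x_end)
```

The correct field drives the loss to (numerically) zero and Euler integration carries the starting sample to `TARGET`, while the zero field leaves a large residual loss. In a real codec $v_{\theta}$ would be a trained network, and $q_{1}$ would not be a point mass.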

- Joint Flow Matching for Signal Enhancement

In the original flow matching formulation, $x_{0}$ and $x_{1}$ are sampled independently, with $x_{0}$ drawn from a zero-mean Gaussian $q_{0} = \mathcal{N}(0, \sigma^{2} I)$
- When $q_{0}$ is a standard Gaussian, the conditional path $p_{t}(x | x_{1})$ fulfills Optimal Transport (OT) from $q_{0}$ to $q_{1}$, but the modeled marginal probability path $p_{t}(x)$ does not
  1. As a result, the learned marginal flow field $v_{\theta}$ suffers from high-variance training and lower straightness, which leads to inefficient inference and suboptimal sample quality
  2. To address this, one can introduce a per-batch approximation that reorders the pairing within each training batch $\{(x_{b,0}, x_{b,1})\}_{b=1}^{B}$ and determines the optimal coupling in each batch with an OT algorithm
    - That is, $(x_{0}, x_{1}) \sim q(x_{0}, x_{1})$ is sampled jointly rather than independently
    - In particular, when $(x_{0}, x_{1})$ can be sampled jointly in the first place, no OT solver or extra computation is needed
- Since the initial estimate $y = D_{0}(c) = D_{0}(E(x^{*}))$ is accessible, the paper chooses the following probability path:
  (Eq. 6) $p_{t}(x_{t} | x_{1}, y) = \mathcal{N}(x_{t}; \mu_{t}, \sigma_{t}) := \mathcal{N}(x_{t}; y + t(x_{1} - y), (1 - t)^{2} \Sigma_{y})$
  - $\Sigma_{y} = \text{diag}(\sigma_{y}^{2})$ : diagonal covariance matrix
- This probability path is a linear interpolation between $y$ and $x_{1}$ whose noise level decreases linearly from $\sigma_{y}$ to $0$, so $x_{0}$ and $x_{1}$ are coupled through $y$
  1. That is, with $q_{0}(x_{0} | x_{1}, y) = \mathcal{N}(x_{0}; y, \Sigma_{y})$, the mean of $x_{0}$ is shifted from $0$ to $y$
  2. The marginalized $q_{0}(x_{0})$ can then be viewed as a Gaussian mixture with variance $\sigma_{y}^{2}$ centered on the training data, as shown in the figure below
  3. If $\sigma_{y}$ is well chosen so that the Gaussians overlap only negligibly, the coupling can be assumed optimal per batch, so mini-batch OT is not needed
    - Therefore, the choice of $\sigma_{y}$ has a large effect on output quality

$q_{0}(x_{0})$ vs. $q_{0}(x_{0} | x_{1})$
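
The probability path in (Eq. 6) is easy to sample for scalar signals, which makes its two defining properties visible: the mean moves linearly from $y$ to $x_{1}$, and the standard deviation shrinks linearly from $\sigma_{y}$ to $0$. The sketch below is a toy check under assumed values; `sample_path` is a hypothetical helper, and the numbers for `x1`, `y`, and `sigma_y` are made up for illustration rather than taken from the paper.

```python
import random
import statistics


def sample_path(x1: float, y: float, sigma_y: float, t: float,
                rng: random.Random) -> float:
    """Draw x_t ~ N(y + t (x1 - y), ((1 - t) sigma_y)^2) for scalar signals (Eq. 6)."""
    mean = y + t * (x1 - y)          # linear interpolation between y and x1
    std = (1.0 - t) * sigma_y        # noise level decays linearly to zero
    return rng.gauss(mean, std)


rng = random.Random(0)
x1, y, sigma_y = 1.0, 0.5, 0.2       # toy values (assumption)

start = [sample_path(x1, y, sigma_y, 0.0, rng) for _ in range(20000)]
middle = [sample_path(x1, y, sigma_y, 0.5, rng) for _ in range(20000)]
end = [sample_path(x1, y, sigma_y, 1.0, rng) for _ in range(5)]

start_mean, start_std = statistics.fmean(start), statistics.pstdev(start)
middle_mean, middle_std = statistics.fmean(middle), statistics.pstdev(middle)
print(start_mean, start_std, middle_mean, middle_std, end)
```

At $t = 0$ the samples scatter around the initial estimate $y$ with spread $\sigma_{y}$, at $t = 0.5$ they sit halfway with half the spread, and at $t = 1$ every sample coincides with $x_{1}$.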

As a result, the conditional $u_{t}$ is obtained as follows
- Following Flow Matching, it is derived as:
  (Eq. 7) $u_{t}(x | x_{1}, y) = \frac{x_{1} - x_{t}}{1 - t}$
- Since $x_{t}$ can be expressed through $x_{0}$:
  (Eq. 8) $x_{t} = t x_{1} + (1 - t) x_{0}, \quad x_{0} \sim \mathcal{N}(x_{0}; y, \Sigma_{y})$
  (Eq. 9) $\phantom{x_{t}} = t x_{1} + (1 - t) y + (1 - t) \sigma_{y} \epsilon, \quad \epsilon \sim \mathcal{N}(0, I)$
  (Eq. 10) $x_{0} = y + \sigma_{y} \epsilon, \quad \epsilon \sim \mathcal{N}(0, I)$
- Following (Eq. 6) and substituting $x_{1} = x^{*}$ yields a simple joint flow matching loss:
  (Eq. 11)
  - $\mathfrak{D}$ : training dataset
- By reparameterizing in terms of $x_{0}$ and $x_{1}$, this loss removes the numerical instability of (Eq. 7) around $t \approx 1$
  1. In particular, choosing $\sigma_{y} > 0$ forces the flow field to be a contractive mapping
    - This makes the ODE used for inference numerically stable and locally convergent
  2. The choice of $p_{t}$ lets the trajectories reach $x^{*}$ exactly, which improves on SGMSE
    - SGMSE does not model the correct $q_{0}$ and can therefore fail
- In addition, instead of a Stochastic Differential Equation (SDE) with multiple hyperparameters, the paper introduces a data-based heuristic that uses only a single hyperparameter $\sigma_y$:
  (Eq. 12) $\sigma_y = \frac{1}{3} \sqrt{Q(|\mathbf{X}^* - \mathbf{Y}|^2, 0.997)}$
  - $Q$ : quantile operation
- Alternatively, the independent CFM formulation gives a constant $\sigma_t = \sigma$ and makes the target flow field $u_t$ independent of the sampled noise
  1. BUT, since $\sigma_1 = \sigma > 0$ in that case, the flow field is non-contractive and residual noise remains in the estimate, as shown in the figure below
  2. In contrast, FlowDec has $\sigma_1 = 0$ and therefore achieves better quality on the postfiltering task
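
The training-time sampling described by (Eq. 9) and (Eq. 10) can be sketched as follows. This is a minimal NumPy sketch, not the paper's implementation: the post does not reproduce (Eq. 11), so the regression target $x_1 - x_0$ (the time derivative of the straight path $x_t = t x_1 + (1 - t) x_0$), the uniform sampling of $t$, and the names `sample_training_pair`, `jfm_loss`, and the `v_theta(x_t, t, y)` signature are all assumptions made for illustration.

```python
import numpy as np

def sample_training_pair(x_star, y, sigma_y, rng):
    """Draw (t, x_t, target) for one joint flow matching training step."""
    eps = rng.standard_normal(x_star.shape)   # eps ~ N(0, I)
    t = rng.uniform(0.0, 1.0)                 # assumption: t ~ U[0, 1)
    x0 = y + sigma_y * eps                    # Eq. 10
    xt = t * x_star + (1.0 - t) * x0          # Eq. 9 with x_1 = x_star
    target = x_star - x0                      # d x_t / d t along the straight path (assumed target)
    return t, xt, target

def jfm_loss(v_theta, x_star, y, sigma_y, rng):
    """Mean squared error between the model output and the path velocity."""
    t, xt, target = sample_training_pair(x_star, y, sigma_y, rng)
    return float(np.mean((v_theta(xt, t, y) - target) ** 2))
```

Since $x_t + (1 - t)(x_1 - x_0) = x_1$, following the target velocity for the remaining time lands exactly on $x^*$, which matches the $\sigma_1 = 0$ property noted above.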

Flow Field at $t = 0.7$

FlowDec replaces $x^*, y$ with the feature representations $\mathbf{X}^*, \mathbf{Y}$ of an invertible feature extractor $\Phi$ and learns the flow in the feature domain
- $\Phi$ is an amplitude-compressed complex STFT with compression exponent $\alpha = 0.3$
  - In addition, $\mathbf{Y}$ is provided to $v_\theta$ as conditioning by channel-wise concatenation at the input
- After training, the flow model $v_\theta$ together with the ODE of (Eq. 4) models the conditional distribution $p_\Omega(\mathbf{X}^* | \mathbf{Y})$
  1. To generate a clean feature estimate $\hat{\mathbf{X}} \sim p_\Omega$, an initial state (latent) $\mathbf{X}_0 \sim q_0(\mathbf{X}_0 | \mathbf{Y})$ is sampled
  2. The flow (Eq. 4) is then solved from $t = 0$ to $t = 1$ with a numerical ODE solver using $v_\theta$, which gives $\hat{\mathbf{X}}_1$
    - The paper uses a 3-step Mid-point Solver ($\text{NFE} = 6$)
  3. Finally, the inverse of the feature extractor $\Phi$ produces the waveform estimate $\hat{x} = \Phi^{-1}(\hat{\mathbf{X}}_1)$
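
The inference steps above can be sketched in NumPy as follows. This is an illustrative sketch under stated assumptions, not the authors' code: the STFT itself is omitted, amplitude compression is assumed to raise the magnitude to $\alpha$ while keeping the phase, the initial noise is assumed to be complex Gaussian with independent real and imaginary parts, and `compress`, `decompress`, `midpoint_solve`, and `postfilter` are hypothetical names. The explicit midpoint rule evaluates the model twice per step, so 3 steps give 6 function evaluations.

```python
import numpy as np

def compress(X, alpha=0.3):
    """Amplitude compression: keep the phase, raise the magnitude to alpha."""
    return np.abs(X) ** alpha * np.exp(1j * np.angle(X))

def decompress(Xc, alpha=0.3):
    """Inverse of compress: undo the magnitude exponent, keep the phase."""
    return np.abs(Xc) ** (1.0 / alpha) * np.exp(1j * np.angle(Xc))

def midpoint_solve(v, x0, y, n_steps=3):
    """Integrate dx/dt = v(x, t, y) from t=0 to t=1 with the explicit midpoint rule."""
    x = np.array(x0)
    h = 1.0 / n_steps
    nfe = 0
    for k in range(n_steps):
        t = k * h
        k1 = v(x, t, y)                            # slope at the start of the step
        k2 = v(x + 0.5 * h * k1, t + 0.5 * h, y)   # slope at the midpoint
        x = x + h * k2
        nfe += 2                                   # two model evaluations per step
    return x, nfe

def postfilter(v, Y, sigma_y, rng, n_steps=3):
    """Sample X_0 = Y + sigma_y * eps, then follow the flow to t=1."""
    # Assumption: complex noise with independent real and imaginary parts.
    eps = rng.standard_normal(Y.shape) + 1j * rng.standard_normal(Y.shape)
    X0 = Y + sigma_y * eps
    return midpoint_solve(v, X0, Y, n_steps)
```

The returned feature estimate would then go through `decompress` and an inverse STFT (not shown) to obtain the waveform.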

- Non-Adversarial Codec Training

A NAR audio generative model trained with spectral losses only, without an effective phase loss, produces buzzy noise caused by unsynchronized phase
- Adversarial training can resolve this and yield natural-sounding audio
  - BUT, it brings problems such as unstable training, mode collapse, and handcrafted multi-discriminator design
- The paper therefore removes adversarial training and introduces a generative postfilter instead
  1. That is, a deterministic neural codec is trained without adversarial loss as the initial decoder $D_0$, and a stochastic postfilter $\Omega$ matches the distribution of the output audio to that of clean audio
  2. Structurally, a neural codec such as DAC is adopted as $D_0$, with every component related to the adversarial loss terms removed

- Underlying Codec: Improved Non-Adversarial DAC

Like ScoreDec, FlowDec's stochastic postfilter can be trained on top of any underlying codec to enhance its waveform estimate
- DAC, which adapts well to different sampling rates and bitrates, is adopted as the basis of the underlying codec
  - In addition, the adversarial loss is removed and the configuration is modified as shown in the table below
- Meanwhile, when training with this non-adversarial loss, Scale-Invariant Signal-to-Distortion Ratio (SI-SDR) values of $-30\,\text{dB}$ sometimes occur
  1. This is because low frequencies ($\leq 2\,\text{kHz}$) are badly modeled; the paper resolves it by introducing a Multiscale Constant-Q Transform (CQT) loss
  2. In addition, as in DAC's multiscale Mel loss, both the amplitude and log-amplitude differences are used, and an $L^1$ waveform-domain loss that reflects SI-SDR and phase errors is added
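
To see why a magnitude-only spectral loss leaves phase unconstrained while SI-SDR and an $L^1$ waveform loss do not, consider a toy example. The sketch below uses the standard SI-SDR definition (project the estimate onto the reference, then compare target and residual energy); the test tone, the `eps` stabilizer, and the function names are assumptions for illustration, and the paper's actual loss weights are not given in this post.

```python
import numpy as np

def si_sdr(estimate, reference, eps=1e-12):
    """Scale-invariant signal-to-distortion ratio in dB."""
    scale = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = scale * reference        # part of the estimate explained by the reference
    residual = estimate - target      # everything else counts as distortion
    return 10.0 * np.log10((np.sum(target ** 2) + eps) / (np.sum(residual ** 2) + eps))

def l1_waveform_loss(estimate, reference):
    """Mean absolute sample difference; sensitive to phase."""
    return float(np.mean(np.abs(estimate - reference)))

n = np.arange(1000)
reference = np.sin(2 * np.pi * 10 * n / 1000)   # 10 full cycles
shifted = np.cos(2 * np.pi * 10 * n / 1000)     # same tone, 90 degrees out of phase
```

Here `reference` and `shifted` have matching magnitude spectra, so a magnitude-based spectral loss cannot tell them apart, yet `si_sdr(shifted, reference)` is far below $-30\,\text{dB}$ and the $L^1$ waveform loss is clearly nonzero.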

- Frequency-Dependent Noise Levels

The choice of $\sigma_y$ has a large effect on output quality
- With a single scalar $\sigma_y$, over-smoothing can appear when the added Gaussian noise dominates the high frequencies
- The paper therefore performs the heuristic quantile calculation of (Eq. 12) independently for each STFT frequency band, which gives a frequency-dependent curve $\sigma_y(f)$
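
A minimal NumPy sketch of (Eq. 12) and its per-band variant is given below. The quantile level 0.997 and the factor 1/3 come from (Eq. 12); the `(frequency, time)` array layout, the use of a single array instead of a whole dataset, and the names `sigma_y_scalar` and `sigma_y_per_band` are assumptions. One possible reading of the heuristic (an interpretation, not stated in this post) is a three-sigma rule: about 99.7% of a Gaussian lies within three standard deviations, so the 0.997 quantile of the error magnitude divided by 3 roughly recovers the error's standard deviation.

```python
import numpy as np

def sigma_y_scalar(X_star, Y, q=0.997):
    """Single noise level from the q-quantile of the squared error (Eq. 12)."""
    err = np.abs(X_star - Y) ** 2
    return np.sqrt(np.quantile(err, q)) / 3.0

def sigma_y_per_band(X_star, Y, q=0.997):
    """One noise level per frequency band; inputs are shaped (frequency, time)."""
    err = np.abs(X_star - Y) ** 2
    return np.sqrt(np.quantile(err, q, axis=1)) / 3.0
```

With the per-band version, bands where the codec error is small receive proportionally less noise, which is the motivation given above for avoiding over-smoothing at high frequencies.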

3. Experiments

- Settings

Dataset : see the table below
Comparisons : DAC, EnCodec, ScoreDec

- Results

Overall, FlowDec shows the best performance

In terms of the Perception-Distortion trade-off, FlowDec is also more robust than DAC

FlowDec also outperforms ScoreDec in a direct comparison

In terms of spectrograms, FlowDec also achieves better reconstruction

Listening Test
- The listening test parameters for subjective evaluation are set as shown in the table below

FlowDec is also the best in subjective evaluation

Real-Time-Factor (RTF)
- With the default setting of $\text{NFE} = 6$, the total RTF is 0.2285 for FlowDec-75 and 0.2235 for FlowDec-25
- That is, FlowDec shows a substantial RTF improvement over ScoreDec's RTF of 1.707
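
To put these numbers in perspective, the relative speedup can be computed directly. The sketch assumes the common convention RTF = processing time / audio duration (values below 1 mean faster than real time), which this post does not define explicitly; the function and constant names are illustrative.

```python
def real_time_factor(processing_seconds, audio_seconds):
    """RTF = processing time / audio duration (below 1 means faster than real time)."""
    return processing_seconds / audio_seconds

RTF_SCOREDEC = 1.707
RTF_FLOWDEC_75 = 0.2285
RTF_FLOWDEC_25 = 0.2235

speedup_75 = RTF_SCOREDEC / RTF_FLOWDEC_75
speedup_25 = RTF_SCOREDEC / RTF_FLOWDEC_25
print(round(speedup_75, 2), round(speedup_25, 2))  # 7.47 7.64
```

Under that convention, ScoreDec runs slower than real time while both FlowDec variants run more than four times faster than real time.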
