[Paper Review] Diffusion-Based Generative Speech Source Separation

feVeRin 2024. 1. 2. 12:02

Diffusion-Based Generative Speech Source Separation

Stochastic Differential Equations (SDEs) can be used for source separation
DiffSep
- Uses a continuous-time diffusion-mixing process that starts from the separated sources and converges to a Gaussian distribution centered on their mixture
- Trains a neural network to approximate the score function of the marginal probabilities of the diffusion-mixing process
- Uses the neural network to solve the reverse-time SDE, which progressively separates the sources starting from the mixture
Paper (ICASSP 2023) : Paper Link

1. Introduction

Source separation refers to extracting the signals of interest from a mixture
- Early work mainly relied on Non-Negative Matrix Factorization (NMF)
- The advent of Deep Neural Networks (DNNs) then greatly improved separation performance
  - DNN-based methods moved from predicting source clusters to directly predicting mask values

Such data-driven methods must resolve the fundamental source permutation ambiguity
- The sources have no inherent preferred order
- Permutation invariant training (PIT) is the usual way to resolve this ambiguity
  - The objective function is computed for every permutation of the sources, and the gradient is computed from the permutation that gives the minimum value
  - Time-domain models: Conv-TasNet, dual-path transformer, Wavesplit, etc.
  - Time-frequency domain models: TF-GridNet, etc.
- Most of these models are trained with the Scale-Invariant Signal-to-Distortion Ratio (SI-SDR)
  -> Apart from NMF, generative models have rarely been used for separation
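
The PIT idea above can be sketched in a few lines. This is a minimal illustration, not the training code of any of the models listed: the helper names `si_sdr` and `pit_loss` and the toy signals are made up for the example, and the loss is simply the negative mean SI-SDR.

```python
import itertools
import numpy as np

def si_sdr(estimate, target, eps=1e-8):
    """Scale-invariant SDR in dB between two 1-D signals."""
    # Project the estimate onto the target to get the scaled reference.
    alpha = np.dot(estimate, target) / (np.dot(target, target) + eps)
    projection = alpha * target
    noise = estimate - projection
    return 10.0 * np.log10((np.dot(projection, projection) + eps) / (np.dot(noise, noise) + eps))

def pit_loss(estimates, targets):
    """Evaluate the loss for every source permutation and keep the minimum."""
    best_loss, best_perm = np.inf, None
    for perm in itertools.permutations(range(len(targets))):
        # Output i is compared against target perm[i].
        loss = -np.mean([si_sdr(estimates[i], targets[p]) for i, p in enumerate(perm)])
        if loss < best_loss:
            best_loss, best_perm = loss, perm
    return best_loss, best_perm

rng = np.random.default_rng(0)
targets = rng.standard_normal((2, 1000))
# Toy estimates: the targets in swapped order plus a little noise.
estimates = targets[::-1] + 0.01 * rng.standard_normal((2, 1000))
loss, perm = pit_loss(estimates, targets)
```

Here the outputs come out in swapped order, so the minimum is reached by the swapped permutation; in training, the gradient would be taken through that permutation only.
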
Generative modelling can approximate complex data distributions
- Score-based Generative Modelling (SGM) in particular performs well
  - A forward process gradually transforms the target distribution into Gaussian noise, and the target distribution is sampled by running the reverse process with the score function, i.e. the gradient of the log-probability
- The score function is approximated by a DNN, and the framework has been formulated in two ways
  1. Graphical model
  2. Stochastic Differential Equation (SDE)
- Of these, diffusion-based separation using the SDE formulation has not yet been proposed

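-> DiffSep, a diffusion-based source separation model, is therefore proposed

DiffSep
- Designs an SDE that starts from the separated signals and converges to a distribution centered on their mixture
- Recovers the individual sources from the mixture by solving the reverse-time SDE
- Presents a training procedure for the score-matching network that resolves the source-assignment ambiguity at inference time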

< Overall of DiffSep >

A diffusion-based model that progressively separates the sources from a mixture by solving an SDE
A source separation model that produces natural-sounding output and achieves high non-intrusive scores
Suggests an extension to speech enhancement by treating noise as an extra source

2. Background

- Notation and Signal Model

Norm of a vector $x$ : $\| x \| = (x^{T} x)^{1 / 2}$ , $I_{N}$ : $N \times N$ identity matrix
Audio signals are represented as real-valued vectors in $\mathbb{R}^{N}$ with $N$ samples
- A mixture of $K$ sources is:
  (Eq.1) $y = \sum_{k = 1}^{K} s_{k}$
  - $s_{k} \in \mathbb{R}^{N}$ for all $k$
- The model can account for noise, reverberation, and other degradations by adding extra sources
  - Concatenating all source signals gives a vector in $\mathbb{R}^{K N}$
  - The vector of separated sources and its average are:
  (Eq.2) $s = [s_{1}^{T}, \ldots, s_{K}^{T}]^{T}, \quad \bar{s} = K^{- 1} [y^{T}, \ldots, y^{T}]^{T}$
A time-invariant mixing operation can be defined as $(A \otimes I_{N})$
- $A$ : $K \times K$ matrix, $\otimes$ : Kronecker product
- Multiplying a vector $v \in \mathbb{R}^{K N}$ by this matrix is equivalent to multiplying by $A$ every length-$K$ sub-vector formed from the elements of $v$ taken at intervals of $N$:
  (Eq.3) $((A \otimes I_{N}) v)_{k N + n} = \sum_{\ell = 1}^{K} A_{k \ell} v_{\ell N + n}$
  - $k = 1, \ldots, K$ , $n = 1, \ldots, N$
To simplify later expressions, the regular matrix product notation $A v \triangleq (A \otimes I_{N}) v$ is used
- The projection matrices are defined as:
  (Eq.4) $P = K^{- 1} 1 1^{T}, \quad \bar{P} = I_{K} - P$
  - $1$ : all-ones vector of size $K$
- $P$ projects onto the subspace of average values, and $\bar{P}$ projects onto its orthogonal complement
  - A compact alternative to (Eq.1) is $P s = \bar{s}$
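
The notation above can be checked numerically. The snippet below is a small sketch with toy sizes ($K = 2$, $N = 4$) chosen only for illustration; the helper name `mix` is hypothetical. It applies $(A \otimes I_{N}) v$ by reshaping $v$ into a $K \times N$ array instead of forming the Kronecker product.

```python
import numpy as np

K, N = 2, 4                          # toy sizes: 2 sources, 4 samples each
rng = np.random.default_rng(0)
sources = rng.standard_normal((K, N))

s = sources.reshape(-1)              # s = [s_1^T, ..., s_K^T]^T in R^{KN}
y = sources.sum(axis=0)              # Eq.1: mixture
s_bar = np.tile(y / K, K)            # Eq.2: average, repeated K times

P = np.ones((K, K)) / K              # Eq.4: projection onto the average
P_bar = np.eye(K) - P                # Eq.4: orthogonal complement

def mix(A, v):
    """Apply (A ⊗ I_N) v without building the KN x KN matrix."""
    num_sources = A.shape[0]
    return (A @ v.reshape(num_sources, -1)).reshape(-1)

# P s equals s_bar, and the shortcut agrees with the explicit Kronecker product.
same_as_average = np.allclose(mix(P, s), s_bar)
same_as_kron = np.allclose(mix(P, s), np.kron(P, np.eye(N)) @ s)
```

Keeping the signals as a $(K, N)$ array and multiplying by the small $K \times K$ matrix is what makes the overloaded notation $A v$ cheap in practice.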

- SDE for Score-based Generative Modelling

Score-based Generative Modelling (SGM) with SDEs is a way to model complex data distributions
- A diffusion process that starts from samples of the target distribution and converges to a Gaussian distribution is described by the following SDE:
  (Eq.5) $d x_{t} = f (x_{t}, t) d t + g (t) d w_{t}$
  - $x_{t} : \mathbb{R} \to \mathbb{R}^{N}$ : vector-valued function of the time $t \in \mathbb{R}$
  - $d x_{t}$ : its derivative with respect to $t$
  - $d t$ : infinitesimal time step
  - $f : \mathbb{R}^{N} \times \mathbb{R} \to \mathbb{R}^{N}$ , $g : \mathbb{R} \to \mathbb{R}$ : drift and diffusion coefficients of $x_{t}$
  - $d w_{t}$ : increment of a standard Wiener process
- The time $t$ of the diffusion process is unrelated to the time axis of the audio signal
- The key point about (Eq.5) is that, under mild conditions, a corresponding reverse-time SDE exists:
  (Eq.6) $d x_{t} = - [f (x_{t}, t) - g (t)^{2} \nabla_{x_{t}} \log p_{t} (x_{t})] d t + g (t) d \bar{w}$
  - Time runs from $t = T$ to $t = 0$ , $d \bar{w}$ : reverse Brownian process, $p_{t} (x)$ : marginal distribution of $x_{t}$
When $x_{0}$ is known, $p_{t} (x)$ of the forward process generally has a closed form
- In particular, when $f (x, t)$ is affine, $p_{t} (x)$ is Gaussian and fully described by its mean and covariance matrix, so it is tractable
  - In the reverse process $p_{t} (x)$ is unknown, so (Eq.6) cannot be solved directly
- SGM therefore trains a neural network $q_{θ} (x, t)$ such that $q_{θ} (x, t) \approx \nabla_{x} \log p_{t} (x)$
  - Once the approximation is good enough, samples from the distribution of $x_{0}$ can be generated by replacing $\nabla_{x} \log p_{t} (x)$ with $q_{θ} (x, t)$ and solving (Eq.6) numerically
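
To make the last step concrete, here is a toy sketch of solving (Eq.6) numerically with Euler-Maruyama steps. Everything in it is an assumption for illustration: the data distribution is a 1-D Gaussian, $f = 0$, $g$ is a constant, and the exact score of the toy marginal stands in for a trained $q_{θ}$.

```python
import numpy as np

# Toy setup (all values are made up): data ~ N(M, S0^2), forward SDE dx = G0 dw.
M, S0, G0, T = 2.0, 0.5, 1.0, 1.0

def score(x, t):
    """Exact score of the marginal p_t = N(M, S0^2 + G0^2 t); stand-in for q_theta."""
    return -(x - M) / (S0**2 + G0**2 * t)

def reverse_sde(num_samples=20000, num_steps=500, seed=0):
    """Integrate Eq.6 from t = T down to t = 0 with Euler-Maruyama steps."""
    rng = np.random.default_rng(seed)
    dt = T / num_steps
    # Start from the marginal at t = T.
    x = M + np.sqrt(S0**2 + G0**2 * T) * rng.standard_normal(num_samples)
    for i in range(num_steps):
        t = T - i * dt
        drift = 0.0 - G0**2 * score(x, t)          # f - g^2 * score, with f = 0
        x = x - drift * dt + G0 * np.sqrt(dt) * rng.standard_normal(num_samples)
    return x

samples = reverse_sde()
```

With the exact score, the samples at $t = 0$ should have roughly the mean and standard deviation of the toy data distribution; with a learned score the same loop applies, only `score` changes.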

3. Diffusion-Based Source Separation

SGM is applied to source separation by designing a forward SDE in which diffusion and mixing across the sources happen over time
- Each step of the process can be seen as adding an infinitesimal amount of noise and performing an infinitesimal amount of mixing
- Formulated as an SDE:
  (Eq.7) $d x_{t} = - γ \bar{P} x_{t} d t + g (t) d w, \quad x_{0} = s$
  - $\bar{P}$ : (Eq.4), $s$ : (Eq.2)
  - The diffusion coefficient of the Variance Exploding SDE is used:
  (Eq.8) $g (t) = σ_{\min} {(\frac{σ_{\max}}{σ_{\min}})}^{t} \sqrt{2 \log (\frac{σ_{\max}}{σ_{\min}})}$
  -> With (Eq.7), the mean $μ_{t}$ of the marginal moves from the vector of separated signals at $t = 0$ toward the mixture vector $\bar{s}$ as $t$ grows
- [Theorem 1.] The marginal distribution of $x_{t}$ is Gaussian with mean and covariance matrix
  (Eq.9) $μ_{t} = (1 - e^{- γ t}) \bar{s} + e^{- γ t} s$ , (Eq.10) $Σ_{t} = λ_{1} (t) P + λ_{2} (t) \bar{P}$
  where, with $ξ_{1} = 0, ξ_{2} = γ, ρ = \frac{σ_{\max}}{σ_{\min}}$ ,
  (Eq.11) $λ_{k} (t) = \frac{σ_{\min}^{2} (ρ^{2 t} - e^{- 2 ξ_{k} t}) \log ρ}{ξ_{k} + \log ρ}$
- As a result, a sample $x_{t}$ has the explicit expression:
  (Eq.12) $x_{t} = μ_{t} + L_{t} z, \quad z \sim \mathcal{N} (0, I_{K N})$
  - $L_{t} = λ_{1} (t)^{1 / 2} P + λ_{2} (t)^{1 / 2} \bar{P}$
- Looking at how the parameters of the distribution of $x_{t}$ change over time,
  - $γ$ can be tuned to make the gap between $μ_{t}$ and $\bar{s}$ small at large $t$
  - Because of the mixing process, the correlation of the noise added to the sources increases over time
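
The closed-form marginal makes it possible to draw $x_{t}$ directly, without simulating the SDE. The sketch below follows (Eq.9), (Eq.11) and (Eq.12) on a $(K, N)$ array of sources; the values of $γ$, $σ_{\min}$, $σ_{\max}$ and the function names are illustrative assumptions, not settings taken from this post.

```python
import numpy as np

GAMMA, SIGMA_MIN, SIGMA_MAX = 2.0, 0.05, 0.5   # illustrative values only

def lambdas(t):
    """Eigenvalues (lambda_1, lambda_2) of Sigma_t from Eq.11 (xi_1 = 0, xi_2 = GAMMA)."""
    log_rho = np.log(SIGMA_MAX / SIGMA_MIN)
    rho_2t = np.exp(2.0 * t * log_rho)
    return tuple(
        SIGMA_MIN**2 * (rho_2t - np.exp(-2.0 * xi * t)) * log_rho / (xi + log_rho)
        for xi in (0.0, GAMMA)
    )

def sample_xt(sources, t, rng):
    """Draw x_t = mu_t + L_t z (Eq.12) for a (K, N) array of sources."""
    K, N = sources.shape
    P = np.ones((K, K)) / K
    P_bar = np.eye(K) - P
    s_bar = P @ sources                                   # every row is y / K
    mu_t = (1.0 - np.exp(-GAMMA * t)) * s_bar + np.exp(-GAMMA * t) * sources  # Eq.9
    lam1, lam2 = lambdas(t)
    L_t = np.sqrt(lam1) * P + np.sqrt(lam2) * P_bar
    z = rng.standard_normal((K, N))
    return mu_t + L_t @ z, mu_t, z

rng = np.random.default_rng(0)
sources = rng.standard_normal((2, 8))
x_t, mu_t, z = sample_xt(sources, 0.5, rng)
```

Two quick sanity checks on the formulas: for $ξ_{1} = 0$, (Eq.11) reduces to $σ_{\min}^{2} (ρ^{2 t} - 1)$, and the rows of $μ_{t}$ always sum to the mixture $y$, since the mixing only redistributes signal between the sources.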

Diffusion-based speech enhancement usually applies the diffusion process in a non-linearly transformed STFT domain
- BUT, (Eq.7) models a linear mixing of the sources, so it cannot be applied in a non-linearly transformed domain
- Instead, the diffusion process is carried out in the time domain, and the network operates in the non-linear STFT domain
  - A Noise Conditioned Score Matching Network (NCSN) is used
  1. STFT and iSTFT layers are placed before and after the network
  2. A non-linear transform $c (x)$ is added after the STFT, and its inverse is applied before the iSTFT:
  $c (x) = β^{- 1} | x |^{α} e^{j \angle x}$ , $c^{- 1} (x) = β | x |^{1 / α} e^{j \angle x}$
  - The real and imaginary parts are concatenated
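
A small sketch of the magnitude compression applied to complex STFT coefficients. The values of $α$ and $β$ and the function names are assumptions for illustration. Note one deliberate choice: the inverse is written as $(β | x |)^{1 / α}$ so that the round trip is exact in this sketch, which differs by a constant factor from the inverse as written above.

```python
import numpy as np

ALPHA, BETA = 0.5, 0.15   # illustrative values only

def compress(x, alpha=ALPHA, beta=BETA):
    """c(x): compress the magnitude, keep the phase."""
    return (np.abs(x) ** alpha / beta) * np.exp(1j * np.angle(x))

def decompress(x, alpha=ALPHA, beta=BETA):
    """Exact inverse of compress (assumption: scale applied inside the power)."""
    return (beta * np.abs(x)) ** (1.0 / alpha) * np.exp(1j * np.angle(x))

def to_real_channels(x):
    """Stack real and imaginary parts as two channels for a real-valued network."""
    return np.stack([x.real, x.imag], axis=0)

coeffs = np.array([3.0 + 4.0j, -1.0 + 0.5j, 0.2j])   # stand-in for STFT bins
features = to_real_channels(compress(coeffs))
restored = decompress(compress(coeffs))
```

With $α < 1$ the transform boosts low-energy bins relative to loud ones, which is the usual motivation for applying it before the network.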

- Inference

At inference, source separation is performed by solving the reverse-time SDE (Eq.6) with the score-matching network $q_{θ} (x, t, y) \approx \nabla_{x} \log p_{t} (x)$
- The initial value is sampled from:
  (Eq.13) ${\bar{x}}_{T} \sim \mathcal{N} (\bar{s}, Σ_{T} I_{K N})$
  - A predictor-corrector approach is then used to solve (Eq.6)
- The prediction step is done by reverse diffusion sampling, and the correction step by annealed Langevin sampling
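
The inference loop can be sketched as follows. This is only a toy under stated assumptions: an oracle Gaussian score that knows the true sources replaces the trained $q_{θ}$, the predictor is a plain Euler-Maruyama-style step of (Eq.6), the Langevin step size uses a common signal-to-noise heuristic, and all constants and function names are made up for the example.

```python
import numpy as np

GAMMA, SIGMA_MIN, SIGMA_MAX, T, T_EPS = 2.0, 0.05, 0.5, 1.0, 0.03  # illustrative values
LOG_RHO = np.log(SIGMA_MAX / SIGMA_MIN)

def g(t):
    """Diffusion coefficient of Eq.8."""
    return SIGMA_MIN * np.exp(t * LOG_RHO) * np.sqrt(2.0 * LOG_RHO)

def lam(t, xi):
    """Eigenvalue of Sigma_t along P (xi = 0) or P_bar (xi = GAMMA), Eq.11."""
    return SIGMA_MIN**2 * (np.exp(2 * t * LOG_RHO) - np.exp(-2 * xi * t)) * LOG_RHO / (xi + LOG_RHO)

def oracle_score(x, t, sources):
    """Stand-in for q_theta: exact score of N(mu_t, Sigma_t); needs the true sources."""
    K = sources.shape[0]
    P = np.full((K, K), 1.0 / K)
    P_bar = np.eye(K) - P
    mu_t = P @ sources + np.exp(-GAMMA * t) * (P_bar @ sources)   # Eq.9
    sigma_inv = P / lam(t, 0.0) + P_bar / lam(t, GAMMA)
    return -sigma_inv @ (x - mu_t)

def separate(sources, num_steps=50, snr=0.3, seed=0):
    rng = np.random.default_rng(seed)
    K, N = sources.shape
    P = np.full((K, K), 1.0 / K)
    P_bar = np.eye(K) - P
    # Eq.13: start at s_bar (it depends only on the mixture y) plus noise.
    L_T = np.sqrt(lam(T, 0.0)) * P + np.sqrt(lam(T, GAMMA)) * P_bar
    x = P @ sources + L_T @ rng.standard_normal((K, N))
    dt = (T - T_EPS) / num_steps
    for i in range(num_steps):
        t = T - i * dt
        # Corrector: one Langevin step at the current noise level.
        score = oracle_score(x, t, sources)
        z = rng.standard_normal((K, N))
        eps = 2.0 * (snr * np.linalg.norm(z) / np.linalg.norm(score)) ** 2
        x = x + eps * score + np.sqrt(2.0 * eps) * z
        # Predictor: one reverse step of Eq.6 with f = -GAMMA * P_bar x.
        score = oracle_score(x, t, sources)
        drift = -GAMMA * (P_bar @ x) - g(t) ** 2 * score
        x = x - drift * dt + g(t) * np.sqrt(dt) * rng.standard_normal((K, N))
    return x

true_sources = np.random.default_rng(1).standard_normal((2, 256))
estimate = separate(true_sources)
```

Starting from the noisy average, the iterate drifts back toward the individual sources as $t$ decreases. In a real system `oracle_score` would be replaced by the trained network, which only sees $x_{t}$, $t$ and the mixture $y$.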

- Permutation and Mismatch-aware Training Procedure

A modified score-matching procedure is introduced to train the score network $q_{θ} (x, t, y)$
- By [Theorem 1.], the score function has the closed form:
  (Eq.14) $\nabla_{x_{t}} \log p_{t} (x_{t}) = - Σ_{t}^{- 1} (x_{t} - μ_{t}) = - L_{t}^{- 1} z$
- The training loss is then:
  (Eq.15) $\mathcal{L} = E_{x_{0}, z, t} \| q_{θ} (x_{t}, t, y) + L_{t}^{- 1} z \|_{Σ_{t}}^{2}$
  (Eq.16) $= E_{x_{0}, z, t} \| L_{t} q_{θ} (x_{t}, t, y) + z \|^{2}$
  - $z \sim \mathcal{N} (0, I_{K N})$ , $t \sim \mathcal{U} (t_{ϵ}, T)$
  - $x_{0}$ : chosen at random from the dataset
  - $\| \cdot \|_{Σ_{t}}$ : norm induced by $Σ_{t}$ , used as the weighting scheme
BUT, training with this procedure alone does not give good inference performance
1. There is a mismatch between $E [{\bar{x}}_{T}] = \bar{s}$ in (Eq.13) and $μ_{T}$
  - At $t = T$ the true score function does not simplify as in (Eq.14),
  - so $\nabla_{x_{T}} \log p ({\bar{x}}_{T}) = - Σ_{T}^{- 1} (\bar{s} + L_{T} z - μ_{T})$ is used instead
2. The network has to decide in which order to output the sources
  - This is handled by training the network with a PIT objective that includes the model mismatch
  1. For each sample, with probability $1 - p_{T}$ , where $p_{T} \in [0, 1]$ , the regular score-matching procedure minimizing (Eq.16) is applied
  2. With probability $p_{T}$ , $t$ is set to $T$ and the following alternative loss is minimized:
  $\mathcal{L} = E_{x_{0}, z} \min_{π \in \mathcal{P}} \| L_{T} q_{θ} ({\bar{x}}_{T}, T, y) + z + L_{T}^{- 1} (\bar{s} - μ_{T} (π)) \|^{2}$
  - ${\bar{x}}_{T}$ : sampled from (Eq.13), $\mathcal{P}$ : set of source permutations, $μ_{T} (π)$ : $μ_{T}$ computed for the permutation $π$
  -> The score network thus learns to remove both the noise and the model mismatch, which reduces the ambiguity over the source order at inference
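
Both losses can be written down per sample as below. This is a sketch under assumptions, not the authors' training code: the constants and function names are illustrative, and the network output `q_out` is passed in as a plain $(K, N)$ array. It relies on $P$ and $\bar{P}$ being orthogonal projectors, so powers of $L_{t}$ only change the exponents of $λ_{1}$ and $λ_{2}$.

```python
import itertools
import numpy as np

GAMMA, SIGMA_MIN, SIGMA_MAX, T = 2.0, 0.05, 0.5, 1.0   # illustrative values
LOG_RHO = np.log(SIGMA_MAX / SIGMA_MIN)

def l_matrix(t, K, power=1.0):
    """L_t ** power with L_t = lambda_1^{1/2} P + lambda_2^{1/2} P_bar (Eq.11, Eq.12)."""
    P = np.full((K, K), 1.0 / K)
    lam = [SIGMA_MIN**2 * (np.exp(2 * t * LOG_RHO) - np.exp(-2 * xi * t)) * LOG_RHO / (xi + LOG_RHO)
           for xi in (0.0, GAMMA)]
    return lam[0] ** (power / 2) * P + lam[1] ** (power / 2) * (np.eye(K) - P)

def mean_at(t, sources):
    """mu_t of Eq.9 for a (K, N) array of sources."""
    s_bar = np.mean(sources, axis=0, keepdims=True)
    return s_bar + np.exp(-GAMMA * t) * (sources - s_bar)

def score_matching_loss(q_out, t, z):
    """Single-sample value of Eq.16."""
    return np.sum((l_matrix(t, z.shape[0]) @ q_out + z) ** 2)

def terminal_pit_loss(q_out, sources, z):
    """Single-sample alternative loss at t = T, minimized over source permutations."""
    K = sources.shape[0]
    s_bar = np.mean(sources, axis=0, keepdims=True)
    base = l_matrix(T, K) @ q_out + z
    best_loss, best_perm = np.inf, None
    for perm in itertools.permutations(range(K)):
        mismatch = l_matrix(T, K, power=-1.0) @ (s_bar - mean_at(T, sources[list(perm)]))
        loss = np.sum((base + mismatch) ** 2)
        if loss < best_loss:
            best_loss, best_perm = loss, perm
    return best_loss, best_perm

rng = np.random.default_rng(0)
sources = rng.standard_normal((2, 16))
z = rng.standard_normal((2, 16))
# A perfect score estimate at t = 0.4 gives zero loss in Eq.16.
regular = score_matching_loss(-l_matrix(0.4, 2, power=-1.0) @ z, 0.4, z)
```

In a training loop one would draw a uniform random number per sample, use `terminal_pit_loss` when it falls below $p_{T}$ and `score_matching_loss` otherwise, mirroring steps 1 and 2 above.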

4. Experiments

- Settings

Datasets : WSJ0-2mix, Libri2Mix, VoiceBank-DEMAND
Comparisons : Conv-TasNet, CDiffuse, SGMSE

- Results

Separation
- In terms of SI-SDR, PESQ, and STOI, DiffSep scores lower than Conv-TasNet
  - This is because natural-sounding permutations occur between similar-sounding speakers
- In terms of OVRL, DiffSep performs better

Comparing the spectrograms of the clean and separated sources shows that block permutations of the sources occur in the low-quality samples

Enhancement
- DiffSep performs relatively worse than Conv-TasNet, but better than the other diffusion-based methods
- DiffSep achieves a high OVRL score, indicating good enhancement quality
