[Paper Review] Attention-based Neural Network for End-to-End Music Separation

feVeRin 2023. 9. 23. 14:06

Attention-based Neural Network for End-to-End Music Separation

End-to-End separation has shown strong performance in speech separation but has not yet been applied to music separation
Music signals are dual-channel data with a high sampling rate, so a suitable modeling method is needed
Attention-based End-to-End Music Separation
- A densely connected U-Net to capture long-term musical characteristics such as melody and tone
- Multi-head attention and a dual-path transformer applied to the separation module
Paper (CAAI 2023) : Paper Link

1. Introduction

Popular music is a mono / binaural audio file in which multiple sources share the same track
- Music separation is the task of splitting an audio file into multiple tracks
- Music is separated into accompaniment, vocals, and instruments (drums, bass, other)
Traditional music separation is based on signal processing or mathematical decomposition
- e.g., Robust Principal Component Analysis (RPCA), Non-negative Matrix Factorization (NMF)
- BUT, these traditional approaches require complex modeling
Deep learning-based methods mainly convert the time-domain sampling points of the music into a 2D spectrogram
- Two approaches: feature learning-based and End-to-End systems
  - Feature learning-based methods
  : Perform separation on the estimated source spectrogram using soft masking or multi-channel filtering
  - End-to-End methods
  : Compute the separated waveform directly from the original audio waveform
- Most work focuses on feature learning-based methods that use the magnitude spectrogram, so research on end-to-end methods is still lacking

-> Hence, an effective time-domain end-to-end music separation method is proposed

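Attention-based End-to-End music separation
- Uses a trainable front-end encoder that can exploit the phase information of the signal
  - Replaces the Fourier transform-based T-F spectrogram used in feature learning-based methods
  - Working on time-domain sampling points preserves the integrity of the input signal
  - Avoids the performance loss caused by inaccurate phase estimation
- Introduces multi-head attention into the music separation module
  - Assigns weights to the encoder output
  - Captures long-range information in the music
- Cross-modeling between intra-segment and inter-segment information through a dual-path transformer
  - Unlike the frame-level modeling of an RNN, modeling is possible at the segment scale
  - Captures long-term information in the music sequence, such as melody and genre

< Overview of This Paper >

A front-end CNN encoder that effectively captures information across the stereo channels
Multi-head attention that assigns weights to the information of different instruments
A separation module based on the dual-path transformer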

2. Proposed Network

The attention- and CNN-based end-to-end music separation model consists of an encoder, a separator, and a decoder
- The encoder consists of $N$ blocks that use 1D convolution
  - This repeatedly downsamples the input feature map to obtain high-dimensional features at multiple scales
- The separator separates the encoded high-dimensional features and feeds them to the decoder
- The decoder consists of blocks that include deconvolution
  - The separated result is repeatedly upsampled to restore the signal scale, and the time-domain waveform of the separated target source is output
- Skip connections pass information between the encoder and decoder

Overall structure of the Attention-based Music Separation Model

- Encoder and Decoder

Convolution is effective for capturing the channel correlation of a signal
- Convolution can capture inter-channel features of the music signal and can be approximated as a Finite Impulse Response (FIR) filter applied to the input signal
- Therefore, the result of convolving a time-domain signal can approximate a T-F domain representation
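
The FIR view above can be illustrated with a minimal sketch: a 1D convolution with a fixed kernel computes the same weighted sum of neighboring samples as an FIR filter. The function name `fir_filter` and the two-tap averaging kernel are illustrative assumptions, not taken from the paper; in a trained encoder, learned kernel weights would take the place of the fixed taps.

```python
def fir_filter(x, h):
    """Valid-mode 1D convolution: y[n] = sum_k h[k] * x[n + len(h) - 1 - k]."""
    taps = len(h)
    return [
        sum(h[k] * x[n + taps - 1 - k] for k in range(taps))
        for n in range(len(x) - taps + 1)
    ]


# Hypothetical two-tap averaging (low-pass) kernel
smoothed = fir_filter([1.0, 2.0, 3.0, 4.0], [0.5, 0.5])
print(smoothed)  # [1.5, 2.5, 3.5]
```

A bank of such filters with different taps splits the signal into different frequency emphases, which is the sense in which a convolutional front end can stand in for a T-F representation.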
A new encoder structure is designed based on the separation modules of Wave-U-Net and Conv-Tasnet
- It consists of 1D convolution, GLU activation, depthwise separable convolution, and Leaky-ReLU activation
  - Depthwise separable convolution can effectively reduce the number of parameters while increasing network depth
- A relatively large receptive field is needed because of the high sampling rate of music signals and the dense distribution of the waveform
  - Input channels of the encoder: $C_{in} = C = 2$
  - Output channels of the first encoding module: $C_{1} = 32$
  - Each subsequent encoding module has twice as many output channels as input channels ( $C_{i} = 2 C_{i-1}, \ i = 2, 3, \ldots, 6$ )
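
The channel rule above can be unrolled to see the width of every encoder block. This is a small sketch that only applies the stated recurrence; the helper name `encoder_channels` is hypothetical.

```python
def encoder_channels(c_in=2, c_first=32, num_blocks=6):
    """Return (input, output) channel pairs for each encoder block."""
    pairs = []
    prev, out = c_in, c_first
    for _ in range(num_blocks):
        pairs.append((prev, out))
        # C_i = 2 * C_{i-1} for every block after the first
        prev, out = out, out * 2
    return pairs


for i, (cin, cout) in enumerate(encoder_channels(), start=1):
    print(f"block {i}: {cin} -> {cout}")
```

With the values given in the text, the deepest of the six blocks outputs 1024 channels.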
The decoder is built symmetrically to the encoder
- It consists of a deconvolution layer centered on transposed convolution, Leaky-ReLU activation, 1D convolution, and GLU activation
- Roles of the decoder
  1. Upsampling: restores the feature map size so that the output feature map matches the input feature map
    : Upsampling is performed by the transposed convolution
  2. Restoring the extracted abstract information into the output waveform
- As in U-Net, the encoder and decoder modules exchange information through skip connections
  - This has the advantage of giving direct access to the original signal
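
To see how a transposed convolution undoes the encoder's downsampling, the sequence lengths can be tracked with the standard output-length formulas for unpadded 1D convolution and transposed convolution (dilation 1). The kernel size 8 and stride 4 below are hypothetical values chosen for illustration; the review does not report the paper's settings.

```python
def conv1d_out_len(length, kernel, stride):
    """Output length of an unpadded 1D convolution (dilation 1)."""
    return (length - kernel) // stride + 1


def conv_transpose1d_out_len(length, kernel, stride):
    """Output length of an unpadded 1D transposed convolution (dilation 1)."""
    return (length - 1) * stride + kernel


KERNEL, STRIDE = 8, 4  # hypothetical values, not reported in the review
samples = 44100
down = conv1d_out_len(samples, KERNEL, STRIDE)
up = conv_transpose1d_out_len(down, KERNEL, STRIDE)
print(samples, down, up)  # 44100 11024 44100
```

When `length - kernel` is not a multiple of the stride, the restored length comes back shorter than the input, so an implementation would typically pad or crop before adding a skip connection.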

- Separator based on Attention Mechanism

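For music separation, capturing long-range relationships in the input signal is important
- This is because the correlation between different labels is high
  - BUT, the receptive field of the CNN encoder is limited
- Therefore, it is important to improve the separator's ability to capture long-range features
An attention mechanism is introduced into the separator
- The convolved vector representation contains context-related information, so attention is effective
  - It can also assign weights to the downsampled output
- DPRNN is a structure that can effectively model very long sequences
  - It splits a long input sequence into small blocks and repeatedly applies intra-block and inter-block operations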
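
As a reminder of what the attention weighting mentioned above computes, here is a minimal single-head scaled dot-product attention in plain Python. It is a generic sketch without learned projections or multiple heads, not the paper's module, and the function names are illustrative.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention for lists of equal-length vectors."""
    scale = math.sqrt(len(keys[0]))
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        # Output is the weighted average of the value vectors
        outputs.append([
            sum(w * v[d] for w, v in zip(weights, values))
            for d in range(len(values[0]))
        ])
    return outputs


# Identical keys give uniform weights, so the output is the mean of the values
print(attention([[1.0, 0.0]], [[1.0, 0.0], [1.0, 0.0]], [[0.0, 2.0], [4.0, 6.0]]))
# [[2.0, 4.0]]
```

Because every query can attend to every key, a single layer relates positions regardless of their distance, which a limited convolutional receptive field cannot do.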
A new separator structure is designed based on the attention mechanism and DPRNN
- It consists of Segmentation, Dual-path Transformer blocks, and Overlap-add
- Segmentation stage
  1. The mixed music signal $W \in \mathbb{R}^{N \times L}$ is split into segments of length $K$ with hop size $P$
    - $N$ : input feature dimension, $L$ : input audio sequence length
  2. All segments are concatenated to form a 3D tensor $D \in \mathbb{R}^{N \times K \times S}$
    - $S$ : total number of segments
- Separating stage
  1. Performed by the Dual-path Transformer
    - Consists of an Inter-segment Transformer and an Intra-segment Transformer
    - They model global information and local information, respectively
  2. The $D$ obtained in the segmentation stage is passed through multiple dual-path separators
    - $B$ : number of dual-path transformer blocks
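
The segmentation and overlap-add steps can be sketched with NumPy as follows. Zero-padding the tail and normalizing overlap-add by the overlap count are assumptions made here so that the round trip is exact; the review does not state how the paper handles either. The sketch also assumes $P \le K$, and the function names are hypothetical.

```python
import math

import numpy as np


def segment(w, seg_len, hop):
    """Split W (N x L) into overlapping segments, giving D (N x K x S)."""
    n, length = w.shape
    num_seg = max(1, math.ceil((length - seg_len) / hop) + 1)
    # Assumption: zero-pad the tail so the last segment is complete
    padded = np.zeros((n, (num_seg - 1) * hop + seg_len), dtype=w.dtype)
    padded[:, :length] = w
    segs = [padded[:, s * hop : s * hop + seg_len] for s in range(num_seg)]
    return np.stack(segs, axis=-1)


def overlap_add(d, hop, length):
    """Merge D (N x K x S) back to N x L, averaging overlapped samples."""
    n, seg_len, num_seg = d.shape
    total = np.zeros((n, (num_seg - 1) * hop + seg_len))
    count = np.zeros_like(total)
    for s in range(num_seg):
        total[:, s * hop : s * hop + seg_len] += d[:, :, s]
        count[:, s * hop : s * hop + seg_len] += 1
    return (total / count)[:, :length]


w = np.arange(33, dtype=float).reshape(3, 11)  # N = 3, L = 11
d = segment(w, seg_len=4, hop=2)               # K = 4, P = 2
print(d.shape)                                  # (3, 4, 5)
print(np.allclose(overlap_add(d, hop=2, length=11), w))  # True
```

With $K = 4$ and $P = 2$ each interior sample belongs to two segments, so the inter-segment pass sees every sample from two neighboring contexts.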
Intra-segment & Inter-segment Transformer
- The Intra-segment Transformer is applied along the second dimension of $D$
  $D_{b}^{intra} = \mathrm{IntraTransformer}_{b}[D_{b-1}^{inter}] = \mathrm{Transformer}(D_{b-1}^{inter}[:, :, i]), \quad i = 1, 2, \ldots, S$
- Information across all segments is modeled by the Inter-segment Transformer along the last dimension of $D$
  $D_{b}^{inter} = \mathrm{InterTransformer}_{b}[D_{b-1}^{intra}] = \mathrm{Transformer}(D_{b-1}^{intra}[:, j, :]), \quad j = 1, 2, \ldots, K$
- $b = 1, 2, \ldots, B$ , $D_{0}^{inter} = D$
- After the Intra-segment Transformer models each segment, the Inter-segment Transformer aggregates information across the blocks to model utterance-level information
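
The slicing pattern behind the two passes can be made concrete with NumPy. This sketch follows the ordering described in the text (an intra-segment pass, then an inter-segment pass, repeated for $B$ blocks) and uses a simple mean-centering function as a stand-in for a real Transformer layer; all function names are hypothetical.

```python
import numpy as np


def apply_intra(d, fn):
    """Run fn on each segment D[:, :, i] (an N x K slice), i over S."""
    return np.stack([fn(d[:, :, i]) for i in range(d.shape[2])], axis=2)


def apply_inter(d, fn):
    """Run fn on each position D[:, j, :] (an N x S slice), j over K."""
    return np.stack([fn(d[:, j, :]) for j in range(d.shape[1])], axis=1)


def dual_path(d, intra_fns, inter_fns):
    """Alternate intra- and inter-segment passes, one pair per block."""
    for intra_fn, inter_fn in zip(intra_fns, inter_fns):
        d = apply_intra(d, intra_fn)
        d = apply_inter(d, inter_fn)
    return d


def center(x):
    """Placeholder for a Transformer layer: remove the mean along the sequence axis."""
    return x - x.mean(axis=1, keepdims=True)


d = np.arange(60, dtype=float).reshape(3, 4, 5)  # N = 3, K = 4, S = 5
out = dual_path(d, [center] * 2, [center] * 2)   # B = 2 blocks
print(out.shape)  # (3, 4, 5)
```

The intra pass only ever sees $K$ positions at a time and the inter pass only $S$, so each pass works on a sequence far shorter than $L$ while the pair together still connects distant samples.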

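Using a BiLSTM, information from previous segments can be reflected in the current segment
- With a Transformer, the separator gains context-aware modeling, but the standard transformer structure has difficulty exploiting the temporal order of the music signal
- Adding a positional embedding to the Transformer's audio embedding can inject order information
  - BUT, positional embedding does not suit the dual-path structure, and the model fails to converge when it is applied
- A BiLSTM is therefore used to improve the fully-connected layer of the feed-forward network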

- Loss Function

Most end-to-end methods use an L1 or L2 loss
- When the absolute difference between the estimate and the target is large, it hinders optimization and the accuracy of training cannot be guaranteed
- A phase shift can occur when the input phase is not passed directly to the output
Relative Mean Squared Error (RMSE) is preferred over the usual L1 and L2 losses
- With $x$ as the ground truth and $\bar{x}$ as the estimate, the MSE is:
  - $E_{mse} = (x - \bar{x})^{2}$
- RMSE extends MSE by introducing a relative factor into the loss
  - $E_{rmse} = \left( \frac{|x - \bar{x}|}{|x| + |\bar{x}| + \epsilon} \right)^{2}$
  - $|\cdot|$ : absolute value, $\epsilon$ : a constant greater than 0 (prevents division by zero)
  - RMSE is sensitive to small-valued spectral points
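
A per-point sketch of the two formulas above shows why the relative form is sensitive to small values. The default `eps=1e-8` and the function names are assumptions for illustration; the review does not give the constant used in the paper.

```python
def mse(x, x_hat):
    """Squared error for a single point."""
    return (x - x_hat) ** 2


def relative_mse(x, x_hat, eps=1e-8):
    """Relative squared error: (|x - x_hat| / (|x| + |x_hat| + eps)) ** 2."""
    return (abs(x - x_hat) / (abs(x) + abs(x_hat) + eps)) ** 2


# Both pairs are off by 50% of the target, at very different magnitudes
for target, estimate in [(1.0, 0.5), (0.01, 0.005)]:
    print(target, estimate, mse(target, estimate), relative_mse(target, estimate))
```

The plain MSE of the second pair is four orders of magnitude smaller than that of the first, so it would barely contribute to the gradient, while the relative error is close to $1/9$ for both.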

3. Experiments

- Settings

Dataset : MUSDB18HQ, MUSDB18 (4 labels: drums, vocals, bass, other)
Comparisons : Open-Unmix, D3Net, Wave-U-Net, Conv-Tasnet

- Results

In terms of SDR, the proposed method outperforms the other compared models
- Compared with the T-F domain D3Net, the proposed method scores 0.15dB higher on average
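
For readers unfamiliar with the metric, the following sketch computes a basic signal-to-distortion ratio as the energy of the reference over the energy of the error, in dB. This plain definition is an assumption for illustration; benchmark toolkits for MUSDB18 may compute SDR differently, and the review does not say which variant the paper reports.

```python
import math


def sdr_db(reference, estimate):
    """10 * log10(||s||^2 / ||s - s_hat||^2) for two equal-length sequences."""
    signal = sum(s * s for s in reference)
    error = sum((s - e) ** 2 for s, e in zip(reference, estimate))
    return 10.0 * math.log10(signal / error)


# An estimate with 10% amplitude error everywhere scores about 20 dB
print(round(sdr_db([1.0, -1.0], [0.9, -0.9]), 3))
```

Higher is better, and because the scale is logarithmic a fixed dB gain corresponds to a fixed ratio of error energy rather than a fixed absolute difference.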

With additional training data, the proposed method still achieves the best performance

The proposed method also performs well in the subjective evaluation of separation quality
