[Paper Review] Hybrid Transformers for Music Source Separation

feVeRin 2023. 12. 21. 10:32

Hybrid Transformers for Music Source Separation

In music source separation, both long-range contextual information and local acoustic features are useful
Hybrid Transformer Demucs (HT Demucs)
- A hybrid temporal/spectral bi-U-Net based on Hybrid Demucs
- The innermost layers are replaced with a Transformer Encoder
- Uses self-attention within each domain and cross-attention across domains
Paper (ICASSP 2023) : Paper Link

1. Introduction

Music Source Separation (MSS) usually splits a song into four stems: drums, bass, vocals, and other (all remaining instruments)
- The MUSDB18 dataset used for MSS is relatively small
  - 150 songs in total (87 songs in the training set)

Transformer-based architectures have recently been applied successfully in many fields
- For MSS in particular, both short and long context are useful
  - Conv-TasNet uses only about 1 second of context, focusing on local acoustic features
  - Demucs uses up to 10 seconds of context to resolve ambiguities in the input

-> The paper therefore introduces a Transformer architecture to exploit context information in MSS, and studies how much training data it needs

Hybrid Transformer Demucs (HT Demucs)
- Replaces the innermost layers of Hybrid Demucs with Transformer layers applied to both the time and spectral representations
- Uses self-attention within each domain and cross-attention across domains
- Adds a dataset beyond MUSDB to address the shortage of training data
  - An internal dataset of 800 songs

< Overview of This Paper >

A Transformer-based source separation architecture
Evaluation of the architecture under different depth, channel, context length, and augmentation settings

2. Architecture

The proposed HT Demucs is based on Hybrid Demucs
- Hybrid Demucs consists of two U-Nets
  1. A time-domain U-Net with temporal convolutions
  2. A spectrogram-domain U-Net with convolutions along the frequency axis
- Each U-Net has 5 encoder layers and 5 decoder layers
  - After the 5th encoder layer, the representations of the two U-Nets are summed before being passed to a shared 6th layer
  - Likewise, the first decoder layer is shared, and its output is sent to both the temporal and spectral branches
- The output of the spectral branch is converted to a waveform with the iSTFT before being summed with the output of the temporal branch
  - This sum is the model's actual prediction
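
As a compact restatement of the last step, the prediction for one source can be written as below. The symbols $\hat{x}$, $\hat{x}^{time}$ and $\hat{X}^{spec}$ are notation assumed here for illustration, not taken from the paper:
$\hat{x} = \hat{x}^{time} + \mathrm{iSTFT}(\hat{X}^{spec})$
- $\hat{x}^{time}$ : waveform output of the temporal branch, $\hat{X}^{spec}$ : spectrogram output of the spectral branch
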
HT Demucs keeps the outermost 4 layers of the original architecture unchanged
- Instead, the innermost 2 layers of the encoder and decoder, including the local attention and bi-LSTM, are replaced with a cross-domain Transformer encoder
  - This encoder processes the 2D signal of the spectral branch and the 1D signal of the waveform branch in parallel
- The cross-domain Transformer encoder can operate on heterogeneous data shapes
  - Unlike Hybrid Demucs, there is no need to tune parameters (STFT, stride, padding, etc.) to align the two representations
- Each Transformer encoder layer applies normalization before the self-attention and feed-forward operations
  - For training stability, it is additionally combined with Layer Scale initialized to $\epsilon = 10^{-4}$
  - The first two normalizations are layer normalizations, and the third is a time-layer normalization
- The cross-domain Transformer encoder interleaves self-attention encoder layers and cross-attention encoder layers in the spectral and waveform domains
  - 1D and 2D sinusoidal encodings are added to the scaled input
  - The spectral representation is reshaped so that it can be processed as a sequence
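
The pre-normalization and Layer Scale combination described above can be sketched in plain Python for a single token. This is a minimal sketch under stated assumptions, not the paper's implementation: `layer_norm`, `pre_norm_layer_scale`, and the toy `sublayer` are hypothetical names, and the per-channel scale `gamma` is kept fixed at its initial value instead of being learned.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalize one token (a list of channel values) to zero mean, unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def pre_norm_layer_scale(x, sublayer, scale_init=1e-4):
    """Residual block: x + gamma * sublayer(LayerNorm(x)).

    gamma is a per-channel scale. In a real model it is a learned parameter;
    here it simply stays at its initial value (assumption for illustration).
    """
    gamma = [scale_init] * len(x)
    out = sublayer(layer_norm(x))
    return [xi + g * oi for xi, g, oi in zip(x, gamma, out)]


# Toy stand-in for self-attention or feed-forward (hypothetical).
token = [1.0, -2.0, 0.5, 3.0]
y = pre_norm_layer_scale(token, lambda h: [2.0 * v for v in h])
```

With a scale this small, the block starts out very close to an identity mapping, so `y` differs from `token` only slightly.
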
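Memory use grows and attention slows down as the sequence length increases
- To scale further, the sparsity pattern is determined dynamically with Locality Sensitive Hashing (LSH) and used with the sparse attention kernels of the xformers package
  - A sparsity level of 90% is used (determined by 32 rounds of LSH with 4 buckets each)
- Elements that match at least $k$ times over the 32 rounds of LSH are selected, with $k$ chosen so that the sparsity level is 90%
  -> This variant is called Sparse HT Demucs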
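
The selection rule above (32 rounds, 4 buckets, keep pairs that match at least $k$ times) can be sketched as follows. The hash family is an assumption: the review does not say how tokens are bucketed, so this sketch uses the signs of two random projections per round to get 4 buckets. `lsh_match_counts` and `pick_threshold` are hypothetical names, and the sketch only builds the mask; it does not call xformers.

```python
import random


def lsh_match_counts(vectors, rounds=32, seed=0):
    """Count, for every pair of tokens, in how many LSH rounds they share a bucket."""
    rng = random.Random(seed)
    n, dim = len(vectors), len(vectors[0])
    counts = [[0] * n for _ in range(n)]
    for _ in range(rounds):
        # Two random hyperplanes give two sign bits, i.e. 4 buckets (assumed hash).
        planes = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(2)]
        buckets = []
        for v in vectors:
            bits = [sum(p * x for p, x in zip(plane, v)) >= 0 for plane in planes]
            buckets.append(2 * bits[0] + bits[1])
        for i in range(n):
            for j in range(n):
                if buckets[i] == buckets[j]:
                    counts[i][j] += 1
    return counts


def pick_threshold(counts, target_sparsity=0.9):
    """Smallest k such that keeping pairs with count >= k removes >= target_sparsity."""
    total = len(counts) ** 2
    max_count = max(max(row) for row in counts)
    for k in range(max_count + 2):
        kept = sum(c >= k for row in counts for c in row)
        if 1.0 - kept / total >= target_sparsity:
            return k


rng = random.Random(1)
tokens = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(64)]
counts = lsh_match_counts(tokens)
k = pick_threshold(counts)
mask = [[c >= k for c in row] for row in counts]  # True = kept in the softmax
sparsity = 1.0 - sum(sum(row) for row in mask) / len(mask) ** 2
```

Because match counts are integers, the achieved sparsity can land slightly above the 90% target.
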

3. Dataset

An internal dataset of 3500 songs from 200 artists was curated for training
- Hybrid Demucs is first trained on the MUSDB dataset, then used to preprocess the internal dataset:
  1. Keep only the stems for which all four sources are non-silent at least 30% of the time
    - For each 1-second segment, a stem is silent if its volume is below -40dB
  2. For a song $x$ of the dataset, let $x_{i}$ with $i \in \{ drums, bass, other, vocals \}$ denote each stem and $f$ the Hybrid Demucs model, and define $y_{i,j} = f(x_{i})_{j}$
    - This is output $j$ obtained when separating stem $i$
    - If every stem were perfectly labeled and $f$ were a perfect source separation model, then $y_{i,j}= x_{i}\delta_{i,j}$ (where $\delta_{i,j}$ is the Kronecker delta)
    - For a waveform $z$, the volume in dB measured over 1-second segments is defined as:
    $V(z) = 10 \cdot \log_{10} (\mathrm{AveragePool} (z^{2}, 1\,\mathrm{sec}))$
- For each pair of sources $i, j$, take the 1-second segments where the stem is present and define $P_{i,j}$ as the proportion of those segments where $V(y_{i,j}) -V(x_{i}) > -10dB$
  - This gives a square matrix $P \in [0,1]^{4 \times 4}$, with $P= Id$ under perfect conditions
  - Keep only the songs for which $P_{i,i} > 70 \%$ for every source $i$ and $P_{i,j}< 30 \%$ for every source pair $i \neq j$:
  -> This yields an additional dataset of 800 songs
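
The filtering rule above can be sketched in plain Python. This is an illustrative sketch, not the authors' code: the function names, the `floor` value that avoids taking the log of zero, and the reading of "stem is present" as "the 1-second segment of $x_{i}$ is not silent (at least -40dB)" are assumptions.

```python
import math


def segment_volumes_db(z, sample_rate, floor=1e-10):
    """V(z): 10 * log10 of the mean squared amplitude over each 1-second segment."""
    volumes = []
    for start in range(0, len(z) - sample_rate + 1, sample_rate):
        segment = z[start:start + sample_rate]
        power = sum(s * s for s in segment) / sample_rate
        volumes.append(10.0 * math.log10(max(power, floor)))
    return volumes


def proportion_above_margin(y_ij, x_i, sample_rate, silent_db=-40.0, margin_db=-10.0):
    """P_ij: share of non-silent segments of x_i where V(y_ij) - V(x_i) > margin_db."""
    pairs = zip(segment_volumes_db(y_ij, sample_rate), segment_volumes_db(x_i, sample_rate))
    present = [(vy, vx) for vy, vx in pairs if vx >= silent_db]
    if not present:
        return 0.0
    return sum(vy - vx > margin_db for vy, vx in present) / len(present)


def keep_song(P, diag_min=0.7, off_diag_max=0.3):
    """Selection rule: P_ii > 70% for all i and P_ij < 30% for all i != j."""
    n = len(P)
    diag_ok = all(P[i][i] > diag_min for i in range(n))
    off_ok = all(P[i][j] < off_diag_max for i in range(n) for j in range(n) if i != j)
    return diag_ok and off_ok


sample_rate = 100  # toy rate to keep the example small
stem = [0.5 * math.sin(2 * math.pi * 5 * t / sample_rate) for t in range(3 * sample_rate)]
leak = [0.01 * s for s in stem]  # the same stem, 40dB quieter
p_same = proportion_above_margin(stem, stem, sample_rate)
p_leak = proportion_above_margin(leak, stem, sample_rate)
```

In this toy case a stem that is reproduced exactly counts in every segment, while a copy that is 40dB quieter counts in none.
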

4. Experiments and Results

- Settings

Dataset : MUSDB18 + additional internal dataset
Comparisons : KUIELAB-MDX-Net, Hybrid Demucs, Band-split RNN, Spleeter, D3Net, Demucs v2

- Results

Comparison with the Baselines
- Compared with Hybrid Demucs, on which it is based, HT Demucs improves SDR by 0.45dB
- With fine tuning and sparse modeling also applied, the gain in separation performance reaches up to 0.9dB
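
For readers unfamiliar with the metric, a simplified signal-to-distortion ratio can be sketched as below. This definition (reference energy over error energy, in dB) is an assumption for illustration; the paper's reported numbers come from its own evaluation protocol, which may differ in framing and aggregation. `sdr_db` is a hypothetical name.

```python
import math


def sdr_db(reference, estimate):
    """Simplified SDR: 10 * log10(sum(ref^2) / sum((ref - est)^2))."""
    signal = sum(r * r for r in reference)
    error = sum((r - e) ** 2 for r, e in zip(reference, estimate))
    if signal == 0:
        raise ValueError("reference must not be silent")
    if error == 0:
        return math.inf
    return 10.0 * math.log10(signal / error)


reference = [math.sin(0.1 * t) for t in range(1000)]
estimate = [0.9 * r for r in reference]  # estimate that is 10% too quiet
sdr = sdr_db(reference, estimate)  # about 20 dB
```
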

Impact of the Architecture Hyper-parameters
- With a 3.4-second segment duration, increasing the depth of the Transformer encoder increases SDR
- With a 7.8-second segment duration, using depth 5 / dimension 384 improves performance by 0.6dB

Impact of the Data Augmentation
- Although HT Demucs is trained on a larger training dataset, disabling some data augmentations can still lower SDR
- Remix augmentation in particular is an important part of training: removing it lowers SDR by 0.7dB
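
The review does not define remix augmentation. The sketch below assumes the usual meaning in the Demucs family: stems are shuffled across the songs of a batch, independently per source, and summed into new mixtures. `remix_batch`, `mixture`, and the toy batch are hypothetical and only illustrate the idea.

```python
import random


def remix_batch(batch, seed=0):
    """Shuffle each source independently across the songs of a batch.

    batch: list of songs, each a dict mapping source name -> list of samples.
    """
    rng = random.Random(seed)
    remixed = [dict() for _ in batch]
    for source in batch[0]:
        order = list(range(len(batch)))
        rng.shuffle(order)
        for target, origin in enumerate(order):
            remixed[target][source] = batch[origin][source]
    return remixed


def mixture(song):
    """Sum the stems of one song sample by sample."""
    return [sum(samples) for samples in zip(*song.values())]


sources = ["drums", "bass", "other", "vocals"]
# Toy batch: 3 songs, 4 stems each, every stem a constant 4-sample signal.
batch = [{s: [float(10 * n + i)] * 4 for i, s in enumerate(sources)} for n in range(3)]
remixed = remix_batch(batch)
new_mixes = [mixture(song) for song in remixed]
```

Each source still appears exactly once per batch slot, so the model sees the same stems in new combinations.
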

Impact of Using Sparse Kernels and Fine Tuning
- Sparse HT Demucs, which uses sparse kernels, improves SDR by 0.14dB
- Fine tuning requires 50 more training epochs but improves SDR by 0.25dB
