[Paper Review] FC-U$^{2}$-Net: A Novel Deep Neural Network for Singing Voice Separation

feVeRin 2023. 9. 20. 16:56

FC-U$^{2}$-Net: A Novel Deep Neural Network for Singing Voice Separation

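A neural network for singing voice separation, which separates vocals and accompaniment from a mixed music signal
FC-U$^{2}$-Net
- A two-level nested U-Net structure with time-invariant fully connected layers added along the frequency axis
- Captures local/global contextual information and long-range correlations of the audio signal along the frequency axis
- Uses a loss function that combines a ratio mask and a binary mask for clean vocal separation
Paper (TASLP 2022) : Paper Link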

1. Introduction

Singing Voice Separation (SVS) is the task of separating the singing voice from an observed music mixture
- Nonnegative matrix factorization, Independent component analysis, Low-rank model
  - Generalize poorly because of the diversity of music signals
- Waveform based model & Spectrogram based model
  - Waveform based models are trained end-to-end to generate the target waveform directly
  - Spectrogram based models estimate the magnitude spectrogram
- Waveform based models are advantageous for SVS but require a large amount of data
  - Using spectrograms produced by the Short-Time Fourier Transform (STFT) reduces the impact of limited data
U-Net uses an Encoder-Decoder structure with skip connections to preserve fine-grained detail
- Densely-connected CNNs in a U-Net structure incur a high computational cost
- The two-level nested U-structure of U$^{2}$-Net can reduce the computational overhead of Dense blocks
  - U$^{2}$-Net is symmetric and needs no pretrained backbone, which makes it well suited to SVS

-> Therefore, a neural network for SVS is proposed based on the U$^{2}$-Net structure

FC-U$^{2}$-Net
- Dense blocks in U-Net extract local patterns
  - BUT, local patterns differ across the frequency bands of audio
  - A fully-connected layer can handle the change in data distribution across bands and extract global features
- Existing spectrogram based methods separate vocals with the Ideal Ratio Mask
  - BUT, residual accompaniment tends to remain in the separated vocals

< Overview of FC-U$^{2}$-Net >

A two-level nested U-Net structure with time-invariant fully-connected layers
A loss function designed around the Ideal Ratio Mask (IRM) and the Ideal Binary Mask (IBM)

2. Proposed Method

- Spectrogram-Based SVS Framework

Applying the STFT to the time-domain input signal yields a magnitude spectrogram and a phase spectrogram
- Training uses only the magnitude spectrogram
- At test time,
  1. The trained network estimates the magnitude spectrogram of the human voice
  2. The estimate is combined with the mixture phase to resynthesize the time-domain voice signal
Notations
- $X(t,f)$ : input magnitude spectrogram of the mixture signal
- $Y(t,f)$ : ground truth magnitude spectrogram of the singing voice signal
- $\hat{Y}(t,f)$ : output magnitude spectrogram of the estimated singing voice signal
- $t$, $f$ : time and frequency index, respectively
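
The test-time procedure above can be sketched in NumPy as follows. This is only an illustration under stated assumptions: the FFT size, hop length, and Hann window are placeholder choices rather than the paper's settings, and `estimate_magnitude` is a hypothetical stand-in for the trained network that maps $X$ to $\hat{Y}$.

```python
import numpy as np

N_FFT, HOP = 1024, 256  # illustrative STFT settings, not taken from the paper


def stft(x, n_fft=N_FFT, hop=HOP):
    """Complex spectrogram of shape (T, F) with F = n_fft // 2 + 1."""
    window = np.hanning(n_fft + 1)[:-1]   # periodic Hann window
    x = np.pad(x, (n_fft, n_fft))         # zero-pad so every sample is fully overlapped
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop:i * hop + n_fft] * window for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)


def istft(spec, length, n_fft=N_FFT, hop=HOP):
    """Weighted overlap-add inverse of `stft`, trimmed to `length` samples."""
    window = np.hanning(n_fft + 1)[:-1]
    frames = np.fft.irfft(spec, n=n_fft, axis=1) * window
    out = np.zeros(n_fft + hop * (len(frames) - 1))
    norm = np.zeros_like(out)
    for i, frame in enumerate(frames):
        out[i * hop:i * hop + n_fft] += frame
        norm[i * hop:i * hop + n_fft] += window ** 2
    out /= np.maximum(norm, 1e-8)         # undo the squared-window weighting
    return out[n_fft:n_fft + length]      # drop the padding added in `stft`


def separate_voice(mixture, estimate_magnitude):
    """Estimate the voice magnitude, then resynthesize with the mixture phase."""
    spec = stft(mixture)
    X, phase = np.abs(spec), np.angle(spec)   # magnitude and phase spectrograms
    Y_hat = estimate_magnitude(X)             # hypothetical stand-in for the network
    return istft(Y_hat * np.exp(1j * phase), length=len(mixture))
```

Passing an identity function as `estimate_magnitude` returns the mixture itself, which is a convenient sanity check that the magnitude/phase split and the resynthesis are consistent.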

- FC-U$^{2}$-Net

The top level consists of 11 stages (6 encoders, 5 decoders) that form a large U-structure
- Each stage is a Residual U-block (RSU), which itself forms a bottom-level U-structure
- The input of FC-U$^{2}$-Net is the mixture magnitude spectrogram obtained by the STFT
  - $X(C \times T \times F)$ : three dimensions, Channel, Time (number of frames), Frequency (number of frequency bins in the spectrogram)
- The output contains the ratio mask and the binary mask of the vocals
RSU is the intermediate block of the top-level U-structure and consists of 3 parts
- Designed to exploit both local and global context information
- Part 1 : Input convolution layer
  - Transforms the input feature map $X_{1}(C_{in} \times T \times F)$ into the intermediate map $X_{2}(C_{out} \times T \times F)$ to extract local features
- Part 2 : U-Net style symmetric encoder-decoder structure of height $L$ ($L$ : number of encoders)
  - Takes $X_{2}$ as input and outputs $X_{3}(C_{out} \times T \times F)$
  - Like U-Net, captures multi-scale features without losing high resolution features
  - The last encoder uses dilated convolution, while the other encoders use vanilla convolution
- Part 3 : Two time-invariant fully-connected layers
  - The first layer maps the input to a hidden feature space
  - The second layer maps the internal vector back to the original size, giving $X_{4}(C_{out} \times T \times F)$
  - The number of hidden units is set to $F/4$
  - The fully-connected layers are applied identically to every time frame : each bin of $X_{4}$ contains information from all frequency bands of $X_{3}$
- The output feature map is the sum of $X_{2}$ and $X_{4}$
RSU uses two time-invariant fully-connected layers to exploit global frequency correlations
- Because spectrograms exhibit non-local correlations along the frequency axis
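
Part 3 of the RSU and its residual sum can be illustrated with a short NumPy sketch. Several details here are assumptions, not taken from the paper: the ReLU between the two layers, sharing the same weights across channels, and the random initialization in the hypothetical helper `init_params`. Only the shapes ($F \to F/4 \to F$), the per-frame weight sharing, and the sum $X_{2} + X_{4}$ follow the description above.

```python
import numpy as np


def init_params(n_freq, rng):
    """Hypothetical helper: random weights for the two dense layers (F -> F/4 -> F)."""
    hidden = n_freq // 4
    w1 = rng.standard_normal((n_freq, hidden)) * 0.1
    b1 = np.zeros(hidden)
    w2 = rng.standard_normal((hidden, n_freq)) * 0.1
    b2 = np.zeros(n_freq)
    return w1, b1, w2, b2


def time_invariant_fc(x3, w1, b1, w2, b2):
    """Apply the same two dense layers along F to every frame of x3, shape (C, T, F)."""
    hidden = np.maximum(x3 @ w1 + b1, 0.0)  # (C, T, F/4); the ReLU is an assumption
    return hidden @ w2 + b2                 # (C, T, F): each bin mixes all frequency bins


def rsu_output(x2, x3, params):
    """Output feature map of the RSU: X2 + X4."""
    return x2 + time_invariant_fc(x3, *params)
```

Because the weights do not depend on the frame index, permuting the time frames of $X_{3}$ permutes the frames of $X_{4}$ in the same way, which is what "time-invariant" means in this sketch.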

FC-U$^{2}$-Net consists of 3 parts
- Six-Stage Encoder : $En_{1}, En_{2}, En_{3}, En_{4}, En_{5}, En_{6}$
  - $C_{in}$, $C_{m}$, $C_{out}$ : input, middle, and output channels of each block
  - Feature maps in $En_{5}$ and $En_{6}$ have low resolution
  : RSU-4F with dilated convolution is used to avoid the information loss caused by downsampling
- Five-Stage Decoder : $De_{1}, De_{2}, De_{3}, De_{4}, De_{5}$
  - Symmetric to the encoder
  - Input of each decoder stage
  : the upsampled feature map from the previous stage concatenated with the feature map from the symmetric encoder stage
- Two convolution layers
  - Generate the ratio mask and the binary mask
  - ReLU activation is applied to the ratio mask $\hat{M}_{R}$
  : because the magnitude spectrogram is non-negative
  - Sigmoid activation is applied to the binary mask $\hat{M}_{B}$

Configuration of the RSU block applied at each Encoder and Decoder stage

- Combined Loss Function

$\hat{Y} = X \odot \hat{M}_{R}$ : voice magnitude spectrogram estimated with the IRM
- $\odot$ : element-wise multiplication
- $l_{IRM} = \sum_{(t,f)} (\hat{Y}(t,f)-Y(t,f))^{2}$ : IRM based loss
  - Squared error between the estimated magnitude $\hat{Y}$ and the ground truth magnitude $Y$ (the Mean Squared Error up to a normalization constant)
  - $(t,f)$ : each $T-F$ bin of the magnitude spectrogram
The IRM loss preserves the perceptual quality of the separated vocals but does not remove the accompaniment cleanly
- $l_{IBM} = - \sum_{(t,f)} [M_{B} \log \hat{M}_{B} + (1-M_{B}) \log (1- \hat{M}_{B})]$
  - The IBM is added to the loss to remove the influence of the accompaniment from the separated vocals
  - $M_{B}$ : IBM
- For every $T-F$ unit of $X, Y$, $M_{B}$ is defined as
  $M_{B}(t,f) = \begin{cases} 1, & Y(t,f) \geq 0.5 \cdot X(t,f) \\ 0, & \text{otherwise} \end{cases}$
Final Loss function
- $L = l_{IRM} + l_{IBM}$
- Accounts for perceptual quality and separation purity at the same time
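
The combined loss can be written out directly from the formulas above. The NumPy sketch below is a minimal illustration, not the authors' implementation: the function names are hypothetical, and the `eps` clipping inside the logarithm is an added numerical safeguard that does not appear in the formulas.

```python
import numpy as np


def ideal_binary_mask(X, Y):
    """M_B(t, f) = 1 where Y(t, f) >= 0.5 * X(t, f), else 0."""
    return (Y >= 0.5 * X).astype(np.float64)


def combined_loss(X, Y, m_ratio, m_binary, eps=1e-7):
    """Return (L, l_IRM, l_IBM) for one magnitude spectrogram.

    X, Y     : mixture and ground truth voice magnitudes, same shape
    m_ratio  : predicted ratio mask (non-negative, e.g. after ReLU)
    m_binary : predicted binary mask probabilities in (0, 1), e.g. after sigmoid
    """
    y_hat = X * m_ratio                        # Y_hat = X (element-wise) M_R
    l_irm = np.sum((y_hat - Y) ** 2)           # squared error on magnitudes
    m_b = ideal_binary_mask(X, Y)
    p = np.clip(m_binary, eps, 1.0 - eps)      # assumption: avoid log(0)
    l_ibm = -np.sum(m_b * np.log(p) + (1.0 - m_b) * np.log(1.0 - p))
    return l_irm + l_ibm, l_irm, l_ibm
```

For example, with $X = [[1, 2], [4, 0]]$ and $Y = [[0.8, 0.5], [2, 0]]$ the IBM is $[[1, 0], [1, 1]]$: the bin where the voice carries less than half of the mixture magnitude is labeled 0, which is the signal that pushes the network to suppress accompaniment-dominated bins.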

3. Experiments

- Experimental Setup

Datasets : MUSDB18 (4 music sources: vocals, drums, bass, other)
Comparisons : Dense-U-Net, UMX, TAK1, UHL2, Conv-TasNet, Demucs

- Results

U$^{2}$-Net with RSU blocks outperforms Dense-U-Net
- The two-level nested U-Net structure is effective for SVS
- With time-invariant fully-connected layers, SDR, SIR, and SAR improve for both vocals and accompaniment
  - Because the fully-connected layer has a full receptive field along the frequency dimension and captures global contextual information
- The combined IRM and IBM loss improves SIR in particular
  - This indicates that the separated sound is purer and contains less noise

Adding the IBM loss yields a cleaner vocal signal than using the IRM loss alone
- More effective on parts of the mixture that contain only accompaniment and no vocals

Magnitude spectrograms of the vocal signals for two songs: (a) Georgia Wonder - 'Siren' / (b) Punkdisco - 'Oral Hygiene'

FC-U$^{2}$-Net also achieves the best performance when compared with existing SVS models
