[Paper Review] SUNAC: Source-Aware Unified Neural Audio Codec

feVeRin 2026. 3. 24. 13:40

SUNAC: Source-Aware Unified Neural Audio Codec

Neural audio codecs encode a mixture of multiple sources in an entangled manner, so they can be ill-suited to downstream processing that needs access to only a subset of the sources
SUNAC
- Encodes individual sources from a mixture, conditioned on source-type prompts
- As a source-aware codec, supports user-driven selection and separate encoding of sources
Paper (ICASSP 2026) : Paper Link

1. Introduction

A Neural Audio Codec (NAC) converts an audio signal into discrete tokens
- To do so, SoundStream and DAC use Generative Adversarial Network (GAN)-based training, a convolutional encoder/decoder, and a Residual Vector Quantization (RVQ) module
- BUT, existing NACs are trained without source awareness, so they cannot disentangle a mixture
  - They are therefore suboptimal for downstream tasks that operate on a single source

-> Hence the paper proposes SUNAC, a single, source-aware neural codec

SUNAC
- Performs prompt-based source feature extraction in the latent space, then estimates codes from the separated features with a quantizer
- Adopts Permutation Invariant Training (PIT) to handle multiple sources of the same type, and uses the prompt mechanism to remove any pre-defined cap on the number of sources

< Overview of SUNAC >

A source-aware neural codec built on prompt-based feature extraction and PIT
As a result, it is reported to outperform the compared baselines

2. Method

- Problem Setup

Each source is associated with a source type given by a prompt $p_{n}\in\mathcal{T}$, and the input waveform $\mathbf{x}=\sum_{n}\mathbf{s}_{n}^{(p_{n})}\in\mathbb{R}^{L}$ consists of $N\geq 1$ sources $\mathbf{s}_{n}^{(p_{n})}$
- The paper treats each source as speech, music, SFX, or a mixture of them, and defines $\mathcal{T}=\{\texttt{<Speech>}, \texttt{<Music>},\texttt{<SFX>},\texttt{<Mix>}\}$
- SUNAC aims to extract codes for the sources specified by the desired set of prompts
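
As a toy illustration of this setup (not taken from the paper), the sketch below builds a mixture $\mathbf{x}$ as the sum of $N$ sources, each tagged with a prompt from $\mathcal{T}$. The random signals, the length `L = 16000`, and all variable names are assumptions made for illustration only.

```python
import numpy as np

# Prompt vocabulary T from the problem setup.
PROMPTS = ("<Speech>", "<Music>", "<SFX>", "<Mix>")

rng = np.random.default_rng(0)
L = 16000  # hypothetical waveform length in samples

# N = 3 toy sources; random noise stands in for real audio.
source_prompts = ["<Speech>", "<Speech>", "<Music>"]
sources = rng.standard_normal((len(source_prompts), L))

# Mixture x = sum_n s_n, a single waveform in R^L.
x = sources.sum(axis=0)

# A user selects what to encode by listing prompts;
# "<Mix>" asks for the mixture itself.
requested = ["<Speech>", "<Speech>", "<Mix>"]
assert all(p in PROMPTS for p in requested)
print(x.shape, len(requested))
```

Note that the same prompt can appear more than once, which is the case the permutation-invariant objective later has to handle.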

- SDCodec

SDCodec considers a restricted version with $N=3$, where there is exactly one speech, one music, and one SFX source
- Specifically, SDCodec extends DAC by inserting a source-aware RVQ module after the convolutional encoder
  1. All sources thus share a common encoder-derived feature space while each gets its own quantization pathway, ensuring orthogonality between source features
  2. In addition, the mixture is reconstructed by decoding the sum of the per-source quantized features
- In contrast, SUNAC estimates a prompt-conditioned feature space and reconstructs the mixture directly by prompting the model with $\texttt{<Mix>}$
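
The per-source quantization idea can be sketched with a generic residual vector quantizer. This is a minimal illustration, not SDCodec's implementation: the codebooks and per-source features are random placeholders, the shapes `(T, F)` are chosen for convenience, and `rvq_encode` is a hypothetical helper name.

```python
import numpy as np

def rvq_encode(z, codebooks):
    """Residual VQ: each stage quantizes the residual left by the previous stage."""
    residual = z.copy()
    quantized = np.zeros_like(z)
    codes = []
    for cb in codebooks:  # cb has shape (K, F)
        # Squared Euclidean distance from every frame to every codeword: (T, K).
        d = ((residual[:, None, :] - cb[None, :, :]) ** 2).sum(-1)
        idx = d.argmin(axis=1)   # nearest codeword index per frame
        q = cb[idx]              # (T, F)
        quantized += q
        residual -= q
        codes.append(idx)
    return np.stack(codes), quantized  # (n_stages, T), (T, F)

rng = np.random.default_rng(0)
T, F, K, n_stages = 50, 8, 16, 3
types = ["<Speech>", "<Music>", "<SFX>"]

# One RVQ pathway (list of codebooks) per source type; random placeholders.
codebooks = {t: [rng.standard_normal((K, F)) for _ in range(n_stages)] for t in types}
# Placeholder per-source features living in a shared feature space.
features = {t: rng.standard_normal((T, F)) for t in types}

codes, quantized = {}, {}
for t in types:
    codes[t], quantized[t] = rvq_encode(features[t], codebooks[t])

# The mixture is represented by the sum of per-source quantized features,
# which a decoder (not shown) would turn back into a waveform.
mix_feature = sum(quantized.values())
print(codes["<Speech>"].shape, mix_feature.shape)
```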

- Separation and NAC Cascade

One could consider a cascaded system that first separates the mixture into source-specific waveforms and then encodes each of them
- For this, TUSS, which encodes the mixture waveform with an STFT and a band-split encoder, is used as the front-end
  - The encoded features and learnable prompts are transformed by TF-Locoformer
- The transformed features are then conditioned by element-wise multiplication with the transformed prompts; the conditioned features are refined by TF-Locoformer and mapped back to the time domain with an inverse band-split and iSTFT
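
The multiplicative conditioning step can be sketched with simple broadcasting. This is only a shape-level illustration under assumed dimensions: the band axis of the real front-end is omitted, and `features` and `prompts` are random stand-ins for the transformed features and transformed prompts.

```python
import numpy as np

rng = np.random.default_rng(0)
D, T, N = 4, 6, 3  # hypothetical feature size, frames, number of prompts

features = rng.standard_normal((D, T))  # stand-in for transformed mixture features
prompts = rng.standard_normal((N, D))   # stand-in for transformed prompt vectors

# Broadcast each prompt over time and multiply element-wise:
# one shared mixture encoding yields N prompt-specific feature maps.
conditioned = prompts[:, :, None] * features[None, :, :]  # (N, D, T)
print(conditioned.shape)
```

Each of the `N` slices would then be refined and mapped back to a waveform, after which a separate codec encodes it, which is the redundancy the cascade incurs.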

- SUNAC

To avoid redundant processing, SUNAC applies a conditional feature extractor in the feature space, replacing the explicit separation of the cascaded system
- First, for feature dimension $F$ and number of frames $T$, the encoder maps the input waveform $\mathbf{x}$ to a continuous time-frequency (TF)-like representation $\mathbf{X}\in\mathbb{R}^{F\times T}$
  1. The conditional feature extractor estimates separated TF representations based on learnable prompts
  2. The quantizer and decoder are source-agnostic and shared across all sources
    - The quantizer discretizes the separated TF representations using a multi-layer RVQ with projections
    - The decoder takes the quantized TF features as input and estimates waveforms $\hat{\mathbf{s}}\in\mathbb{R}^{N\times L}$ for the $N$ prompts
- The conditional feature extractor consists of cross-prompt, conditioning, and target-source extraction modules
  1. The cross-prompt module takes the encoded TF representation $\mathbf{X}\in\mathbb{R}^{F\times T}$ and the learnable prompt vectors $\mathbf{P}\in\mathbb{R}^{F\times N}$ corresponding to the $N$ prompts $(p_{n})_{n}$
    - For multi-source mixture reconstruction, the model is trained to reconstruct the mixture when given $\texttt{<Mix>}$
  2. The $N$ prompts are concatenated to $\mathbf{X}$ along the time axis, a Transformer is applied, and the first $N$ tokens are split off to obtain the transformed prompts $\mathbf{P}'\in\mathbb{R}^{F\times N}$ and transformed features $\mathbf{X}'\in\mathbb{R}^{F\times T}$
    - Because positional encoding and self-attention are used, prompts with identical content but different positions produce different representations
  3. As a result, the input TF features are mapped, according to the prompts, into a space suited to conditional extraction
- The conditioning module applies Feature-wise Linear Modulation (FiLM) to the transformed features $\mathbf{X}'$, using a residual connection and the transformed prompt $\mathbf{P}'_{n}\in\mathbb{R}^{F}$
  1. Here the trainable functions $f,h$ are simple linear transformations shared across all prompts
  2. The FiLM output is then given by:
    (Eq. 1) $ \text{FiLM}(\mathbf{X}'|\mathbf{P}'_{n})=f(\mathbf{P}'_{n})\odot \mathbf{X}'+h(\mathbf{P}'_{n})$
    - $\odot$ : element-wise product
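
The cross-prompt concatenation and the FiLM step of Eq. 1 can be sketched at the shape level as follows. Several things here are assumptions rather than the paper's implementation: the Transformer is replaced by an identity stub (`transformer_stub`), the weights of $f$ and $h$ are random, and the residual connection is left out because the notes above do not pin down where it is applied.

```python
import numpy as np

rng = np.random.default_rng(0)
F, T, N = 8, 20, 3  # hypothetical feature dim, frames, number of prompts

X = rng.standard_normal((F, T))  # encoder output, R^{F x T}
P = rng.standard_normal((F, N))  # learnable prompt vectors, R^{F x N}

def transformer_stub(tokens):
    # Placeholder for the Transformer; identity keeps the sketch self-contained.
    return tokens

# Cross-prompt: put the N prompt tokens in front of X along the time axis,
# transform the sequence, then split the first N tokens off again.
tokens = transformer_stub(np.concatenate([P, X], axis=1))  # (F, N + T)
P_t, X_t = tokens[:, :N], tokens[:, N:]                    # (F, N), (F, T)

# f and h: linear maps shared by all prompts (random weights for illustration).
W_f, b_f = rng.standard_normal((F, F)), np.zeros(F)
W_h, b_h = rng.standard_normal((F, F)), np.zeros(F)

def film(X_t, p):
    """Eq. 1: f(p) * X' + h(p), with scale and shift broadcast over time."""
    scale = W_f @ p + b_f  # (F,)
    shift = W_h @ p + b_h  # (F,)
    return scale[:, None] * X_t + shift[:, None]

# One conditioned feature map per prompt.
conditioned = np.stack([film(X_t, P_t[:, n]) for n in range(N)])  # (N, F, T)
print(conditioned.shape)
```

Compared with the purely multiplicative conditioning of the cascade front-end, FiLM adds a prompt-dependent shift; setting $h$ to zero recovers the multiplicative form.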

- Training Objective

SUNAC is trained with a permutation-invariant objective
- The objective is:
  (Eq. 2) $ \mathcal{L}_{SUNAC}=\min_{\pi\in\tilde{\mathcal{P}}_{S}}\sum_{i=1}^{S}\mathcal{L}_{DAC} \left(s_{i},\hat{s}_{\pi(i)}\right)+\mathcal{L}_{DAC}\left(s_{mix},\hat{s}_{mix}\right)$
  - $s_{i}, \hat{s}_{\pi (i)}$ : $i$-th ground-truth source and the estimate assigned to it, $s_{mix}, \hat{s}_{mix}$ : ground-truth/estimated mixture
  - $S$ : number of sources, $\tilde{\mathcal{P}}_{S}$ : the subset of permutations of $\{1,...,S\}$ that permute only indices whose prompts are of the same type
- $\pi\in\tilde{\mathcal{P}}_{S}$ can, for example, align multiple $\texttt{<Speech>}$ estimates with the appropriate references
- $\mathcal{L}_{DAC}$ is the DAC loss, a weighted sum of multi-scale mel-spectrogram loss, adversarial loss, codebook loss, commitment loss, and discriminator loss
  - It uses the Multi-Period Discriminator from HiFi-GAN and the complex Multi-Scale STFT Discriminator from UnivNet
- Evaluating every DAC component for every permutation is computationally prohibitive
  - The paper therefore determines the permutation with an SI-SDR-based criterion and computes the DAC loss only for the output-reference assignment that maximizes the total SI-SDR
  - The loss is then given by:
    (Eq. 3) $\pi^{*}=\arg\max_{\pi\in\tilde{\mathcal{P}}_{S}}\sum_{i=1}^{S}\text{SI-SDR}(s_{i},\hat{s}_{\pi(i)})$
    (Eq. 4) $\mathcal{L}_{SUNAC}=\sum_{i=1}^{S}\mathcal{L}_{DAC}\left(s_{i},\hat{s}_{\pi^{*}(i)}\right) + \mathcal{L}_{DAC}\left(s_{mix},\hat{s}_{mix}\right)$
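
The permutation search of Eq. 3 can be sketched as below. This is a minimal illustration under stated assumptions: `si_sdr` is a standard scale-invariant SDR written from its usual definition, `restricted_permutations` and `best_permutation` are hypothetical helper names, the signals are random toys, and the DAC loss of Eq. 4 is not implemented.

```python
import itertools
import numpy as np

def si_sdr(ref, est, eps=1e-8):
    """Scale-invariant SDR in dB between a reference and an estimate."""
    alpha = np.dot(est, ref) / (np.dot(ref, ref) + eps)
    target = alpha * ref
    noise = est - target
    return 10 * np.log10((np.dot(target, target) + eps) / (np.dot(noise, noise) + eps))

def restricted_permutations(prompts):
    """Yield permutations of range(S) that only swap indices sharing a prompt type."""
    groups = {}
    for i, p in enumerate(prompts):
        groups.setdefault(p, []).append(i)
    group_list = list(groups.values())
    per_group = (itertools.permutations(g) for g in group_list)
    for combo in itertools.product(*per_group):
        perm = list(range(len(prompts)))
        for g, permuted in zip(group_list, combo):
            for src, dst in zip(g, permuted):
                perm[src] = dst
        yield tuple(perm)

def best_permutation(refs, ests, prompts):
    """Eq. 3: pick the restricted permutation with the largest total SI-SDR."""
    def total(pi):
        return sum(si_sdr(refs[i], ests[pi[i]]) for i in range(len(prompts)))
    return max(restricted_permutations(prompts), key=total)

rng = np.random.default_rng(0)
prompts = ["<Speech>", "<Speech>", "<Music>", "<SFX>"]
refs = rng.standard_normal((4, 4000))
# Toy estimates: the two speech outputs come out in swapped order, plus small noise.
ests = refs[[1, 0, 2, 3]] + 0.01 * rng.standard_normal((4, 4000))

candidates = list(restricted_permutations(prompts))
pi_star = best_permutation(refs, ests, prompts)
print(len(candidates), pi_star)  # 2 (1, 0, 2, 3)
```

With these four prompts only 2 of the 24 unrestricted permutations are candidates, and the expensive DAC loss would be evaluated once, for `pi_star`, rather than once per candidate.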

3. Experiments

- Settings

Dataset : Divide and Remaster
Comparisons : DAC, SDCodec

- Results

Overall, SUNAC performs well relative to the compared codecs

For $\{\texttt{<Speech>}, \texttt{<Music>},\texttt{<SFX>}\}$, it reconstructs both the mixture and the separated sources well

Reconstruction: $\{\texttt{<Speech>}, \texttt{<Music>},\texttt{<SFX>}\}$

It also performs strongly on 2-speaker separation with $\{\texttt{<Speech>}, \texttt{<Speech>}\}$

It likewise performs well in the $\{\texttt{<Speech>}, \texttt{<Speech>},\texttt{<Music>},\texttt{<SFX>}\}$ setting

Reconstruction: $\{\texttt{<Speech>}, \texttt{<Speech>},\texttt{<Music>},\texttt{<SFX>}\}$
