[Paper Review] SemantiCodec: An Ultra Low Bitrate Semantic Audio Codec for General Sound

feVeRin 2025. 11. 18. 13:07

SemantiCodec: An Ultra Low Bitrate Semantic Audio Codec for General Sound

Most neural codecs operate at high bitrates and cover only a narrow domain
SemantiCodec
- Compresses diverse domains such as speech, general sound, and music into fewer than 100 tokens per second
- Uses a dual-encoder architecture consisting of a self-supervised pre-trained Audio Masked AutoEncoder discretized via $k$-means clustering and an acoustic encoder
Paper (JSTSP 2024): Paper Link

1. Introduction

Neural audio codecs such as EnCodec, DAC, and HiFi-Codec use Vector Quantization (VQ) to learn a compact codebook and transmit the corresponding codebook indices
- Recently, neural audio codecs have mainly been used for audio language models such as AudioLM, MusicLM, AudioGen, and VALL-E
- BUT, because audio language models are autoregressive, their inference time depends on the codec token rate
  1. In particular, long sequences require substantial computational resources and cause long-term dependency problems
  2. A low-bitrate audio codec can address this, but strong artifacts may degrade the reconstruction quality
    - Consequently, an audio codec that guarantees high-quality reconstruction even at low bitrates is needed
- In addition, effective language model learning requires semantic richness
  - BUT, existing neural codecs fail to capture adequate semantic information

-> Hence, the paper proposes SemantiCodec, which guarantees high reconstruction quality and semantic richness at low bitrates

SemantiCodec
- The mel-spectrogram is processed sequentially by 2 encoders, and 2 distinct VQ layers are applied
  1. The first VQ layer captures semantic information and is built from the centroids derived by $k$-means clustering on a dataset of Audio Masked AutoEncoder (AudioMAE) features
  2. The second VQ layer uses a learnable VQ mechanism to improve reconstruction fidelity
- The quantized outputs of the two VQ layers are then concatenated, and a Latent Diffusion Model (LDM) is used for high-quality audio reconstruction

< Overview of SemantiCodec >

A low-bitrate semantic codec that combines the rich audio representation learned by the self-supervised AudioMAE with an LDM
As a result, it outperforms existing codecs

2. Method

- Overview

Given an input audio $\mathbf{x}\in\mathbb{R}^{l}$ with audio sample length $l$, it is transformed into a mel-spectrogram $\mathbf{X}\in\mathbb{R}^{T\times F}$ with temporal and frequency dimensions $T,F$
- Next, a pre-trained AudioMAE $\mathcal{A}(\cdot)$ computes the AudioMAE feature $\tilde{\mathbf{Y}}=\mathcal{A}(\mathbf{X})=[\tilde{\mathbf{y}}_{1},\tilde{\mathbf{y}}_{2},...,\tilde{\mathbf{y}}_{L}]\in\mathbb{R}^{L\times E}$
  - $L=\frac{TF}{P^{2}}$ : number of patch embedding vectors, $P,E$ : patch size and embedding size of AudioMAE, respectively
- Each patch corresponds to a distinct, non-overlapping block of the mel-spectrogram processed by AudioMAE, and multiple patches form the input of the AudioMAE encoder
- To reduce the number of patch embedding vectors, which determines the bitrate after quantization, the paper aggregates adjacent vectors of $\tilde{\mathbf{Y}}$ into $\mathbf{Y}=[\mathbf{y}_{1},\mathbf{y}_{2},...,\mathbf{y}_{\frac{L}{K}}]\in\mathbb{R}^{\frac{L}{K}\times KE}$
  1. Here $K\in\{1,2,4\}$ is the stack factor, and $\mathbf{y}_{i}=[\tilde{\mathbf{y}}_{iK},...,\tilde{\mathbf{y}}_{(i+1)K-1}]$ for $i\in\{0,1,...,\frac{L}{K}-1\}$
  2. Extensive clustering on the vectors $\mathbf{y}_{i}$ then yields the semantic codebook $\mathbb{E}_{s}=[\mathbf{e}_{1},\mathbf{e}_{2},...,\mathbf{e}_{N_{s}}]$, which is called semantic clustering
    - $N_{s}$ : number of semantic codebook entries
- The stacked feature $\mathbf{Y}$ is converted into semantic tokens $\mathbf{c}_{s}$ and semantic features $\mathbf{E}_{s}\in\mathbb{R}^{\frac{L}{K}\times KE}$ through an initial quantization with the codebook $\mathbb{E}_{s}$
  1. $\mathbf{Y}, \mathbf{E}_{s}$ are then concatenated and passed through the acoustic encoder $\mathcal{F}(\cdot)$ to compute the acoustic feature $\mathbf{Y}_{A}$
    - This is quantized by an acoustic VQ layer with codebook $\mathbb{E}_{a}\in\mathbb{R}^{N_{a}\times KE}$, which outputs acoustic tokens $\mathbf{c}_{a}$ and quantized acoustic features $\mathbf{E}_{a}$
  2. Consequently, the final tokens for the input audio $\mathbf{x}$ are obtained by merging the semantic and acoustic tokens as $\mathbf{c}=[\mathbf{c}_{s},\mathbf{c}_{a}]$
- The SemantiCodec decoder uses an LDM conditioned on $\mathbf{E}=[\mathbf{E}_{s},\mathbf{E}_{a}]$, the concatenation of the quantized semantic and acoustic features
  - The LDM estimate is further decoded into a waveform by a pre-trained VAE decoder and a mel-spectrogram vocoder, and the acoustic encoder $\mathcal{F}(\cdot)$ is jointly optimized with the acoustic codebook $\mathbb{E}_{a}$ and the LDM
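
The effect of the stack factor $K$ on the feature shape and token count can be sketched in a few lines of NumPy. The sizes `T`, `F`, `P`, `E` below are illustrative assumptions (not values quoted in this review), and `stack_features` is a hypothetical helper name; the sketch only shows that stacking maps $(L, E)$ to $(L/K, KE)$ and that the total token count is $2L/K$.

```python
import numpy as np

def stack_features(y_tilde: np.ndarray, k: int) -> np.ndarray:
    """Concatenate every k adjacent patch embeddings: (L, E) -> (L // k, k * E)."""
    num_vectors, emb_dim = y_tilde.shape
    if num_vectors % k != 0:
        raise ValueError("L must be divisible by the stack factor K")
    # Row-major reshape places rows i*k ... (i+1)*k-1 side by side in output row i.
    return y_tilde.reshape(num_vectors // k, k * emb_dim)

# Illustrative sizes only (assumptions, not values from the paper).
T, F, P, E = 64, 32, 16, 8
L = (T * F) // (P * P)            # number of patch embedding vectors
y_tilde = np.arange(L * E, dtype=np.float32).reshape(L, E)

for K in (1, 2, 4):
    y = stack_features(y_tilde, K)
    num_tokens = 2 * (L // K)     # semantic + acoustic tokens, c = [c_s, c_a]
    print(K, y.shape, num_tokens)
```

Doubling $K$ halves the number of tokens per clip while doubling the dimension of each vector that has to be quantized, which is the bitrate versus quantization-difficulty trade-off described above.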

- Semantic Clustering

AudioMAE features effectively preserve both semantic and acoustic information, which makes them well suited for reconstruction
- Therefore, the paper uses AudioMAE features as the SemantiCodec encoder input to improve audio reconstruction quality while guaranteeing semantic content
- First, given a mel-spectrogram $\mathbf{X}\in\mathbb{R}^{T\times F}$, AudioMAE transforms $\mathbf{X}$ into patches of dimension $P\times P$
  1. These patches form the AudioMAE encoder input, and the AudioMAE encoder output $\mathbf{Y}_{0}$ (denoted $\tilde{\mathbf{Y}}$ in the Overview) has dimension $\frac{T}{P}\times \frac{F}{P}\times E$
    - This can be viewed as a tensor sequence with length $L=\frac{TF}{P^{2}}$ and embedding dimension $E$
  2. Then, $K$ adjacent frames of $\mathbf{Y}_{0}$ are stacked to obtain the stacked AudioMAE feature vectors $\mathbf{Y}=\{\mathbf{y}_{1},\mathbf{y}_{2},...,\mathbf{y}_{\frac{L}{K}}\}$, which are used both as the SemantiCodec encoder input and for semantic clustering
- To perform semantic quantization of the AudioMAE feature vectors $\mathbf{y}_{i}$, the paper uses $k$-means clustering
  1. BUT, a single $k$-means clustering is suboptimal, so a cluster ensemble is adopted following HuBERT
  2. That is, a distinct $k$-means model is trained for each of 3 categories (speech, music, general sound), and the codebooks ($k$-means centroids) obtained from these domains are combined into an ensembled codebook
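
A minimal sketch of the cluster ensemble, assuming plain Lloyd's $k$-means and random stand-in data in place of real stacked AudioMAE features. The function `kmeans`, the per-domain centroid count of 8, and the feature size are hypothetical choices for illustration; the paper's actual clustering configuration is not given in this review.

```python
import numpy as np

def kmeans(features, num_centroids, num_iters=20, seed=0):
    """Plain Lloyd's k-means; returns centroids of shape (num_centroids, dim)."""
    rng = np.random.default_rng(seed)
    init = rng.choice(len(features), num_centroids, replace=False)
    centroids = features[init].copy()
    for _ in range(num_iters):
        # Squared distance of every feature to every centroid: (n, num_centroids).
        dists = ((features[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        assign = dists.argmin(axis=1)
        for j in range(num_centroids):
            members = features[assign == j]
            if len(members) > 0:      # keep the old centroid if a cluster is empty
                centroids[j] = members.mean(axis=0)
    return centroids

rng = np.random.default_rng(0)
dim = 16  # stands in for the stacked feature size K * E
# Random stand-ins for the stacked AudioMAE features of each domain (hypothetical data).
domain_features = {
    "speech": rng.normal(0.0, 1.0, size=(200, dim)),
    "music": rng.normal(3.0, 1.0, size=(200, dim)),
    "general_sound": rng.normal(-3.0, 1.0, size=(200, dim)),
}
# One k-means model per domain, then concatenate the centroids into one codebook.
per_domain = [kmeans(x, num_centroids=8) for x in domain_features.values()]
semantic_codebook = np.concatenate(per_domain, axis=0)
print(semantic_codebook.shape)   # (24, 16)
```

Because the centroids are simply concatenated, each domain keeps a dedicated share of the codebook regardless of how much data the other domains contribute.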

- SemantiCodec Encoder

The encoding process of SemantiCodec takes the stacked AudioMAE feature $\mathbf{Y}$ as input and computes the tokens $\mathbf{c}=[\mathbf{c}_{s},\mathbf{c}_{a}]$ and the latent features $\mathbf{E}=[\mathbf{E}_{s},\mathbf{E}_{a}]$
Semantic Encoder
- For the $N_{s}$ semantic codebook centroids $\mathbf{e}_{j}\in\mathbb{E}_{s}=\{\mathbf{e}_{1},\mathbf{e}_{2},...,\mathbf{e}_{N_{s}}\}$, the semantic quantization process is:
  (Eq. 1) $c_{s}(i)=\arg\min_{j\in\{1,...,N_{s}\}}||\mathbf{y}_{i}-\mathbf{e}_{j}||^{2},\,\,\, \mathbf{E}_{s}(i)=\mathbf{e}_{c_{s}(i)}$
  - $i\in\{0,1,...,\frac{L}{K}-1\}$, $c_{s}(i)$ : index of the centroid in the semantic codebook $\mathbb{E}_{s}$ closest to the feature vector $\mathbf{y}_{i}$, $\mathbf{E}_{s}(i)$ : the corresponding centroid vector
- This quantization step maps each high-dimensional feature vector $\mathbf{y}_{i}$ to a discrete semantic token $c_{s}(i)$ and its quantized semantic feature $\mathbf{E}_{s}(i)$
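
The nearest-centroid lookup of (Eq. 1) can be sketched as follows. The toy codebook and features are made-up 2-dimensional values for illustration, and `quantize` is a hypothetical helper name.

```python
import numpy as np

def quantize(features, codebook):
    """Eq. 1: nearest-centroid lookup. Returns (token indices, quantized features)."""
    # ||y - e||^2 = ||y||^2 - 2 y.e + ||e||^2, evaluated for all pairs at once.
    dists = (
        (features ** 2).sum(axis=1, keepdims=True)
        - 2.0 * features @ codebook.T
        + (codebook ** 2).sum(axis=1)
    )
    tokens = dists.argmin(axis=1)
    return tokens, codebook[tokens]

codebook = np.array([[0.0, 0.0], [1.0, 1.0], [4.0, 0.0]])   # toy E_s with N_s = 3
y = np.array([[0.1, -0.2], [3.5, 0.4], [0.9, 1.2]])          # toy stacked features
c_s, e_s = quantize(y, codebook)
print(c_s)   # [0 2 1]
```

Only the indices `c_s` need to be transmitted; the decoder side recovers `e_s` by indexing the same frozen codebook.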
Acoustic Encoder
- The quantized AudioMAE feature $\mathbf{E}_{s}$ encapsulates rich semantic information, but using $\mathbf{E}_{s}$ alone leads to suboptimal reconstruction
- To address this, the paper introduces an additional acoustic encoder module that captures a discrete, acoustic-detail-oriented representation
  1. The input feature for acoustic quantization is obtained through the acoustic encoder $\mathcal{F}_{\Phi}(\cdot)$
  2. This encoder takes the AudioMAE features before and after semantic quantization as input and outputs the acoustic feature:
    (Eq. 2) $\mathbf{Y}_{A}=\mathcal{F}_{\Phi}([\mathbf{Y},\mathbf{E}_{s}])\in\mathbb{R}^{\frac{L}{K} \times EK}$
    - $\Phi$ : trainable parameters of the acoustic encoder
  3. Semantic quantization causes information loss in $\mathbf{E}_{s}$, whereas $\mathbf{Y}$ retains the information needed for reconstruction, so both $\mathbf{Y},\mathbf{E}_{s}$ are used as input
    - Architecturally, a BiLSTM is adopted
- The paper quantizes $\mathbf{Y}_{A}$ to convert detailed acoustic nuances into a compact, discrete format
  1. Let the acoustic codebook be $\mathbb{E}_{a}=\{\mathbf{e}_{1},\mathbf{e}_{2},...,\mathbf{e}_{N_{a}}\}$, with individual codebook vectors $\mathbf{e}_{j}\in\mathbb{R}^{EK}$ and $N_{a}$ codebook entries
  2. For each vector $\mathbf{y}_{a,i}\in\mathbf{Y}_{A}$, the quantization process is:
    (Eq. 3) $c_{a}(i)=\arg\min_{j\in \{1,...,N_{a}\}}||\mathbf{y}_{a,i}-\mathbf{e}_{j}||^{2},\,\,\, \mathbf{E}_{a}(i)=\mathbf{e}_{c_{a}(i)}$
    - $i\in \{0,1,...,\frac{L}{K}-1\}$, $c_{a}(i)$ : index of the nearest centroid to the feature vector $\mathbf{y}_{a,i}$ in the acoustic codebook $\mathbb{E}_{a}$, $\mathbf{E}_{a}(i)$ : quantized vector
  3. So that the acoustic encoder produces representations that closely match the codebook entries, the paper introduces the following commitment loss:
    (Eq. 4) $\mathcal{L}_{commit}=\sum_{i}||\mathbf{y}_{a,i}-\mathbf{E}_{a}(i)||^{2}$
  4. In addition, following HiFi-Codec, an Exponential Moving Average (EMA) is used for the codebook update
- The final tokens $\mathbf{c}$ and representation $\mathbf{E}$ are obtained by concatenating the outputs of the semantic and acoustic quantization layers:
  (Eq. 5) $\mathbf{c}=[\mathbf{c}_{s},\mathbf{c}_{a}]\in\mathbb{N}^{\frac{2L}{K}},\,\,\, \mathbf{E}=[\mathbf{E}_{s},\mathbf{E}_{a}]\in\mathbb{R}^{\frac{L}{K}\times 2EK}$
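
The acoustic VQ layer with its commitment loss and EMA codebook update can be sketched as below. The review only states that EMA is used, so the update rule here is the common VQ-VAE-style EMA rule and is an assumption, as are the class name `EmaCodebook`, the decay of 0.99, and the random stand-in for $\mathbf{Y}_{A}$.

```python
import numpy as np

class EmaCodebook:
    """Codebook updated by exponential moving averages (VQ-VAE-style rule, assumed here)."""

    def __init__(self, codebook, decay=0.99, eps=1e-5):
        self.codebook = codebook.astype(np.float64).copy()   # (N_a, dim)
        self.decay = decay
        self.eps = eps
        self.cluster_size = np.ones(len(codebook))           # running usage counts
        self.embed_sum = self.codebook.copy()                 # running feature sums

    def quantize(self, y_a):
        dists = ((y_a[:, None, :] - self.codebook[None, :, :]) ** 2).sum(-1)
        tokens = dists.argmin(axis=1)                         # c_a(i), Eq. 3
        return tokens, self.codebook[tokens]                  # E_a(i)

    def ema_update(self, y_a, tokens):
        one_hot = np.eye(len(self.codebook))[tokens]          # (n, N_a)
        self.cluster_size = (self.decay * self.cluster_size
                             + (1 - self.decay) * one_hot.sum(axis=0))
        self.embed_sum = (self.decay * self.embed_sum
                          + (1 - self.decay) * (one_hot.T @ y_a))
        self.codebook = self.embed_sum / (self.cluster_size[:, None] + self.eps)

def commitment_loss(y_a, e_a):
    """Eq. 4: sum over frames of the squared distance to the chosen codebook entry."""
    return float(((y_a - e_a) ** 2).sum())

rng = np.random.default_rng(0)
vq = EmaCodebook(rng.normal(size=(4, 8)))   # toy codebook with N_a = 4
y_a = rng.normal(size=(32, 8))              # stand-in for the acoustic encoder output Y_A
tokens, e_a = vq.quantize(y_a)
loss = commitment_loss(y_a, e_a)
vq.ema_update(y_a, tokens)                  # codebook drifts toward the assigned features
```

In an autograd framework the quantized vectors would typically be treated as constants inside the commitment loss, so that the loss pulls the encoder output toward the codebook while the EMA rule moves the codebook toward the encoder output.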

- Latent Diffusion Model for Reconstruction

The paper uses an LDM as the decoder to reconstruct the original audio $\mathbf{x}$
- Here, the LDM models the data distribution in the latent space of a VAE
  - The VAE encoder is used only during SemantiCodec training and is discarded at inference
- Compared with conventional diffusion models, the LDM reduces computation by compressing the high-dimensional spectrogram $\mathbf{X}$ into a low-dimensional latent $\mathbf{z}_{0}$
  1. A diffusion model is then trained to gradually generate $\mathbf{z}_{0}$ from Gaussian noise
  2. The forward diffusion process consists of $N$ Markov transition steps, each of which injects noise so that $\mathbf{z}_{0}$ is gradually transformed into a Gaussian distribution
  3. The forward step from $n-1$ to $n$ is:
    (Eq. 6) $\mathbf{z}_{n}\sim q(\mathbf{z}_{n}|\mathbf{z}_{n-1}):\,\,\, \mathbf{z}_{n}=\sqrt{1-\beta_{n}}\mathbf{z}_{n-1}+\sqrt{\beta_{n}}\epsilon_{n}$
    - $\beta_{n}$ : pre-defined noise schedule
  4. Composing these forward steps gives a closed-form distribution of step $n$ given the initial $\mathbf{z}_{0}$:
    (Eq. 7) $\mathbf{z}_{n}\sim q(\mathbf{z}_{n}|\mathbf{z}_{0}):\,\,\, \mathbf{z}_{n}=\sqrt{\bar{\alpha}_{n}}\mathbf{z}_{0} +\sqrt{1-\bar{\alpha}_{n}}\epsilon_{n}$
    - $\alpha_{n}=1-\beta_{n}, \bar{\alpha}_{n}=\prod_{i=1}^{n}\alpha_{i}, \epsilon_{n}\sim\mathcal{N}(0,I)$
- Given sufficiently many diffusion steps $N$, $q(\mathbf{z}_{N})$ approaches the standard Gaussian distribution $\mathcal{N}(0,I)$
  1. The LDM is then trained to model the reverse probability $p_{\theta}(\mathbf{z}_{n-1}|\mathbf{z}_{n},\mathbf{E})$ conditioned on the SemantiCodec encoder output $\mathbf{E}$
  2. Although $\mathbf{E}_{a}$ may contain part of the information in $\mathbf{E}_{s}$, both are passed to the diffusion model so that the quantization layers can collaboratively capture information from $\mathbf{Y}$
- With common noise schedules, the noisy latent $\mathbf{z}_{N}$ at the last forward diffusion step does not exactly follow a standard Gaussian distribution
  1. To address this, the paper adopts a cosine noise schedule that guarantees the last step follows a standard Gaussian distribution
    - In addition, velocity prediction is applied to stabilize the sampling process
  2. The LDM training loss is then:
    (Eq. 8) $\mathbf{v}_{n}=\sqrt{\bar{\alpha}_{n}}\epsilon -\sqrt{1-\bar{\alpha}_{n}}\mathbf{z}_{0}$
    (Eq. 9) $\mathcal{L}_{recon}=||\mathbf{v}_{n}-\mathcal{G}_{\theta}(\mathbf{z}_{n},n,\mathbf{E})||^{2}$
    - $\mathcal{G}_{\theta}$ : LDM, $\theta$ : trainable parameters
- At inference, a Denoising Diffusion Implicit Model (DDIM) sampler is used, and the audio $\hat{\mathbf{x}}$ is reconstructed through the pre-trained VAE decoder and a HiFi-GAN vocoder
- In addition, the paper adopts Classifier-Free Guidance (CFG) for better reconstruction
  1. First, the condition $\mathbf{E}$ in (Eq. 9) is discarded with a certain probability during training, so that the conditional model $\mathbf{v}_{\theta}(\mathbf{z}_{n},n,\mathbf{E})$ and the unconditional model $\mathbf{v}_{\theta}(\mathbf{z}_{n},n)$ are optimized in a multi-task paradigm
  2. During sampling, the original $\mathbf{v}_{\theta}(\mathbf{z}_{n},n,\mathbf{E})$ is replaced by a weighted combination of the velocities predicted by the conditional and unconditional models:
    (Eq. 10) $w\cdot \mathbf{v}_{\theta}(\mathbf{z}_{n},n,\mathbf{E})+(1-w)\cdot \mathbf{v}_{\theta}(\mathbf{z}_{n},n)$
    - $w$ : guidance scale (a larger $w$ puts more weight on the conditional prediction)
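
The closed-form noising of (Eq. 7) and the velocity target of (Eq. 8) can be checked numerically with a short sketch. The schedule `alpha_bar` below is a simple cosine-shaped curve chosen so that $\bar{\alpha}_{0}=1$ and $\bar{\alpha}_{N}=0$; it is an illustrative assumption, not necessarily the exact schedule used in the paper, and `recover_z0` is a hypothetical helper that inverts the two equations.

```python
import numpy as np

def alpha_bar(n, num_steps):
    """Illustrative cosine-shaped schedule: alpha_bar(0) = 1, alpha_bar(N) = 0 (assumption)."""
    return np.cos(0.5 * np.pi * n / num_steps) ** 2

def noisy_latent(z0, eps, a_bar):
    """Eq. 7: sample z_n directly from z_0."""
    return np.sqrt(a_bar) * z0 + np.sqrt(1.0 - a_bar) * eps

def velocity_target(z0, eps, a_bar):
    """Eq. 8: the target that the LDM regresses in Eq. 9."""
    return np.sqrt(a_bar) * eps - np.sqrt(1.0 - a_bar) * z0

def recover_z0(z_n, v_n, a_bar):
    """Invert Eq. 7 and Eq. 8: z_0 = sqrt(a_bar) * z_n - sqrt(1 - a_bar) * v_n."""
    return np.sqrt(a_bar) * z_n - np.sqrt(1.0 - a_bar) * v_n

rng = np.random.default_rng(0)
N = 1000
z0 = rng.normal(size=(4, 8))     # stand-in for a VAE latent
eps = rng.normal(size=(4, 8))

for n in (0, 250, N):
    a = alpha_bar(n, N)
    z_n = noisy_latent(z0, eps, a)
    v_n = velocity_target(z0, eps, a)
    print(n, np.allclose(recover_z0(z_n, v_n, a), z0))   # True at every step
```

At the last step $\bar{\alpha}_{N}=0$, so $\mathbf{z}_{N}$ is pure noise and the velocity target reduces to $-\mathbf{z}_{0}$. The target therefore stays informative even when the input carries no signal, which is one reason velocity prediction pairs well with a schedule whose final step is exactly Gaussian.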

- Training Objective

The paper keeps the parameters of the pre-trained AudioMAE, VAE, and vocoder frozen
- The $k$-means semantic clustering centroids are obtained before LDM training and then frozen
- The acoustic encoder, acoustic VQ layer, and LDM are jointly optimized with the sum of the commitment loss of the acoustic VQ layer and the reconstruction loss:
  (Eq. 11) $\mathcal{L}=\mathcal{L}_{recon}+\mathcal{L}_{commit}$

3. Experiments

- Settings

Dataset : GigaSpeech, Million Song Dataset, MedleyDB, MUSDB18, AudioSet, VGGSound
Comparisons : EnCodec, DAC, HiFi-Codec

- Results

Overall, SemantiCodec shows superior reconstruction performance

It also achieves strong results in the MUSHRA test

It also performs well across different domains

It also yields better reconstruction in terms of the mel-spectrogram

Semantic in the Codec Tokens
- In terms of semantic richness, SemantiCodec also outperforms the other codecs

Variable Semantic Codebook Size
- Using a variable vocabulary size improves ViSQOL but substantially degrades WER

As the number of centroids increases, the quantization error caused by $k$-means modeling decreases

Using more $k$-means centroids improves the reconstruction quality

Acoustic Representation Learning
- A larger acoustic codebook size enables better reconstruction

Learnable Semantic Codebook
- A learnable codebook performs worse than $k$-means centroids

DDIM Sampling Setups
- If the CFG guidance scale $w$ is too small, it cannot provide sufficient condition-oriented guidance
