[Paper Review] CLAP: Learning Audio Concepts from Natural Language Supervision

feVeRin 2025. 7. 30. 17:00

CLAP: Learning Audio Concepts from Natural Language Supervision

Audio models trained under restricted supervision are limited in flexibility
CLAP
- Learns audio concepts through natural language supervision
- Uses two encoders and contrastive learning to model audio and text descriptions in a joint multimodal space
Paper (ICASSP 2023) : Paper Link

1. Introduction

Most audio models are trained only on the pre-defined categories and audio recordings of a specific task, so restricted supervision limits their flexibility
- Self-Supervised Learning (SSL) models such as Wav2Vec 2.0 and WavLM, on the other hand, pre-train on unlabeled audio and thereby avoid the limited-supervision problem
  - The pre-trained SSL model is then adapted to downstream tasks under the class label paradigm
- BUT, these SSL models still end in a static output layer that predicts only pre-defined categories, so zero-shot prediction on unseen categories is difficult

-> The paper therefore proposes CLAP, which uses natural language supervision to enable zero-shot audio prediction

CLAP
- Connects natural language and audio by using two encoders and contrastive learning to model audio and text descriptions in a joint multimodal space
- Supports flexible class prediction through zero-shot prediction

< Overview of CLAP >

An audio classification model built on contrastive learning and natural language supervision
As a result, it achieves strong performance on a variety of downstream tasks

2. Method

CLAP takes audio-text pairs as input and passes them to an audio encoder and a text encoder
- Each representation is then mapped into a joint multimodal space by a linear projection
  - This space is learned with contrastive learning, based on the similarity of the audio-text pairs
- The pre-trained encoders, together with their projection layers, compute audio and text embeddings and can be used for zero-shot classification

- Contrastive Language-Audio Pre-Training

Let the processed audio be $X_{a}\in\mathbb{R}^{F\times T}$, with $F$ Mel bins and $T$ time bins, and let the text be $X_{t}$
- Each audio-text pair in a batch of size $N$ is written as $\{X_{a},X_{t}\}_{i}$
  - $i\in\{1,\dots,N\}$
- The audio and text of each pair are passed through the audio encoder and the text encoder, respectively
  1. For a batch of $N$ pairs, with audio encoder $f_{a}(\cdot)$ and text encoder $f_{t}(\cdot)$:
    (Eq. 1) $\hat{X}_{a}=f_{a}(X_{a});\,\,\,\hat{X}_{t}=f_{t}(X_{t})$
    - $\hat{X}_{a}\in\mathbb{R}^{N\times V}$ : audio representation of dimensionality $V$
    - $\hat{X}_{t}\in\mathbb{R}^{N\times U}$ : text representation of dimensionality $U$
  2. The paper then brings the audio and text representations $\hat{X}_{a},\hat{X}_{t}$ into a joint multimodal space of dimension $d$ using learnable linear projections:
    (Eq. 2) $E_{a}=L_{a}(\hat{X}_{a});\,\,\,E_{t}=L_{t}(\hat{X}_{t})$
    - $E_{a}\in\mathbb{R}^{N\times d}, E_{t}\in\mathbb{R}^{N\times d}$
    - $L_{a},L_{t}$ : linear projections for audio and text, respectively
  3. The audio and text embeddings $(E_{a},E_{t})$ are now comparable, so their similarity can be computed:
    (Eq. 3) $C=\tau\cdot(E_{t}\cdot E_{a}^{\top})$
    - $\tau$ : temperature parameter that scales the similarities
    - The similarity matrix $C\in \mathbb{R}^{N\times N}$ has the $N$ correct pairs on its diagonal and the $N^{2}-N$ incorrect pairs off the diagonal
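- To train the audio and text encoders jointly with the linear projections, the paper applies the following symmetric cross-entropy loss $\mathcal{L}$ to the similarity matrix:
  (Eq. 4) $\mathcal{L}=0.5\cdot\left(\ell_{text}(C)+\ell_{audio}(C)\right)$
  - $\ell_{k}=-\frac{1}{N}\sum_{i=1}^{N}\log\left[\text{diag}(\text{Softmax}_{k}(C))\right]_{i}$, with the softmax taken along the text and audio axes, respectively
  - Written here with the minus sign of the standard cross-entropy made explicit, so that $\mathcal{L}$ is minimized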
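
Eq. 2 to Eq. 4 can be sketched in NumPy as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: random arrays stand in for the encoder outputs of Eq. 1, the projections $L_{a},L_{t}$ are plain weight matrices without bias, the sizes are toy values, and the helper names (`log_softmax`, `similarity`, `clap_loss`) are hypothetical. The loss is written as a negative log-likelihood so that lower is better, and which softmax axis is called "text" versus "audio" is an assumption (the symmetric sum is the same either way).

```python
import numpy as np

def log_softmax(x, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def similarity(E_a, E_t, tau):
    """Eq. 3: scaled pairwise similarity; rows index text, columns index audio."""
    return tau * (E_t @ E_a.T)

def clap_loss(C):
    """Eq. 4: symmetric cross-entropy over the diagonal (matched pairs) of C."""
    # Assumption: "text" normalizes each row, "audio" normalizes each column.
    l_text = -np.mean(np.diag(log_softmax(C, axis=1)))
    l_audio = -np.mean(np.diag(log_softmax(C, axis=0)))
    return 0.5 * (l_text + l_audio)

rng = np.random.default_rng(0)
N, V, U, d = 4, 16, 12, 8            # toy sizes (hypothetical)
X_hat_a = rng.normal(size=(N, V))    # stand-in for f_a(X_a), Eq. 1
X_hat_t = rng.normal(size=(N, U))    # stand-in for f_t(X_t), Eq. 1
W_a = rng.normal(size=(V, d))        # weights of L_a (bias omitted)
W_t = rng.normal(size=(U, d))        # weights of L_t (bias omitted)

E_a = X_hat_a @ W_a                  # Eq. 2, shape (N, d)
E_t = X_hat_t @ W_t                  # Eq. 2, shape (N, d)
C = similarity(E_a, E_t, tau=0.1)    # Eq. 3, shape (N, N)
loss = clap_loss(C)                  # Eq. 4, scalar
```

For an all-zero $C$ (no pair preferred over another) this loss equals $\log N$, and it approaches 0 as the diagonal entries dominate their rows and columns.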

- Zero-Shot Linear Classification

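For zero-shot classification, consider a target dataset with $C$ class labels and $N$ test audio clips (here $C$ is the number of classes, not the similarity matrix of Eq. 3)
- First, the pre-trained encoders and projection layers compute audio embeddings for the $N$ audio clips and text embeddings for the $C$ class labels
- Then the cosine similarity between each test audio clip and every class label is computed
  - Each audio clip thus has as many logits as there are class labels
- Finally, the logits are turned into probabilities by applying softmax for binary or multiclass classification, or sigmoid for multilabel classification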
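
The zero-shot steps above can be sketched as follows. This is a minimal sketch, assuming the audio and class-label embeddings have already been computed by the pre-trained encoders and projection layers; the toy embeddings and the helper names (`cosine_logits`, `softmax`, `sigmoid`) are hypothetical. Softmax gives one distribution over the class labels per clip, while sigmoid scores each class independently.

```python
import numpy as np

def cosine_logits(audio_emb, text_emb):
    """Cosine similarity between every test clip (rows) and every class label (columns)."""
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    return a @ t.T                                   # shape (num_clips, num_classes)

def softmax(logits):
    """One probability distribution over the class labels per clip."""
    z = np.exp(logits - logits.max(axis=1, keepdims=True))
    return z / z.sum(axis=1, keepdims=True)

def sigmoid(logits):
    """Independent per-class scores in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-logits))

# Toy embeddings (hypothetical): 2 test clips, 3 class labels, d = 2.
audio_emb = np.array([[1.0, 0.0], [0.0, 2.0]])
text_emb = np.array([[3.0, 0.0], [0.0, 1.0], [1.0, 1.0]])

logits = cosine_logits(audio_emb, text_emb)   # as many logits per clip as class labels
probs = softmax(logits)                       # each row sums to 1
scores = sigmoid(logits)                      # rows need not sum to 1
pred = probs.argmax(axis=1)                   # most similar class label per clip
```

Because the class labels enter only through their text embeddings, changing the label set requires no retraining, only a new `text_emb`.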

3. Experiments

- Settings

Dataset : FSD50K, Clotho V2, AudioCaps, MACS

- Results

CLAP performs strongly across a range of downstream tasks

Effect of Freezing CLAP Encoders
- Unfreezing the text encoder yields better performance than unfreezing the audio encoder

Changing Prompts in Zero-Shot Evaluation
- Depending on the prompt, a performance improvement of $5\%$ can be obtained
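
Changing the prompt only changes the text that is passed to the text encoder for each class label; the audio side stays the same. A minimal sketch, assuming prompts are built by filling a template with the class label. The helper `build_prompts`, the example labels, and the template string are hypothetical and are not taken from the experiments above.

```python
def build_prompts(class_labels, template="{}"):
    """Wrap each class label in a prompt template before text encoding."""
    return [template.format(label) for label in class_labels]

labels = ["dog bark", "rain", "siren"]                        # hypothetical class labels
bare = build_prompts(labels)                                  # labels used as-is
prompted = build_prompts(labels, "This is a sound of {}.")    # hypothetical template
```

Each list would then be encoded into class-label embeddings and compared against the same audio embeddings, so any performance difference comes from the prompt wording alone.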
