[Paper Review] Pengi: An Audio Language Model for Audio Tasks

feVeRin 2024. 3. 7. 10:29

Pengi: An Audio Language Model for Audio Tasks

Language models used in the audio domain lack the ability to handle open-ended tasks such as Audio Captioning and Audio Question Answering
Pengi
- An audio language model that frames all audio tasks as text generation tasks and applies transfer learning
- A text encoder and an audio encoder each represent their input as a sequence of continuous embeddings, and the two sequences are combined into a prefix that prompts a pre-trained frozen language model
- Pengi supports both open-ended and close-ended tasks without additional fine-tuning or task-specific extensions
Paper (NeurIPS 2023) : Paper Link

1. Introduction

Audio tasks such as sound event and scene classification, audio retrieval, and audio captioning are related to one another, so Transfer Learning (TL) can be leveraged
- TL focuses on extending the knowledge gained from one task to related tasks
  1. The most common approach pre-trains a model on large-scale datasets covering various tasks and then fine-tunes it on the target dataset
    - This lets the model learn general-purpose audio representations that can be used across diverse downstream tasks
  2. For unlabeled audio, self-supervised or unsupervised learning can be used
    - BUT, an additional fine-tuning step is still required before the model can be applied to downstream tasks
  3. TL strategies such as zero-shot learning, on the other hand, can remove the dependence on fine-tuning
    - A contrastive objective is used to learn the similarity between natural language descriptions and audio content, which provides a score for identifying class labels
    - BUT, zero-shot models are difficult to apply to open-ended tasks such as Audio Captioning and Audio Question Answering (AQA)
- Conversely, existing audio language models that do support open-ended tasks do not support close-ended tasks at all
  - In other words, no language model in the audio domain yet supports both open-ended and close-ended tasks
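
To make the zero-shot strategy described above concrete, here is a minimal Python sketch. It assumes that some contrastive model has already embedded the audio clip and the class descriptions into a shared space; the embedding values and the function name `zero_shot_classify` are invented for illustration, and no real model or library API is involved.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def zero_shot_classify(audio_emb, class_text_embs):
    """Score every class description against the audio and pick the best one."""
    scores = {label: cosine(audio_emb, emb) for label, emb in class_text_embs.items()}
    best = max(scores, key=scores.get)
    return best, scores

# Toy embeddings (assumed values, standing in for text-encoder outputs)
class_text_embs = {
    "dog bark": [0.9, 0.1, 0.0],
    "siren": [0.0, 0.2, 0.9],
    "rain": [0.1, 0.9, 0.1],
}
audio_emb = [0.8, 0.2, 0.1]  # stands in for an audio-encoder output

label, scores = zero_shot_classify(audio_emb, class_text_embs)
```

Because the prediction is always one of the labels supplied in `class_text_embs`, this scheme cannot produce free-form text, which is the limitation noted above for captioning and AQA.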

-> Pengi, an audio language model that uses transfer learning to support both open-ended and close-ended tasks, is proposed

Pengi
- An Audio Language Model that takes an audio recording and a text prompt as input and outputs free-form text
- Supports both open-ended and close-ended tasks without additional fine-tuning or task-specific extensions
- Frames every audio task as an audio-and-text input to text output task
  - Uses a single training procedure and a captioning objective function
  - Designs audio task templates for training, based on Instruction Tuning
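
The idea of framing every task as (audio, text prompt) -> text can be sketched as follows. This is only an illustration: of the prompt strings below, only "generate audio caption" appears in this review; the other templates, the `Example` class, the `make_example` helper, and the file name are hypothetical.

```python
from dataclasses import dataclass

# Only "generate audio caption" is taken from this review; the other
# template strings are illustrative placeholders (assumptions).
TEMPLATES = {
    "audio_captioning": "generate audio caption",
    "audio_qa": "question: {question}",
    "sound_event_classification": "this is a sound of",
}

@dataclass
class Example:
    audio_path: str   # audio input
    input_text: str   # text prompt built from a task template
    output_text: str  # target text the model should generate

def make_example(task, audio_path, target, **fields):
    """Turn any task into one (audio, text prompt) -> text training example."""
    prompt = TEMPLATES[task].format(**fields)
    return Example(audio_path, prompt, target)

examples = [
    make_example("audio_captioning", "dog.wav", "a dog barks in the distance"),
    make_example("audio_qa", "dog.wav", "yes", question="is an animal present?"),
    make_example("sound_event_classification", "dog.wav", "dog bark"),
]
```

All three examples share the same structure, so a single captioning-style objective can be used for training regardless of whether the task is open-ended or close-ended.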

< Overview of Pengi >

An Audio Language Model that frames all audio tasks as text generation tasks and applies transfer learning
Supports both open-ended and close-ended tasks without additional fine-tuning or task-specific extensions
Evaluated on 21 downstream tasks across diverse audio domains, achieving state-of-the-art performance on several of them

Examples of audio and text prompt inputs with the corresponding textual responses

2. Approach

Pengi is an audio language model that frames all audio tasks as text-generation tasks and applies transfer learning
- It takes an audio recording and a text prompt as input and generates free-form text as output
- Its unified architecture supports both open-ended and close-ended tasks without additional fine-tuning or task-specific extensions

- Unified Architecture

Audio Encoder
- The audio encoder $a_{\phi}$ converts the raw audio input into an audio embedding
  - The audio transformer backbone of CLAP is used as the audio encoder
  - CLAP is trained on orders of magnitude fewer audio-text pairs than comparable models in computer vision
- Therefore, its weights are unfrozen during training
Text Encoder
- The text encoder $g_{\psi}$ converts the input text prompt into a text embedding
  - Here, a prompt can be any form of natural language, such as a task-specific instruction or a question
- The text encoder is frozen during training, so its weights are not updated
  - An off-the-shelf text encoder is used so that close-ended tasks can be handled well
Mapping Network and Prefix
- Two mapping networks $m_{1}, m_{2}$ are used to build the prefix that is fed to the causal language model
- Each mapping network converts an embedding into a sequence of $k$ embeddings
  - The audio embedding is transformed by $m_{1}$ and the text embedding by $m_{2}$, and both mapping networks are trainable
- The two sequences are then concatenated to form a fixed-length prefix
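
The prefix construction can be sketched in plain Python. This is a shape-level illustration under loud assumptions: each mapping network is replaced by a single random linear projection (the review does not specify the mapping network's internals), and every dimension below is made up.

```python
import random

def make_linear(in_dim, out_dim, rng):
    """Random weight matrix standing in for a trainable mapping network (assumption)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]

def map_to_sequence(weights, embedding, k, lm_dim):
    """Project one embedding to k * lm_dim values, then split them into k vectors."""
    flat = [sum(w * e for w, e in zip(row, embedding)) for row in weights]
    return [flat[i * lm_dim:(i + 1) * lm_dim] for i in range(k)]

rng = random.Random(0)
k, lm_dim = 4, 8               # assumed prefix length per modality and LM embedding size
audio_dim, text_dim = 16, 12   # assumed encoder output sizes

m1 = make_linear(audio_dim, k * lm_dim, rng)  # audio mapping network
m2 = make_linear(text_dim, k * lm_dim, rng)   # text mapping network

audio_emb = [rng.gauss(0, 1) for _ in range(audio_dim)]  # stands in for the audio embedding
text_emb = [rng.gauss(0, 1) for _ in range(text_dim)]    # stands in for the text embedding

# Concatenate the two k-length sequences into one prefix of length 2k
prefix = map_to_sequence(m1, audio_emb, k, lm_dim) + map_to_sequence(m2, text_emb, k, lm_dim)
```

Whatever the audio length or prompt length, the prefix always contains $2k$ vectors in the language model's embedding space, which is why it is described as fixed-length.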
Causal Language Model
- To generate the text output, Pengi uses a pre-trained autoregressive causal language model that is kept frozen during both training and inference
- Even though the language model is frozen, gradients still flow through it to the audio prefix, so the parameters of the mapping network $m_{1}$ and the audio encoder $a_{\phi}$ can be optimized with backpropagation and gradient descent
- At inference, the language model generates tokens autoregressively, conditioned on the audio and text prefix
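
As a schematic sketch of why a frozen language model still lets the audio branch learn, write $p_{1:k}=m_{1}(a_{\phi}(x))$ for the audio half of the prefix and $\mathcal{L}$ for the training loss (the shorthand $p_{1:k}$ is introduced here only for illustration). The chain rule then gives, with Jacobian products left implicit:

    $\frac{\partial \mathcal{L}}{\partial \phi}=\frac{\partial \mathcal{L}}{\partial p_{1:k}}\cdot\frac{\partial p_{1:k}}{\partial a_{\phi}(x)}\cdot\frac{\partial a_{\phi}(x)}{\partial \phi}$

The first factor is obtained by backpropagating through the frozen language model, whose weights are used but never updated. Only the parameters of $m_{1}$ and $a_{\phi}$, which appear in the last two factors, change during training.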

- Training and Inference

Pengi introduces a new training framework that frames every audio task as an audio-and-text input to text output task
- To this end, a single training procedure and objective function are used
  1. Let the training data in audio-text-to-text format be $\{x^{i}, t^{i},c^{i}\}$
    - $x^{i}, t^{i}, c^{i}$ : the $i$-th audio file, input text, and output text/caption, respectively
  2. To build the prefix, the audio encoder $a_{\phi}$ and the mapping network $m_{1}$ project the audio $x^{i}$ to a sequence of $k$ embeddings
    - Likewise, the text encoder $g_{\psi}$ and the mapping network $m_{2}$ project the input text $t^{i}$ to a sequence of $k$ embeddings
  3. The two sequences are then concatenated to form the prefix $p^{i}$ for the pre-trained frozen language model $f_{\theta}$:
    (Eq. 1) $p^{i}=p^{i}_{1},...,p^{i}_{2k}=\textrm{concat}\{m_{1}(a_{\phi}(x^{i})),m_{2}(g_{\psi}(t^{i}))\}$
  4. The language model $f_{\theta}$ is fed the prefix-caption concatenations $\{z^{i}\}_{i=1}^{N}$, where $z^{i}$ is:
    (Eq. 2) $z^{i}=p^{i}_{1},...,p^{i}_{2k},c^{i}_{1},...,c^{i}_{l}$
    - $l$ : the number of tokens in the caption $c^{i}$
  5. The model is trained as a standard captioning system that autoregressively predicts the caption (text tokens) $c^{i}$ conditioned on the prefix, using the Cross-Entropy loss:
    (Eq. 3) $\mathcal{L}=-\sum_{i=1}^{N}\sum_{j=1}^{l}\log p_{\gamma}(c^{i}_{j}|p^{i}_{1},...,p^{i}_{2k},c^{i}_{1},...,c^{i}_{j-1})$
    - $\gamma$ : the trainable parameters of the model, comprising the audio encoder parameters $\phi$ and the parameters of the two mapping networks
    - The text encoder and the causal language model are frozen
- At inference, the prefix is built from the test audio and the text prompt
  - The causal language model $f_{\theta}$ is conditioned on the prefix and generates the next tokens sequentially
  - At each prediction step, the language model assigns a probability to every vocabulary token, and the decoding strategy uses these probabilities to choose the next token
  - The paper uses beam search decoding with a beam size of 5 for inference and downstream tasks
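
The inner sum of Eq. 3 for a single training example can be sketched as below. The sketch assumes the frozen language model has already been run on the prefix-caption sequence and has produced one row of logits per caption position; the toy logits, the 3-token vocabulary, and the function names are invented for illustration.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one row of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]

def caption_nll(step_logits, caption_ids):
    """Negative log-likelihood of one caption: minus the sum over positions j
    of the log-probability assigned to the correct token c_j."""
    total = 0.0
    for logits, token in zip(step_logits, caption_ids):
        total -= log_softmax(logits)[token]
    return total

# Toy example: 3-token vocabulary, caption of length l = 2.
# Row j holds the logits predicted for caption position j.
step_logits = [
    [2.0, 0.0, 0.0],  # model favors token 0 at the first position
    [0.0, 0.0, 0.0],  # model is uniform at the second position
]
caption_ids = [0, 2]

loss = caption_nll(step_logits, caption_ids)
```

Summing `caption_nll` over all $N$ training examples gives the full loss of Eq. 3. In an actual training loop, the gradient of this value would be used to update only the audio encoder and the two mapping networks.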

3. Experiments

- Settings

Datasets : AudioSet, FSD50K, CochlScene, MSP Podcast, CMU MOSI, CMU MOSEI, MELD, NSynth, FMA, AudioCaps, ClothoV2, ClothoAQA, WavText5K, MACS, SoundDescs, WavCaps, FreeSound, FindSound
Comparisons : CLAP

- Results

Benchmarking Pengi
- Pengi is an audio model that supports both close-ended and open-ended tasks
- Pengi supports open-ended tasks such as audio captioning and AQA, whereas CLAP supports only close-ended tasks
- On the remaining close-ended tasks, Pengi also outperforms CLAP

Audio Captioning
- On audio captioning, Pengi outperforms existing supervised models
- In particular, it improves performance by 6.6% on AudioCaps and by 26% on Clotho

Training the shared audio encoder with multi-task learning lets Pengi handle individual tasks better
- To verify this, performance is compared under the following settings
  - A : Pengi trained only on audio captioning data with the text prompt 'generate audio caption'
  - B : Pengi trained with multi-task learning
- As a result, multi-task learning consistently improves Pengi's audio captioning performance

Effect of Multi-Task Learning on Audio Captioning

AQA
- On AQA, Pengi improves over the existing supervised models by more than 1.5%

Zero-shot Sound Event Classification
- Pengi also achieves the best performance on zero-shot sound event classification

Text-to-Audio Retrieval
- Comparing Pengi's text-to-audio retrieval performance against other approaches:
  - Pengi outperforms models that rely on indexing and query matching
  - Pengi performs somewhat worse than contrastive models that match text directly to audio for retrieval

Next-text token Prediction for Learning Audio Representations
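- Pengi learns audio representations through next-text token prediction, so the quality of these representations is evaluated
  - Linear probing and shallow learning are applied for this purpose
- As a result, Pengi's linear probes $L_{1}, L_{3}$ outperform CLAP
  - In particular, Pengi achieves the best performance in the sound event and music domains, which suggests that next-text token prediction helps learn audio representations that are useful across diverse domains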
