[Paper Review] PAST: Phonetic-Acoustic Speech Tokenizer

feVeRin 2025. 9. 24. 17:02

PAST: Phonetic-Acoustic Speech Tokenizer

Signal reconstruction and phonetic information can be modeled jointly
PAST
- Integrates domain knowledge into the tokenization process through auxiliary tasks, using supervised phonetic data instead of a pre-trained self-supervised model
- Additionally provides a streamable architecture for real-time applications
Paper (INTERSPEECH 2025) : Paper Link

1. Introduction

Speech language models generally use either acoustic tokens or phonetic (semantic) tokens
- Acoustic tokenizers such as EnCodec and SoundStream enable high-fidelity waveform reconstruction, but they lack external text supervision and are therefore poorly suited to language modeling
  - Phonetic tokenizers such as Wav2Vec 2.0 and HuBERT capture linguistic information, but their reconstruction quality is lower
- Meanwhile, hybrid tokenizers such as SpeechTokenizer and X-Codec integrate phonetic and acoustic information into a unified representation
  - BUT, hybrid approaches depend on pre-trained Self-Supervised Learning (SSL) models, which makes them computationally expensive and makes it hard to fully capture the phonetic richness of the input

-> The paper therefore proposes PAST, a hybrid tokenizer that does not depend on external pre-trained models

PAST
- Jointly learns phonetic and acoustic representations from supervised data, without a pre-trained model or an external vocoder
- Additionally provides a streaming-compatible variant that operates causally, using only previous context

< Overview of PAST >

A hybrid neural codec that is trained jointly, without external models
As a result, it outperforms existing approaches

2. Method

- Problem Setup

PAST consists of three components: an encoder, a quantizer, and a decoder
- Given a waveform signal $x\in\mathbb{R}^{f_{s}\cdot t}$ of duration $t$ sampled at rate $f_{s}$, the encoder transforms $x$ into a dense latent representation $z\in\mathbb{R}^{D\times T}$
  - $T=f_{r}\cdot t$ : temporal resolution of the latent space, determined by the frame rate $f_{r}$, $D$ : latent dimension
- Next, the quantizer processes $z$ to produce the quantized latent representation $\hat{z}\in\mathbb{R}^{D\times T}$
- Finally, the decoder reconstructs the original signal from $\hat{z}$, producing $\hat{x}\in\mathbb{R}^{f_{s}\cdot t}$
- To capture phonetic content in the encoded latent representation, the paper uses paired phoneme-level and character-level transcriptions as supervision
  - That is, PAST aims to minimize the reconstruction error between $x$ and $\hat{x}$ while making the encoded latent representation $z$ capture meaningful phonetic information
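
The shape bookkeeping above can be sketched with a few lines of Python. The sampling rate (16 kHz), frame rate (50 Hz), and latent dimension (128) used here are illustrative assumptions rather than values quoted in this review, and `latent_shapes` is a hypothetical helper name.

```python
def latent_shapes(duration_s, sample_rate_hz, frame_rate_hz, latent_dim):
    """Return (number of waveform samples, (D, T)) for a clip of the given duration."""
    num_samples = int(round(sample_rate_hz * duration_s))  # length of x: f_s * t
    num_frames = int(round(frame_rate_hz * duration_s))    # T = f_r * t
    return num_samples, (latent_dim, num_frames)


# Assumed values, for illustration only: 2 s of audio, f_s = 16 kHz, f_r = 50 Hz, D = 128
num_samples, z_shape = latent_shapes(2.0, 16000, 50, 128)
samples_per_frame = num_samples // z_shape[1]  # waveform samples covered by one latent frame
print(num_samples, z_shape, samples_per_frame)
```

Under these assumed rates, each latent frame summarizes a fixed-size chunk of the waveform, and both $z$ and $\hat{z}$ share the same $D\times T$ shape.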

- Model Architecture

The PAST architecture is based on EnCodec
- The encoder block consists of a convolutional encoder module and a Transformer encoder module
- For training stability, the input to the quantization module is chosen from three modes:
  1. The Transformer block output, with probability $p_{trans\text{-}only}$
  2. The encoder (skip-connection) output, with probability $p_{skip\text{-}only}$
  3. The average of the two outputs above
    - At inference, only the averaged representation is used
- The quantization module applies Residual Vector Quantization (RVQ)
  1. The RVQ component consists of $N_{q}$ sequential Vector Quantization (VQ) layers that iteratively quantize $z$ and its residuals
  2. That is, given $z\in\mathbb{R}^{D\times T}$, for each frame $\tau\in[T]$ the first VQ module replaces $z_{\tau}$ with the closest entry in a learned embedding table, producing $\hat{z}_{1}$
    - The process is then repeated on the residual by each subsequent VQ layer $i\in\{2,...,N_{q}\}$: $\text{VQ}_{i}(z-\sum_{j\in[i-1]}\hat{z}_{j})=\hat{z}_{i}$
- The RVQ module outputs $N_{q}$ quantized streams, each represented either as the quantized vectors $\hat{z}_{i}$ or as the indices $q_{i}\in\mathbb{N}^{T}$ into the embedding table of $\text{VQ}_{i}$
- The decoder mirrors the convolutional encoder module, replacing strided convolutions with transposed convolution layers
  - The decoder input is $\hat{z}=\sum_{i\in[N_{q}]}\hat{z}_{i}$
- Streamable Configuration
  1. The streamable variant of PAST uses causal convolutions with left-only padding, a unidirectional LSTM, and causal attention
  2. This configuration requires a $20$ms look-ahead on the audio signal
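
The quantization path described above (choosing the quantizer input, then running RVQ and summing the streams for the decoder) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: frames are stored as a list of $T$ vectors of length $D$ (the transpose of the $D\times T$ notation), the averaged mode is assumed to take the probability mass left over by the two single-source modes, the codebooks are tiny hand-written tables rather than learned ones, and all function names are hypothetical.

```python
import random


def select_quantizer_input(trans_out, skip_out, p_trans_only, p_skip_only,
                           training, rng=random):
    """Pick the quantizer input: Transformer-only, skip-only, or the average of both."""
    if training:
        u = rng.random()
        if u < p_trans_only:
            return trans_out
        if u < p_trans_only + p_skip_only:
            return skip_out
    # Assumed: the average covers the remaining cases in training; it is always used at inference
    return [[(a + b) / 2 for a, b in zip(ta, sa)] for ta, sa in zip(trans_out, skip_out)]


def nearest(codebook, vec):
    """Index of the codebook entry closest to vec (squared Euclidean distance)."""
    return min(range(len(codebook)),
               key=lambda k: sum((c - v) ** 2 for c, v in zip(codebook[k], vec)))


def rvq_encode(z, codebooks):
    """Quantize frames z (T x D) with N_q codebooks; return (indices, z_hat)."""
    residual = [list(frame) for frame in z]
    z_hat = [[0.0] * len(frame) for frame in z]
    indices = []  # indices[i][tau] plays the role of q_i at frame tau
    for codebook in codebooks:
        layer_indices = []
        for tau, frame in enumerate(residual):
            k = nearest(codebook, frame)
            layer_indices.append(k)
            for d, c in enumerate(codebook[k]):
                z_hat[tau][d] += c     # accumulate the decoder input: sum of all z_hat_i
                residual[tau][d] -= c  # the next layer quantizes what is still unexplained
        indices.append(layer_indices)
    return indices, z_hat


# Toy example: T = 2 frames, D = 2, N_q = 2 layers with 2 entries each (made-up numbers)
trans_out = [[1.0, 1.0], [0.2, 0.0]]
skip_out = [[1.4, 0.6], [0.0, -0.2]]
codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],
    [[0.0, 0.0], [0.25, -0.25]],
]
z = select_quantizer_input(trans_out, skip_out, 0.25, 0.25, training=False)
indices, z_hat = rvq_encode(z, codebooks)
print(indices, z_hat)
```

Each additional layer only has to encode what the previous layers missed, which is why the first stream $\hat{z}_{1}$ carries the coarsest information and is a natural place to attach supervision.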

- Auxiliary Heads

To encode phonetic information, the paper introduces auxiliary heads and training objectives that operate on the first quantized output stream $\hat{z}_{1}$
- This replaces pseudo-phonetic token distillation with direct supervision from target character transcriptions and phonemes
CTC Character Match
- The CTC auxiliary head takes $\hat{z}_{1}\in\mathbb{R}^{D\times T}$ as input and outputs $y\in\mathbb{R}^{|M|\times T}$, a distribution over the full character set $M$ for each frame
- Structurally, the module consists of a linear projection from $D$ to a hidden dimension $h$, a single-layer BiLSTM, and a linear projection from $h$ to $|M|$
- To align the predicted sequence with the transcription target, the Connectionist Temporal Classification (CTC) loss $\mathcal{L}_{ctc}=\text{CTC}(y|\text{chars})$ is applied
Phoneme Classification
- The second auxiliary head is a simple linear projection that takes $\hat{z}_{1}$ as input and outputs $\hat{p}\in\mathbb{R}^{|P|\times T}$, a distribution over the full phoneme set $P$ for each frame
- This auxiliary head is trained with the Cross-Entropy loss $\mathcal{L}_{phn}=\text{CE}(\hat{p},p)$
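
To make the two auxiliary losses concrete, here is a small pure-Python sketch of what they compute. It assumes the heads already output normalized per-frame probabilities, that index 0 of the character set is the CTC blank, and that frame-aligned phoneme labels are available; these are illustrative assumptions, and `ctc_loss` and `frame_cross_entropy` are hypothetical names. A real training setup would more likely rely on a deep learning framework's built-in losses.

```python
import math


def ctc_loss(probs, target, blank=0):
    """Negative log-likelihood of `target` under CTC (forward algorithm).

    probs:  T per-frame distributions over the character set (index `blank` is the blank)
    target: character indices without blanks
    """
    ext = [blank]
    for c in target:
        ext += [c, blank]  # blank-interleaved target: b c1 b c2 b ...
    num_states = len(ext)
    alpha = [0.0] * num_states
    alpha[0] = probs[0][ext[0]]
    if num_states > 1:
        alpha[1] = probs[0][ext[1]]
    for t in range(1, len(probs)):
        prev, alpha = alpha, [0.0] * num_states
        for s in range(num_states):
            total = prev[s]
            if s >= 1:
                total += prev[s - 1]
            # Skipping over a blank is only allowed between two different characters
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                total += prev[s - 2]
            alpha[s] = total * probs[t][ext[s]]
    likelihood = alpha[-1] + (alpha[-2] if num_states > 1 else 0.0)
    return -math.log(likelihood) if likelihood > 0 else math.inf


def frame_cross_entropy(probs, labels):
    """Mean cross-entropy between per-frame phoneme distributions and frame labels."""
    return -sum(math.log(p[y]) for p, y in zip(probs, labels)) / len(labels)


# Two frames, vocabulary {blank, 'a'}: the paths "a-", "-a", and "aa" all collapse to "a"
char_probs = [[0.5, 0.5], [0.5, 0.5]]
l_ctc = ctc_loss(char_probs, target=[1])

# Two frames, two phonemes, with made-up frame-level labels
phn_probs = [[0.5, 0.5], [0.25, 0.75]]
l_phn = frame_cross_entropy(phn_probs, labels=[0, 1])
print(l_ctc, l_phn)
```

The CTC head needs only an unaligned character sequence because the loss sums over every frame-level path that collapses to the target, whereas the phoneme head in this sketch needs one label per frame.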

- Training Objective

The paper adds two auxiliary terms to the EnCodec reconstruction objective $\mathcal{L}_{EnCodec}$
- The overall training objective of PAST is then:
  (Eq. 1) $\mathcal{L}=\lambda_{ctc}\mathcal{L}_{ctc}+\lambda_{phn}\mathcal{L}_{phn}+\mathcal{L}_{EnCodec}$
  - $\lambda_{ctc},\lambda_{phn}$ : weights of the CTC loss $\mathcal{L}_{ctc}$ and the phoneme loss $\mathcal{L}_{phn}$, respectively
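
As a quick numeric illustration of Eq. 1, the sketch below combines three scalar losses. The loss values and both weights are placeholders chosen for easy arithmetic, not settings reported in the paper, and `past_objective` is a hypothetical name.

```python
def past_objective(l_encodec, l_ctc, l_phn, lambda_ctc, lambda_phn):
    """Eq. 1: weighted auxiliary losses added to the EnCodec objective."""
    return lambda_ctc * l_ctc + lambda_phn * l_phn + l_encodec


# Placeholder numbers only: 2.0 * 1.5 + 4.0 * 0.5 + 2.0
total = past_objective(l_encodec=2.0, l_ctc=1.5, l_phn=0.5, lambda_ctc=2.0, lambda_phn=4.0)
print(total)
```

Setting both weights to zero recovers the plain EnCodec objective, so the weights control how strongly phonetic supervision competes with reconstruction.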

3. Experiments

- Settings

Dataset : LibriSpeech, TIMIT
Comparisons : SpeechTokenizer, X-Codec

- Results

Overall, PAST achieves the best performance

It also shows strong performance in terms of signal reconstruction

PAST also performs best in terms of Speech Language Modeling (SLM)

Component Analysis
- The best performance is obtained when all components are used together

Using skip-connection dropout yields better performance
