[Paper Review] Speech2Vec: A Sequence-to-Sequence Framework for Learning Word Embeddings from Speech

feVeRin 2025. 3. 20. 21:44

Speech2Vec: A Sequence-to-Sequence Framework for Learning Word Embeddings from Speech

Learning fixed-length vector representations of audio segments from a speech corpus makes it possible to capture semantic information
Speech2Vec
- Built on an RNN encoder-decoder framework, it learns embeddings in which semantically similar words lie close together
- Uses Skipgrams and Continuous Bag-of-Words for training
Paper (INTERSPEECH 2018) : Paper Link

1. Introduction

In Natural Language Processing (NLP), methods such as Word2Vec and GloVe convert words into fixed-dimensional vectors, i.e., word embeddings
- Vector representations can be obtained from speech in a similar way
- BUT, existing speech representations are based on acoustic-phonetic rather than semantic notions, so different instances of the same word are simply mapped to the same point in the latent embedding space

-> The paper therefore proposes Speech2Vec, an acoustic embedding model that focuses on the neighboring acoustic regions rather than on the acoustic segment of the word itself

Speech2Vec
- Handles audio segments of arbitrary length with an RNN Encoder-Decoder framework
- Trains the model with the Skipgrams/Continuous Bag-of-Words (CBOW) approaches, as in Word2Vec

< Overview of Speech2Vec >

Skipgrams and CBOW training are applied on top of an RNN Encoder-Decoder framework
As a result, the learned embeddings outperform the existing baseline (Word2Vec in the experiments)

2. Method

Given a word represented as a variable-length sequence of acoustic features (e.g., MFCCs),
- The goal is to learn a fixed-length embedding of the audio segment $\mathbf{x}=(\mathbf{x}_{1},\mathbf{x}_{2},...,\mathbf{x}_{T})$
  - $\mathbf{x}_{t}$ : acoustic feature at time $t$, $T$ : sequence length
- The resulting word embedding should describe the semantics of the original audio segment

- RNN Encoder-Decoder Framework

A Recurrent Neural Network (RNN) encoder-decoder consists of an Encoder RNN and a Decoder RNN
- First, for an input sequence $\mathbf{x}=(\mathbf{x}_{1},\mathbf{x}_{2},...,\mathbf{x}_{T})$, the encoder reads each symbol $\mathbf{x}_{t}$ sequentially and updates the RNN hidden state $\mathbf{h}_{t}$
- After the last symbol $\mathbf{x}_{T}$ has been processed, the corresponding hidden state $\mathbf{h}_{T}$ is interpreted as the learned representation of the entire input sequence
- Finally, the decoder initializes its hidden state with $\mathbf{h}_{T}$ and sequentially generates the output sequence $\mathbf{y}=(\mathbf{y}_{1},\mathbf{y}_{2},...,\mathbf{y}_{T'})$
  - $T$ and $T'$ may differ
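
The encoder-decoder flow above can be sketched in NumPy. This is only a minimal illustration with a vanilla tanh RNN and random, untrained weights; the cell type, the sizes, the names (`rnn_step`, `encode`, `decode`), and feeding the previous output back as the next decoder input are assumptions made for the sketch, not details taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
FEAT, HID = 13, 8  # assumed sizes: 13-dim acoustic features, 8-dim hidden state

# Randomly initialised parameters (untrained; for shape illustration only).
W_enc = rng.normal(scale=0.1, size=(HID, FEAT + HID))
W_dec = rng.normal(scale=0.1, size=(HID, FEAT + HID))
W_out = rng.normal(scale=0.1, size=(FEAT, HID))


def rnn_step(W, x_t, h_prev):
    """One vanilla RNN update: h_t = tanh(W [x_t; h_{t-1}])."""
    return np.tanh(W @ np.concatenate([x_t, h_prev]))


def encode(x):
    """Read x_1..x_T in order and return the final hidden state h_T."""
    h = np.zeros(HID)
    for x_t in x:
        h = rnn_step(W_enc, x_t, h)
    return h


def decode(h_T, out_len):
    """Generate y_1..y_T' from a hidden state initialised with h_T."""
    h, y_prev, ys = h_T, np.zeros(FEAT), []
    for _ in range(out_len):
        h = rnn_step(W_dec, y_prev, h)
        y_prev = W_out @ h          # assumed linear read-out of the hidden state
        ys.append(y_prev)
    return np.stack(ys)


x = rng.normal(size=(20, FEAT))   # input segment with T = 20 frames
h_T = encode(x)                   # fixed-length representation of the whole segment
y = decode(h_T, out_len=27)       # output with T' = 27 frames, so T != T'
print(h_T.shape, y.shape)
```

Whatever the input length, `encode` returns a vector of size `HID`, which is what lets a variable-length audio segment be summarized by a fixed-length embedding.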

- Speech2Vec

The paper introduces two methods for training Speech2Vec: Skipgrams and Continuous Bag-of-Words (CBOW)
Training Speech2Vec with Skipgrams
- With Skipgrams, for each audio segment (word) $\mathbf{x}^{(n)}$ in the speech corpus, Speech2Vec is trained to predict the audio segments (nearby words) $\{\mathbf{x}^{(n-k)},...,\mathbf{x}^{(n-1)},\mathbf{x}^{(n+1)},...,\mathbf{x}^{(n+k)}\}$ that precede and follow $\mathbf{x}^{(n)}$ within a certain range $k$
- During training, the encoder takes $\mathbf{x}^{(n)}$ as input and encodes it into a fixed-dimensional vector representation $\mathbf{z}^{(n)}$
  1. The decoder then maps $\mathbf{z}^{(n)}$ to several output sequences $\mathbf{y}^{(i)},i\in\{n-k,...,n-1,n+1,...,n+k\}$
  2. The model is trained to minimize the gap between the output sequences and the corresponding nearby audio segments, measured by the Mean Squared Error $\sum_{i}||\mathbf{x}^{(i)}-\mathbf{y}^{(i)}||^{2}$
- To decode the nearby audio segments successfully, the encoded vector representation $\mathbf{z}^{(n)}$ has to contain sufficient semantic information about the current audio segment $\mathbf{x}^{(n)}$
  - After training, $\mathbf{z}^{(n)}$ is taken as the word embedding of $\mathbf{x}^{(n)}$
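
The Skipgrams objective above can be illustrated with a small NumPy sketch that picks the neighbor indices and evaluates $\sum_{i}||\mathbf{x}^{(i)}-\mathbf{y}^{(i)}||^{2}$. The helper names (`context_indices`, `skipgram_loss`), the clipping of the window at utterance boundaries, the 13-dim features, and the all-zero stand-in for decoder outputs are assumptions for illustration; each $\mathbf{y}^{(i)}$ is also assumed to have the same length as its target $\mathbf{x}^{(i)}$.

```python
import numpy as np


def context_indices(n, k, num_segments):
    """Indices n-k..n-1, n+1..n+k, clipped to the utterance boundaries."""
    return [i for i in range(n - k, n + k + 1) if i != n and 0 <= i < num_segments]


def skipgram_loss(neighbours, outputs):
    """Sum over i of ||x^(i) - y^(i)||^2 for the decoded neighbours."""
    return sum(float(np.sum((x_i - y_i) ** 2)) for x_i, y_i in zip(neighbours, outputs))


rng = np.random.default_rng(0)
# Toy utterance: 6 word segments with different frame counts.
segments = [rng.normal(size=(t, 13)) for t in (12, 30, 18, 25, 9, 14)]

n, k = 2, 2
idx = context_indices(n, k, len(segments))   # neighbours of segment 2: [0, 1, 3, 4]
neighbours = [segments[i] for i in idx]

# Stand-in for the decoder outputs y^(i): all-zero sequences of matching length.
outputs = [np.zeros_like(x_i) for x_i in neighbours]
loss = skipgram_loss(neighbours, outputs)
print(idx, loss)
```

In an actual training loop the `outputs` would come from the decoder conditioned on $\mathbf{z}^{(n)}$, and the loss would be backpropagated through both decoder and encoder.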
Training Speech2Vec with CBOW
- Whereas Skipgrams Speech2Vec predicts the nearby audio segments from $\mathbf{z}^{(n)}$, CBOW Speech2Vec sets $\mathbf{x}^{(n)}$ as the target and aims to infer it from the nearby audio segments
- During training, all nearby audio segments are encoded by a shared encoder into $\mathbf{h}^{(i)},i\in\{n-k,...,n-1,n+1,...,n+k\}$, and their sum $\mathbf{z}^{(n)}=\sum_{i}\mathbf{h}^{(i)}$ is used by the decoder to generate $\mathbf{x}^{(n)}$
  - After training, $\mathbf{z}^{(n)}$ is used as the word embedding of $\mathbf{x}^{(n)}$
- Empirically, Skipgrams Speech2Vec performs better than CBOW Speech2Vec
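
The CBOW aggregation $\mathbf{z}^{(n)}=\sum_{i}\mathbf{h}^{(i)}$ can be sketched as follows. The toy tanh-RNN encoder with random weights, the sizes, and the name `cbow_context_vector` are assumptions for illustration only; the point is that a single encoder is shared by all neighbors and that the target segment itself is excluded from the sum.

```python
import numpy as np

rng = np.random.default_rng(0)
FEAT, HID = 13, 8  # assumed feature and hidden sizes
W = rng.normal(scale=0.1, size=(HID, FEAT + HID))  # one weight matrix shared by all neighbours


def encode(x):
    """Toy shared encoder: vanilla tanh RNN that returns the last hidden state."""
    h = np.zeros(HID)
    for x_t in x:
        h = np.tanh(W @ np.concatenate([x_t, h]))
    return h


def cbow_context_vector(segments, n, k):
    """z^(n) = sum of h^(i) over the neighbours of segment n (target excluded)."""
    idx = [i for i in range(n - k, n + k + 1) if i != n and 0 <= i < len(segments)]
    return np.sum([encode(segments[i]) for i in idx], axis=0)


# Toy utterance: 5 word segments with different frame counts.
segments = [rng.normal(size=(t, FEAT)) for t in (12, 30, 18, 25, 9)]
z = cbow_context_vector(segments, n=2, k=2)
print(z.shape)  # (8,)
```

The decoder would then be initialized with `z` and trained to reconstruct `segments[2]`.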

- Differences Between Speech2Vec and Word2Vec

Speech2Vec is the speech version of Word2Vec: it aims to learn fixed-length embeddings of audio segments that capture the semantic information of the spoken words in the audio
- BUT, it differs from Word2Vec in the following ways:
  1. The Word2Vec architecture is a fully-connected neural network that takes one-hot encoded vectors as input/output
    - Speech2Vec instead uses an RNN encoder-decoder to handle the variable length of acoustic features
  2. In Word2Vec, the embedding of a particular word is deterministic
    - That is, every instance of the same word is represented by a single embedding vector
    - In Speech2Vec, every instance of a spoken word is different, so instances of the same word are represented by similar but distinct embedding vectors
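
The second difference can be made concrete with a toy comparison. In the sketch below, the weight matrix `E`, the sizes, and `toy_speech_embed` (a mean over frames standing in for a trained encoder) are all hypothetical; the second utterance is an assumed shortened, noisy repetition of the first.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, DIM = 5, 4
E = rng.normal(size=(VOCAB, DIM))  # toy Word2Vec-style input weight matrix


def word2vec_embed(word_id):
    """A one-hot input times the weight matrix reduces to a row lookup."""
    one_hot = np.eye(VOCAB)[word_id]
    return one_hot @ E


def toy_speech_embed(x):
    """Stand-in for a trained encoder: any function of the audio frames."""
    return x.mean(axis=0)[:DIM]


# Text: every occurrence of word id 3 yields the same vector.
same_text = np.allclose(word2vec_embed(3), word2vec_embed(3))

# Speech: two utterances of one word differ in length and content.
utt_a = rng.normal(size=(20, 13))
utt_b = utt_a[:17] + 0.05 * rng.normal(size=(17, 13))
same_speech = np.array_equal(toy_speech_embed(utt_a), toy_speech_embed(utt_b))
print(same_text, same_speech)
```

A text word always maps to one row of `E`, whereas any embedding computed from the audio itself varies from one utterance to the next.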

3. Experiments

- Settings

Dataset : LibriSpeech
Comparisons : Word2Vec

- Results

Overall, Skipgrams Speech2Vec achieves the best performance
- This is attributed to Skipgrams Speech2Vec capturing semantic information that is present in speech but not in text, such as prosody
- In addition, increasing the embedding size does not always improve performance

Impact of Training Corpus Size
- The larger the training set, the better the performance

Variance Study
- All words are partitioned into sub-groups by the number of times $N$ they appear in the corpus: $5\sim 99$, $100\sim 999$, $1000\sim 9999$, $\geq 10\text{k}$
  - Then, for a given word $w$ that appears $N$ times, the standard deviation of each dimension is computed over all of its vector representations $\{\mathbf{w}^{1},\mathbf{w}^{2},...,\mathbf{w}^{N}\}$
- As a result, the Skipgrams model shows lower variance than the CBOW model
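
The variance study can be sketched in NumPy as below. The toy embeddings, the helper names (`bucket`, `mean_dimension_std`), and the choice to average the per-dimension standard deviations into one number per word and then over the words of a sub-group are assumptions made for illustration; the text above only specifies the partition by $N$ and the per-dimension standard deviation.

```python
import numpy as np


def bucket(n):
    """Sub-group label for a word that appears n times (None if n < 5)."""
    if n >= 10_000:
        return ">=10k"
    if n >= 1_000:
        return "1000~9999"
    if n >= 100:
        return "100~999"
    if n >= 5:
        return "5~99"
    return None


def mean_dimension_std(instance_vectors):
    """Std of each embedding dimension over a word's N instances, averaged over dimensions."""
    return float(np.std(np.asarray(instance_vectors), axis=0).mean())


rng = np.random.default_rng(0)
# Toy data: word -> (N, dim) array holding one embedding per spoken instance.
embeddings = {
    "the": rng.normal(scale=0.2, size=(1200, 50)),
    "speech": rng.normal(scale=0.5, size=(150, 50)),
    "rare": rng.normal(scale=0.5, size=(3, 50)),  # fewer than 5 occurrences, skipped
}

per_group = {}
for word, vecs in embeddings.items():
    group = bucket(len(vecs))
    if group is not None:
        per_group.setdefault(group, []).append(mean_dimension_std(vecs))

for group, values in per_group.items():
    print(group, round(float(np.mean(values)), 3))
```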

Visualization
- In the $t$-SNE visualization, the learned word embeddings are shown to capture antonyms/synonyms
