[Paper Review] HiFi-Codec: Group-Residual Vector Quantization for High Fidelity Audio Codec

feVeRin 2024. 5. 17. 10:28

HiFi-Codec: Group-Residual Vector Quantization for High Fidelity Audio Codec

An audio codec compresses audio into a discrete representation, and has recently been used as an intermediate representation in generative models
However, audio codecs are held back by the lack of large-scale datasets and by the number of codebooks needed to guarantee reconstruction quality
HiFi-Codec
- Introduces Group-Residual Vector Quantization to ease the burden on the generation model
- As a result, achieves high-fidelity audio reconstruction with only 4 codebooks
Paper (arXiv 2023) : Paper Link

1. Introduction

The goal of an audio codec is to reduce the amount of data needed to store or transmit an audio signal without significantly degrading its quality
- In other words, an audio codec basically works by removing redundant and irrelevant information from the signal
- Recently, audio codecs have been used in tasks such as text-to-speech (TTS) and audio generation
  1. Notably, EnCodec and SoundStream achieve strong performance using Residual Vector Quantization (RVQ)
    - Here, RVQ represents the intermediate feature with multiple VQ codebooks
  2. These neural codecs are generally built on an encoder-decoder framework
    - The encoder compresses the waveform into a compact deep representation, and RVQ then quantizes this intermediate feature
    - The decoder recovers the waveform from the quantized representation
- In RVQ, most of the information is stored in the first codebook, while the later codebooks hold the detail information that affects audio quality
  - However, these details are sparse and scattered across the hidden space, so the existing quantization style needs a large number of codebooks for reconstruction
  - For example, EnCodec uses 12 codebooks for high-quality reconstruction
- Consequently, in audio generation, the long sequences caused by the larger number of codebooks are hard to model with a transformer, which limits the resulting quality

-> HiFi-Codec is therefore proposed to reduce the number of codebooks while still guaranteeing good reconstruction

HiFi-Codec
- Introduces Group-Residual Vector Quantization (GRVQ) to reduce the number of codebooks
  1. To do this, the latent feature $z\in\mathbb{R}^{N}$ is split into several groups, e.g. $\{z_{1},z_{2}\}$
  2. RVQ is then applied to quantize $z_{1}$ and $z_{2}$ separately
  3. Finally, the information from the two RVQ groups is combined to decode the waveform
- Since the first-layer codebook stores more information, this makes the first-layer codebooks play the main role in the compression process

< Overview of HiFi-Codec >

Group-Residual Vector Quantization is introduced to guarantee high-quality reconstruction while reducing the number of codebooks
As a result, by splitting the feature into 2 groups and applying 2 residual layers, it outperforms prior work with only 4 codebooks
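
To make the sequence-length argument concrete, here is a rough back-of-the-envelope sketch. It assumes that all codebook indices are flattened into a single token sequence and that the codec runs at 75 frames per second (24 kHz audio with 320x downsampling). Both are illustrative assumptions, and `tokens_per_second` is a hypothetical helper, not something from the paper.

```python
def tokens_per_second(frame_rate_hz, num_codebooks):
    """Tokens a generation model has to predict per second of audio,
    assuming every codebook index becomes one token in a flat sequence."""
    return frame_rate_hz * num_codebooks


frame_rate = 75          # assumption: 24 kHz audio, 320x downsampling
rvq_codebooks = 12       # EnCodec figure quoted in the text
grvq_codebooks = 2 * 2   # 2 groups x 2 residual layers

print(tokens_per_second(frame_rate, rvq_codebooks))   # 900
print(tokens_per_second(frame_rate, grvq_codebooks))  # 300
```

Under these assumptions, going from 12 to 4 codebooks cuts the sequence a transformer has to model to one third.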

2. Method

- Overview

The paper considers a single-channel audio signal $x$ of duration $d$, represented as a sequence $x\in\mathbb{R}^{T}$
- Here $T=d\times sr$, where $sr$ is the sampling rate
- Structurally, HiFi-Codec consists of 3 components:
  1. An encoder network $E$ that produces the latent feature representation $z$ from the input audio
  2. A Group-Residual Quantization layer $Q$ that produces the compressed representation $z_{q}$
  3. A decoder $G$ that reconstructs the audio signal $\hat{x}$ from the compressed latent representation $z_{q}$
- HiFi-Codec is trained end-to-end by optimizing a reconstruction loss over both the time and frequency domains, together with a perceptual loss in the form of discriminators operating at different resolutions

- Encoder and Decoder

HiFi-Codec uses an encoder-decoder architecture that includes sequential modeling over the latent representation
- The encoder $E$ consists of a 1D convolution with $C$ channels and kernel size 7, followed by $B$ convolution blocks
  1. The structure is based on EnCodec and SoundStream
  2. Each convolution block has a single residual unit followed by a downsampling layer, which is a strided convolution with stride $S$ and kernel size $K=2S$
  3. The residual unit consists of 2 convolutions with kernel size 3 and a skip connection, and the number of channels doubles at every downsampling step
- The convolution blocks are followed by a 2-layer LSTM for sequence modeling and a final 1D convolution layer with kernel size 7 and $D$ output channels
  - The paper uses $C=[32, 48, 64], B=4, S=(2,4,5,8) \, / \, (2,4,5,6) \, / \, (2,2,2,4)$
- The decoder mirrors the encoder, using transposed convolutions to output the audio signal
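
The stride settings above determine how much the encoder shortens the sequence. The sketch below works through that arithmetic: the total downsampling factor is the product of the block strides, each strided convolution has kernel size $K=2S$, and $T=d\times sr$ gives the input length. The 24 kHz sampling rate and the function names are assumptions for illustration, and padding effects are ignored.

```python
from math import prod

# Stride configurations quoted in the text.
STRIDE_CONFIGS = [(2, 4, 5, 8), (2, 4, 5, 6), (2, 2, 2, 4)]


def downsampling_factor(strides):
    """Total temporal downsampling: product of the per-block strides."""
    return prod(strides)


def kernel_sizes(strides):
    """Kernel size of each strided convolution, K = 2 * S."""
    return [2 * s for s in strides]


def num_frames(duration_s, sample_rate, strides):
    """Approximate latent frame count for T = d * sr input samples."""
    num_samples = int(duration_s * sample_rate)
    return num_samples // downsampling_factor(strides)


sr = 24_000  # assumed sampling rate, for illustration only
for strides in STRIDE_CONFIGS:
    print(strides, downsampling_factor(strides),
          kernel_sizes(strides), num_frames(1.0, sr, strides))
```

With these assumptions, one second of audio becomes 75, 100, and 750 latent frames for the three configurations respectively.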

- Group-Residual Vector Quantization (GRVQ)

GRVQ is designed to use fewer quantizers while maintaining good reconstruction quality
- Conventional RVQ has the drawback that the first codebook layer stores most of the information, while the remaining codebooks store only a small part of it
- GRVQ therefore adds more codebooks to the first layer, as in [Algorithm 1] below
  1. That is, the latent feature representation $z$ is split into several groups, and each group feature is quantized with its own RVQ
    - The paper splits $z$ into two groups, $z_{1}, z_{2}$
  2. The outputs of the group RVQs are then concatenated to obtain the final quantization

Group-Residual Vector Quantization Algorithm
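
The split, quantize, and concatenate steps can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation: the codebooks are random rather than trained, the split is assumed to be along the channel dimension, and the sizes (`T`, `D`, `K`) and function names are hypothetical.

```python
import numpy as np


def vq(x, codebook):
    """Nearest-neighbour lookup. x: (T, d), codebook: (K, d)."""
    dists = ((x[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    idx = dists.argmin(axis=1)
    return codebook[idx], idx


def rvq(x, codebooks):
    """Residual VQ: each layer quantizes what the previous layers left over."""
    residual = x
    quantized = np.zeros_like(x)
    indices = []
    for codebook in codebooks:
        q, idx = vq(residual, codebook)
        quantized = quantized + q
        residual = residual - q
        indices.append(idx)
    return quantized, indices


def grvq(z, group_codebooks):
    """Group-residual VQ: split channels into groups, run one RVQ per
    group, then concatenate the group outputs."""
    groups = np.split(z, len(group_codebooks), axis=-1)
    outputs, all_indices = [], []
    for z_g, codebooks in zip(groups, group_codebooks):
        q, idx = rvq(z_g, codebooks)
        outputs.append(q)
        all_indices.append(idx)
    return np.concatenate(outputs, axis=-1), all_indices


rng = np.random.default_rng(0)
T, D, K = 10, 8, 16           # hypothetical frame count, channels, codebook size
n_groups, n_layers = 2, 2     # 2 groups x 2 residual layers = 4 codebooks
z = rng.normal(size=(T, D))
group_codebooks = [
    [rng.normal(size=(K, D // n_groups)) for _ in range(n_layers)]
    for _ in range(n_groups)
]
z_q, codes = grvq(z, group_codebooks)
print(z_q.shape, sum(len(c) for c in codes))
```

Each frame ends up described by 4 indices (one per codebook), while every individual codebook only has to cover half of the channels.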

- Discriminator

HiFi-Codec uses 3 discriminators: the Multi-Scale STFT (MS-STFT) discriminator from EnCodec, and the Multi-Period Discriminator (MPD) and Multi-Scale Discriminator (MSD) from HiFi-GAN
- The MS-STFT discriminator consists of identically structured networks operating on multi-scale complex-valued STFTs whose real and imaginary parts are concatenated
  1. Each sub-network consists of 2D convolutions with increasing dilation rates $(1,2,4)$ along the time dimension and a stride of 2 along the frequency axis
  2. A final 2D convolution with kernel size $3\times 3$ and stride $(1,1)$ then produces the prediction
    - The STFT window lengths span different scales: $[2048, 1024,512,256,128]$
- MPD and MSD use the same structure as in HiFi-GAN, but with fewer channels so that their parameter count is similar to that of the MS-STFT discriminator
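
As a rough picture of what the MS-STFT sub-networks receive, the sketch below computes an STFT at each window length and stacks the real and imaginary parts as two input channels. The hop length of a quarter window, the Hann window, the lack of padding, and the 24 kHz test signal are all assumptions made for this example, and `stft_real_imag` is a hypothetical helper.

```python
import numpy as np

WINDOW_LENGTHS = [2048, 1024, 512, 256, 128]  # scales quoted in the text


def stft_real_imag(x, win_length, hop_length=None):
    """STFT of a 1D signal, returned as (2, frames, freq_bins) with the
    real and imaginary parts stacked as two channels."""
    if hop_length is None:
        hop_length = win_length // 4  # assumption, not stated in the text
    window = np.hanning(win_length)
    n_frames = 1 + (len(x) - win_length) // hop_length
    frames = np.stack([
        x[i * hop_length: i * hop_length + win_length] * window
        for i in range(n_frames)
    ])
    spec = np.fft.rfft(frames, axis=-1)  # (frames, win_length // 2 + 1)
    return np.stack([spec.real, spec.imag])


x = np.random.default_rng(0).normal(size=24_000)  # 1 s of noise at an assumed 24 kHz
inputs = {w: stft_real_imag(x, w) for w in WINDOW_LENGTHS}
for w, s in inputs.items():
    print(w, s.shape)
```

Longer windows give finer frequency resolution with fewer frames, and shorter windows the opposite, which is why one sub-network per scale is used.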

- Training Loss

HiFi-Codec optimizes both the generator and the discriminators under a GAN objective
- The generator's training objective consists of a time-domain term, a frequency-domain term, the losses from the 3 discriminators, a feature loss, and the GRVQ commitment loss
- The discriminator loss is based on the adversarial hinge loss
Reconstruction Loss
- The reconstruction loss has two parts: a time-domain loss and a time-frequency loss
- The time-domain loss minimizes the $L1$ distance between $x$ and $\hat{x}$
- The time-frequency loss follows EnCodec and applies a loss term on mel-spectrograms at several time scales
Discriminator Loss
- The adversarial loss is used to improve perceptual quality
  - The MS-STFT discriminator pushes the spectrogram-level reconstruction toward the original
  - MPD and MSD push the waveform-level reconstruction toward the original
- The discriminators are trained by optimizing the following objective function:
  (Eq. 1) $\mathcal{L}_{d}=\frac{1}{K}\sum_{k=1}^{K}\left[\max(0,1-D_{k}(x))+\max(0,1+D_{k}(\hat{x}))\right]$
  - $K$ : number of discriminators
- The generator's adversarial loss is defined as a hinge loss on each discriminator's logits:
  (Eq. 2) $\mathcal{L}_{adv}=\frac{1}{K}\sum_{k=1}^{K}\max(0,1-D_{k}(\hat{x}))$
- Finally, the feature loss is the average absolute difference between the discriminators' internal layer outputs for the generated audio and for the corresponding ground-truth audio:
  (Eq. 3) $\mathcal{L}_{feat}=\frac{1}{KL}\sum_{k=1}^{K}\sum_{l=1}^{L}\frac{|| D_{k}^{l}(x)-D_{k}^{l}(\hat{x})||_{1}}{\textrm{mean}(|| D_{k}^{l}(x)||_{1})}$
  - $L$ : number of internal layers per discriminator, $D_{k}^{l}$ : output of the $l$-th layer of the $k$-th discriminator
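
The three terms in Eq. 1 to Eq. 3 can be sketched in NumPy as below. This is an illustration under assumptions: each discriminator is taken to return an array of logits that is averaged before summing over discriminators, the normalization in Eq. 3 is read as mean absolute difference divided by the mean absolute value of the real features, and every discriminator is assumed to have the same number of layers. The function names are hypothetical.

```python
import numpy as np


def discriminator_hinge_loss(real_logits, fake_logits):
    """Eq. 1: mean over K discriminators of
    max(0, 1 - D_k(x)) + max(0, 1 + D_k(x_hat))."""
    per_disc = [
        np.mean(np.maximum(0.0, 1.0 - r)) + np.mean(np.maximum(0.0, 1.0 + f))
        for r, f in zip(real_logits, fake_logits)
    ]
    return float(np.mean(per_disc))


def generator_adv_loss(fake_logits):
    """Eq. 2: mean over K discriminators of max(0, 1 - D_k(x_hat))."""
    per_disc = [np.mean(np.maximum(0.0, 1.0 - f)) for f in fake_logits]
    return float(np.mean(per_disc))


def feature_loss(real_feats, fake_feats):
    """Eq. 3: real_feats[k][l] is the layer-l output of discriminator k
    for the real audio; fake_feats is the same for the generated audio."""
    terms = []
    for layers_real, layers_fake in zip(real_feats, fake_feats):
        for r, f in zip(layers_real, layers_fake):
            terms.append(np.mean(np.abs(r - f)) / np.mean(np.abs(r)))
    return float(np.mean(terms))


# Toy inputs with a single discriminator (K = 1) and a single layer (L = 1).
real_logits = [np.array([2.0, 0.5])]
fake_logits = [np.array([-2.0, 0.0])]
print(discriminator_hinge_loss(real_logits, fake_logits))  # 0.75
print(generator_adv_loss(fake_logits))                     # 2.0
print(feature_loss([[np.array([1.0, -1.0])]], [[np.array([0.0, 0.0])]]))  # 1.0
```

Note the sign difference: the discriminator is rewarded for pushing $D_{k}(\hat{x})$ below -1, while the generator is rewarded for pushing it above 1.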
GRVQ Loss
- For the $c$-th residual quantizer of the $i$-th group, the commitment loss is:
  (Eq. 4) $\mathcal{L}_{c}=\sum_{i,c}|| z_{i,c}-q_{i,c}(z_{i,c})||_{2}^{2}$
- The generator is then trained with:
  (Eq. 5) $\mathrm{Loss}_{G}=\lambda_{adv}\mathcal{L}_{adv}+\lambda_{feat}\mathcal{L}_{feat}+\lambda_{rec}\mathcal{L}_{rec}+\lambda_{c}\mathcal{L}_{c} $
  - $\mathcal{L}_{adv}$ : adversarial loss, $\mathcal{L}_{feat}$ : feature loss, $\mathcal{L}_{rec}$ : reconstruction loss, $\mathcal{L}_{c}$ : commitment loss
  - $\lambda_{adv}, \lambda_{feat}, \lambda_{rec}, \lambda_{c}$ : hyperparameters
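
Eq. 4 and Eq. 5 amount to a sum of squared errors over all quantizers followed by a weighted sum of the four loss terms. A minimal sketch, assuming the per-quantizer inputs and outputs are available as lists of arrays; the weight values below are placeholders chosen for this example, not values reported in the paper.

```python
import numpy as np


def commitment_loss(quantizer_inputs, quantizer_outputs):
    """Eq. 4: sum over all (group i, layer c) pairs of
    ||z_{i,c} - q_{i,c}(z_{i,c})||_2^2."""
    return float(sum(
        np.sum((z - q) ** 2)
        for z, q in zip(quantizer_inputs, quantizer_outputs)
    ))


def generator_loss(l_adv, l_feat, l_rec, l_c, weights):
    """Eq. 5: weighted sum of the adversarial, feature, reconstruction,
    and commitment losses."""
    return (weights["adv"] * l_adv + weights["feat"] * l_feat
            + weights["rec"] * l_rec + weights["c"] * l_c)


# Placeholder weights, for illustration only.
weights = {"adv": 1.0, "feat": 1.0, "rec": 1.0, "c": 1.0}

l_c = commitment_loss([np.array([1.0, 2.0])], [np.array([0.0, 0.0])])
print(l_c)                                          # 5.0
print(generator_loss(1.0, 2.0, 3.0, l_c, weights))  # 11.0
```

In an autograd framework the quantizer output in the commitment term would typically be treated as a constant (stop-gradient), so the term only pulls the encoder output toward its assigned code.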

3. Experiments

- Settings

Dataset : LibriTTS, VCTK, AISHELL
Comparisons : EnCodec, SoundStream

- Results

HiFi-Codec achieves the best reconstruction performance while using only 4 codebooks
- Meanwhile, the best performance is obtained with the downsampling factor set to 240 and the codebook setting at 8
