[Paper Review] CQNV: A Combination of Coarsely Quantized Bitstream and Neural Vocoder for Low Rate Speech Coding

feVeRin 2024. 6. 12. 10:07

CQNV: A Combination of Coarsely Quantized Bitstream and Neural Vocoder for Low Rate Speech Coding

Existing neural codec architectures exhibit redundancy in parameter quantization
CQNV
- A neural codec that combines the coarsely quantized parameters of a parametric codec with a neural vocoder
- Introduces a parameter processing module to strengthen the bitstream of speech coding parameters and improve reconstruction quality
Paper (INTERSPEECH 2023) : Paper Link

1. Introduction

Speech coding effectively reduces the bandwidth and cost required for information transmission and storage
- Traditionally, parametric codecs such as MELP and Codec2 have been used for this purpose
  - They rely on codebooks to quantize speech parameters and generate speech similar to the original
  - BUT, as the bitrate decreases, the quantization error of these parametric codecs grows and performance degrades
- Meanwhile, neural codecs built on an encoder-decoder architecture perform well even at low bitrates thanks to compact latent representations
  1. For example, SoundStream effectively compresses 24 kHz audio at bitrates of 3 to 18 kbps
  2. EnCodec uses a quantized latent space and can compress down to a bandwidth of 0.9 kbps
    - These end-to-end neural codecs perform very well at low bitrates, but their black-box nature limits further improvement
- A neural vocoder-based approach can add interpretability that end-to-end architectures lack
  - In particular, the bitstream can be encoded by a traditional encoder and then decoded by a neural vocoder, so vocoders such as LPCNet or StyleMelGAN can be plugged in, which supports a high degree of parallelization
- BUT, although a neural vocoder-based approach combined with a parametric codec can achieve strong performance at low bitrates, the following problems must be solved
  1. Redundancy in Parameter Quantization
    - The bitstream of conventional speech coding parameters contains redundancy that the neural vocoder does not use
  2. Possible Performance Limitations
    - Neural vocoders usually take a frame-based intermediate acoustic representation such as a mel-spectrogram as input
    - BUT, a vocoder used for speech coding takes speech coding parameters as input, and this difference in context can limit performance

-> To address these drawbacks of neural vocoder-based codecs, the paper proposes CQNV, which uses coarsely quantized speech parameters

CQNV
- The speech parameters obtained from Codec2 are coarsely quantized to compress the bitrate
- The coarsely quantized parameters are then passed to a HiFi-GAN vocoder, which addresses the redundancy problem
- In addition, a 3-branch parameter processing module built from convolution layers with different dilation rates is introduced to support high-quality reconstruction

< Overview of CQNV >

Uses coarsely quantized parameters to reduce the speech bitrate while maintaining high quality
Introduces a parameter processing module that improves the neural vocoder, and applies a dynamic hyperparameter during training
As a result, CQNV achieves better quality while using a 3x lower bitrate than existing codecs

2. Method

CQNV consists of the Codec2 encoder, a de-quantizer, and a HiFi-GAN vocoder
- The parameter quantizer is trained after redesigning the encoder
  - Here the encoder quantizes the source speech, sampled at 8 kHz, with fewer bits
- In the decoder, HiFi-GAN takes the parameters from the de-quantizer and generates 16 kHz speech

- Encoder

Codec2 extracts parameters from 10 ms speech frames, and at 1.2 kbps four consecutive frames are quantized jointly
- Of these bits, 27, 16, and 4 are used to quantize the Line Spectrum Pair (LSP), pitch and energy, and voicing level parameters, respectively
- To reduce the bitrate further and to strengthen the generative power of the neural vocoder with less information, CQNV adopts coarse quantization for the LSP, pitch, and energy
  - The voicing level of each frame is characterized by a one-hot encoding, so its bitrate cannot be reduced further
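The bit allocation above can be checked with a few lines of arithmetic. The following is a minimal sketch, not taken from the paper: the Codec2 numbers come from the bullets above, the CQNV numbers (23 and 12 bits) are the ones given in the following subsections of this review, and the names `packet_bits` and `implied_bitrate_bps` are made up for illustration.

```python
# Bit allocation per 4-frame packet (values taken from this review).
FRAME_MS = 10          # one Codec2 analysis frame
FRAMES_PER_PACKET = 4  # frames that are quantized jointly

codec2_bits = {"lsp": 27, "pitch_energy": 16, "voicing": 4}
cqnv_bits = {"lsp": 23, "pitch_energy": 12, "voicing": 4}


def packet_bits(allocation):
    """Total number of bits spent on one packet."""
    return sum(allocation.values())


def implied_bitrate_bps(allocation):
    """Bits per second implied by these fields alone (no spare or framing bits)."""
    packet_ms = FRAME_MS * FRAMES_PER_PACKET
    return packet_bits(allocation) * 1000 / packet_ms


for name, alloc in (("Codec2 1.2 kbps", codec2_bits), ("CQNV", cqnv_bits)):
    print(name, packet_bits(alloc), "bits/packet,", implied_bitrate_bps(alloc), "bps")
```

The three Codec2 fields add up to 47 bits per 40 ms packet, one bit short of the 48 bits that a nominal 1.2 kbps would give. The CQNV figure is only what these three fields imply and is not necessarily the operating bitrate reported in the paper.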
Quantization of LSP
- The LSP is sampled every 4 frames, and the 3 frames in between are interpolated in the de-quantizer
- The paper uses two-stage vector quantization together with split-vector quantization for the LSP
  1. In the first stage, the 10th-order LSP is quantized with a codebook of size 512, and the quantization residual is computed
    - That is, 9 bits are used to index the code vector
  2. In the second stage, the quantization residual is split into two 5-dimensional vectors holding the odd-order and even-order components, which are quantized independently with two codebooks of size 128
    - This reduces the index bits of each second-stage codebook from 9 to 7
- All codebooks are trained with the LBG algorithm
- As a result, LSP quantization reduces the number of bits for 4 consecutive frames from 27 to 23
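The two-stage, split-vector scheme above can be sketched in plain Python. This is an illustrative sketch under stated assumptions, not the authors' implementation: the codebooks are random placeholders (the paper trains them with the LBG algorithm), and the helper names `nearest`, `encode_lsp`, `decode_lsp`, and `index_bits` are hypothetical.

```python
import random


def nearest(codebook, vec):
    """Index of the codebook entry with the smallest squared error to vec."""
    def sq_err(entry):
        return sum((a - b) ** 2 for a, b in zip(entry, vec))
    return min(range(len(codebook)), key=lambda i: sq_err(codebook[i]))


def encode_lsp(lsp, cb1, cb_odd, cb_even):
    """Stage 1 on the full 10-dim vector, stage 2 on the split residual."""
    i1 = nearest(cb1, lsp)
    residual = [x - c for x, c in zip(lsp, cb1[i1])]
    odd = residual[0::2]   # orders 1, 3, 5, 7, 9
    even = residual[1::2]  # orders 2, 4, 6, 8, 10
    return i1, nearest(cb_odd, odd), nearest(cb_even, even)


def decode_lsp(indices, cb1, cb_odd, cb_even):
    """Rebuild the LSP vector from the three codebook indices."""
    i1, i_odd, i_even = indices
    residual = [0.0] * 10
    residual[0::2] = cb_odd[i_odd]
    residual[1::2] = cb_even[i_even]
    return [c + r for c, r in zip(cb1[i1], residual)]


def index_bits(size):
    """Bits needed to address a codebook whose size is a power of two."""
    return size.bit_length() - 1


# Hypothetical random codebooks standing in for the LBG-trained ones.
rng = random.Random(0)
cb1 = [sorted(rng.uniform(0.0, 3.14) for _ in range(10)) for _ in range(512)]
cb_odd = [[rng.gauss(0.0, 0.05) for _ in range(5)] for _ in range(128)]
cb_even = [[rng.gauss(0.0, 0.05) for _ in range(5)] for _ in range(128)]

lsp = sorted(rng.uniform(0.0, 3.14) for _ in range(10))
indices = encode_lsp(lsp, cb1, cb_odd, cb_even)
lsp_hat = decode_lsp(indices, cb1, cb_odd, cb_even)
total_bits = index_bits(512) + 2 * index_bits(128)  # 9 + 7 + 7
```

Only the three indices are transmitted, so `total_bits` matches the 23 bits per packet quoted above.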
Joint Quantization of Pitch and Energy
- Pitch and energy are quantized with joint vector quantization
  - They are sampled every 2 frames, and the frames in between are interpolated in the decoder
- With a codebook dimension of 2, the $\log_{2}$ of the pitch relative to the lowest frequency and the energy in dB are:
  (Eq. 1) $x_{p}=\log_{2}\left(\frac{W_{o}}{\pi}\cdot \frac{4000}{50}\right)$
  (Eq. 2) $x_{e}=10\times \log_{10}(e+10^{-4})$
  - $W_{o}$ : pitch, $e$ : energy
- Codec2 uses a codebook of size 256 to quantize pitch and energy at 1.2 kbps
  - CQNV instead trains a codebook of size 64, which reduces the bits in a packet of 4 consecutive frames from 16 to 12
- The codebook is built by combining the prediction residual with inter-frame correlation
  - That is, the prediction vector of the current frame is based on the prediction coefficients and the previous frame
  - The prediction coefficients are 0.8 for pitch and 0.2 for energy
- In addition, the codebook is trained on residual vectors computed from the prediction and the initial vector
  - A weighted error over the different features is computed during training
  - Pitch and energy errors are given a higher weight for stationary speech than for non-stationary speech or silence
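Eq. 1, Eq. 2, and the predictive quantization step can be put together in a short sketch. Several things here are assumptions rather than details from the paper: $W_{o}$ is taken to be the fundamental frequency in radians per sample at 8 kHz, the 64-entry codebook is a made-up 8 x 8 grid instead of a trained one, the error weights are placeholders, and the function names are hypothetical.

```python
import math

COEFF = (0.8, 0.2)  # prediction coefficients for pitch and energy (from the text)


def pitch_energy_features(w0, e):
    """Eq. 1 and Eq. 2: log2 pitch feature and energy in dB."""
    x_p = math.log2((w0 / math.pi) * 4000.0 / 50.0)
    x_e = 10.0 * math.log10(e + 1e-4)
    return [x_p, x_e]


def encode_predictive(x, prev, codebook, weights):
    """Quantize the prediction residual with a weighted nearest-neighbour search."""
    predicted = [c * p for c, p in zip(COEFF, prev)]
    residual = [xi - pr for xi, pr in zip(x, predicted)]

    def weighted_error(entry):
        return sum(w * (r - q) ** 2 for w, r, q in zip(weights, residual, entry))

    index = min(range(len(codebook)), key=lambda i: weighted_error(codebook[i]))
    # The reconstruction becomes "prev" for the next update on both sides.
    recon = [pr + q for pr, q in zip(predicted, codebook[index])]
    return index, recon


# Hypothetical 8 x 8 grid of residuals standing in for the trained 64-entry codebook.
codebook = [[-0.7 + 0.2 * i, -17.5 + 5.0 * j] for i in range(8) for j in range(8)]

w0 = 2.0 * math.pi * 100.0 / 8000.0  # 100 Hz pitch at 8 kHz sampling (assumed units)
x = pitch_energy_features(w0, e=1.0)
index, recon = encode_predictive(x, prev=[0.0, 0.0], codebook=codebook, weights=[1.0, 1.0])
index_bits = int(math.log2(len(codebook)))  # bits per update
```

Under the unit assumption above, $\frac{W_{o}}{\pi}\cdot 4000$ is the pitch in Hz, so Eq. 1 is simply the number of octaves above 50 Hz. With one 6-bit index every 2 frames, a 4-frame packet carries 2 x 6 = 12 bits, matching the figure quoted above.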

- Decoder

In the decoder, the de-quantizer recovers the parameters from the bitstream, and these parameters condition HiFi-GAN to generate the speech signal
- To condition HiFi-GAN, CQNV uses a 23-dimensional feature made up of the LSP, energy, pitch, voicing level, and Linear Prediction Coefficients (LPC)
- Rather than feeding these features to the network directly, an additional parameter processing module is introduced
  1. The module has a 3-branch structure of convolution layers with different dilation rates, producing feature maps with different receptive field sizes
  2. First, the pitch is multiplied element-wise and passed through the module to obtain a representation
    - The pitch is processed separately from the other conditioning parameters for high-quality synthesis
  3. Next, the LSP, energy, and normalized LPC parameters are passed through the same block to obtain another representation
  4. Finally, the two representations are concatenated to condition HiFi-GAN
    - With this module, the neural vocoder can reconstruct speech effectively from speech coding parameters
- Structurally, HiFi-GAN consists of 1 generator and 2 discriminators
  1. It combines the GAN loss with 2 additional losses (feature matching loss and mel-spectrogram loss) to improve training stability and performance
  2. CQNV introduces a dynamic hyperparameter $\lambda_{FM}$ that replaces the fixed hyperparameter on the feature matching loss of the original HiFi-GAN
    - It is defined as the ratio of the mel-spectrogram loss to the feature matching loss during training
- The final objectives for the generator and discriminator are:
  (Eq. 3) $\mathcal{L}_{G}=\mathcal{L}_{adv}(G;D)+\lambda_{FM}\mathcal{L}_{FM}(G;D)+\lambda_{mel}\mathcal{L}_{mel}(G)$
  (Eq. 4) $\mathcal{L}_{D}=\mathcal{L}_{adv}(D;G)$
  - $\lambda_{mel}=45, \lambda_{FM}=\mathcal{L}_{mel}(G)/\mathcal{L}_{FM}(G;D)$
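The dynamic weighting in Eq. 3 can be illustrated with scalar loss values. This is a minimal sketch, assuming the three losses have already been computed for the current step; the function name `generator_objective` and the small `eps` guard against division by zero are additions for illustration.

```python
LAMBDA_MEL = 45.0  # fixed weight on the mel-spectrogram loss (from Eq. 3)


def generator_objective(l_adv, l_fm, l_mel, eps=1e-8):
    """Eq. 3 with lambda_FM recomputed from the current loss values."""
    lambda_fm = l_mel / (l_fm + eps)  # ratio of mel loss to feature matching loss
    total = l_adv + lambda_fm * l_fm + LAMBDA_MEL * l_mel
    return total, lambda_fm


total, lambda_fm = generator_objective(l_adv=1.0, l_fm=2.0, l_mel=0.5)
```

With this definition the weighted feature matching term always has the same magnitude as the unweighted mel-spectrogram loss. In an autograd framework the ratio would presumably be computed from detached values so that it acts as a per-step constant; that detail is an assumption here, not something stated in this review.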

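The parameter processing module described in this section can be sketched with a plain-Python dilated convolution. This is a heavily simplified illustration and not the authors' network: the dilation rates and kernel weights are hypothetical (the real ones are learned), each channel is filtered independently, the pitch is assumed to be multiplied by the voicing level, and the 21 remaining channels are assumed to be 10 LSP, 1 energy, and 10 LPC values.

```python
DILATIONS = (1, 2, 4)       # hypothetical dilation rates, one per branch
KERNEL = (0.25, 0.5, 0.25)  # hypothetical fixed weights; real ones are learned


def dilated_conv1d(seq, kernel, dilation):
    """Zero-padded 'same' 1-D convolution over time (odd kernel length)."""
    half = (len(kernel) // 2) * dilation
    padded = [0.0] * half + list(seq) + [0.0] * half
    return [
        sum(k * padded[t + j * dilation] for j, k in enumerate(kernel))
        for t in range(len(seq))
    ]


def receptive_field(kernel_size, dilation):
    """Number of input frames seen by one output frame of a single layer."""
    return 1 + (kernel_size - 1) * dilation


def three_branch(channels):
    """Run every channel through every branch and stack the feature maps."""
    return [dilated_conv1d(ch, KERNEL, d) for d in DILATIONS for ch in channels]


def parameter_processing(pitch, voicing, others):
    """pitch, voicing: per-frame lists; others: list of per-frame channels."""
    gated_pitch = [p * v for p, v in zip(pitch, voicing)]  # assumed operand
    pitch_repr = three_branch([gated_pitch])
    other_repr = three_branch(others)
    return pitch_repr + other_repr  # channel-wise concatenation


T = 8  # number of frames in this toy example
pitch = [1.0 + 0.1 * t for t in range(T)]
voicing = [1.0, 1.0, 1.0, 0.0, 0.0, 1.0, 1.0, 1.0]
others = [[0.01 * (c + t) for t in range(T)] for c in range(21)]
cond = parameter_processing(pitch, voicing, others)
```

Each branch keeps the frame rate and only changes how far along the time axis a frame can see: with a kernel of size 3, the three dilation rates used here cover 3, 5, and 9 frames respectively.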
3. Experiments

- Settings

Dataset : VCTK
Comparisons : EnCodec, Lyra, Opus, Codec2, Speex

- Results

Subjective Test
- In the MUSHRA test, CQNV achieves the best score

Comparing the mel-spectrograms of the synthesized samples, CQNV also matches the original signal most closely in the low- and medium-frequency bands

Mel-spectrogram comparison (a) Ground-Truth (b) Lyra (c) EnCodec (d) CQNV

Objective Test
- CQNV also achieves the best ViSQOL score

Ablation Study
- Removing the parameter processing module or the dynamic hyperparameter degrades performance
- That is, each component contributes to the performance of CQNV
