[Paper Review] Single-Codec: Single-Codebook Speech Codec towards High-Performance Speech Generation

feVeRin 2024. 6. 23. 11:35

Single-Codec: Single-Codebook Speech Codec towards High-Performance Speech Generation

Multi-codebook speech codecs require multi-sequence prediction, which becomes a bottleneck for efficiency and robustness
Single-Codec
- A single-codebook, single-sequence codec that uses a disentangled VQVAE to decouple speech into a time-invariant embedding and a phonetically-rich discrete sequence
- In particular, the encoder
  1. supports contextual modeling through a BLSTM module that captures temporal information, and
  2. applies a hybrid sampling module that alleviates up/downsampling distortion, plus a resampling module that encourages the discrete units to carry more phonetic information
Paper (INTERSPEECH 2024) : Paper Link

1. Introduction

Large Language Models (LLMs) have shown strong performance in text-to-speech (TTS)
- These LLM-based TTS systems treat speech synthesis as a next-token prediction problem, so they rely on a speech codec for speech tokenization and waveform reconstruction
- In particular, multi-codebook codecs such as the one used in SoundStorm are widely adopted in LLM-based TTS thanks to their strong reconstruction quality
  - BUT, LLMs are limited in stability and efficiency when they work with multi-sequence discrete representations, so a single-sequence discrete speech representation should be considered
- On the other hand, a single discrete token sequence cannot fully represent the abundant semantic and acoustic information in speech
  1. Even when a single-sequence discrete speech representation is usable, as in Tortoise-TTS, an additional diffusion model must be trained to generate mel-spectrograms from the latent embedding
    - The embedding carries additional information that compensates for the compression loss, but at the price of higher training/inference cost
  2. The recent TiCodec disentangles time-invariant information from the speech units, which greatly reduces the frame-level information that has to be encoded
    - Therefore, speech codecs should be approached from the perspective of feature disentanglement

-> Hence the paper proposes Single-Codec, a single-codebook neural speech codec

Single-Codec
- Performs compression/reconstruction on the mel-spectrogram instead of the raw waveform, which compresses speech information effectively while preserving important details
- To improve codec performance and applicability to speech synthesis, the following key components are introduced
  1. A global reference encoder that decouples time-invariant features
    - Uses a continuous global representation that captures diverse acoustic details, together with a longer reference segment, so that the single-codebook discrete units can embed sufficient phonetic information
  2. A BLSTM module for contextual modeling
    - Discovers the correlation between adjacent frames and improves speech content clustering
  3. A hybrid sampling module that alleviates up/downsampling distortion
    - Uses both convolution and pooling for downsampling, and both transposed convolution and replication for upsampling
  4. A resampling module
    - Helps the encoder extract more phonetics-relevant information, with lower short-time variance, from the acoustic sequence
< Overview of Single-Codec >

A VQVAE-based single-codebook, single-sequence codec that decouples speech into a time-invariant embedding and a phonetically-rich discrete sequence
Introduces a reference encoder, BLSTM, hybrid sampling, and resampling modules
As a result, it outperforms the existing neural codecs it is compared against (EnCodec, TiCodec)

2. Method

- Architecture of Single-Codec

The Single-Codec architecture builds on a Vector Quantized Variational AutoEncoder (VQVAE) with mel-spectrogram input and reconstruction, as shown in the paper's architecture figure
- That is, it uses a Conformer-based encoder that encodes a mel-spectrogram segment $seg_2$ into a latent content representation $c$, together with a Vector Quantizer (VQ) for vector quantization
- A convolution-based decoder then reconstructs the mel-spectrogram $\widetilde{seg}_2$ from the quantized representation $c$
- In addition, a discriminator is introduced to improve generation quality, and a BigVGAN vocoder reconstructs the waveform from the codec output
- For a high-quality single-codebook codec, the paper introduces the following four modules
  1. A reference encoder that decouples the time-invariant information of speech from a mel-spectrogram segment $seg_1$ and produces a global representation $g$
  2. A hybrid sampling module that alleviates sampling loss
  3. A BLSTM module and a resampling module that enhance contextual information and phonetics-relevant information
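
The quantization step in the architecture above can be illustrated with a minimal nearest-neighbour lookup against a single codebook. This is only a sketch in plain Python: the function name `quantize`, the 2-dimensional toy vectors, and the 3-entry codebook are assumptions for illustration and are not taken from the paper.

```python
def quantize(frames, codebook):
    """Map each latent frame to its nearest codebook entry (squared L2 distance).

    Returns one index per frame (the single token sequence) and the
    corresponding quantized vectors that a decoder would consume.
    """
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    indices = [
        min(range(len(codebook)), key=lambda k: sq_dist(frame, codebook[k]))
        for frame in frames
    ]
    quantized = [codebook[k] for k in indices]
    return indices, quantized


# Toy example: 3 latent frames, a codebook with 3 entries (illustrative values).
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]]
latents = [[0.1, -0.1], [0.9, 1.2], [-0.8, 0.7]]
indices, quantized = quantize(latents, codebook)
print(indices)  # [0, 1, 2]
```

With a single codebook, each frame yields exactly one index, so the output is one token sequence rather than several parallel ones.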

- Reference Encoder

Speech contains various kinds of information, such as time-variant content, time-invariant timbre, and the acoustic environment
- BUT, for a single-codebook codec it is difficult to compress all of this information into a limited set of discrete units
  - Therefore, Single-Codec decouples the global information that is almost invariant across all frames of an utterance, such as timbre and acoustic environment, and discretizes the speech content into codes
- A reference encoder is adopted to obtain the global representation $g$ related to timbre and acoustic environment
  1. The input of the reference encoder is a segment $seg_1$ randomly selected from the input utterance
  2. The length of the reference input segment $seg_1$ is set to 600 frames, and the length of the codec encoder input segment $seg_2$ is set to 200 frames
    - The shorter segment $seg_2$ reduces computation and memory overhead, while the longer segment $seg_1$ helps to obtain robust global features
  3. Finally, the reference encoder output $g$ passes through different linear layers and is then fed to the codec encoder and decoder
    - Here it is subtracted from the output of the encoder blocks and added to the input of the decoder blocks
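
The segment selection and the subtract/add use of $g$ described above can be sketched as follows. The 600/200 frame lengths come from the text; everything else is an assumption for illustration: the helper names (`random_segment`, `remove_global`, `add_global`), the fallback for utterances shorter than the window, and the toy vectors standing in for the linearly projected versions of $g$.

```python
import random

REF_FRAMES = 600  # length of seg1 (reference encoder input), from the text
ENC_FRAMES = 200  # length of seg2 (codec encoder input), from the text


def random_segment(num_frames, seg_len, rng):
    """Pick a random (start, end) window of seg_len frames.

    Assumption: utterances shorter than seg_len are used in full.
    """
    if num_frames <= seg_len:
        return 0, num_frames
    start = rng.randrange(num_frames - seg_len + 1)
    return start, start + seg_len


def remove_global(encoder_out, g_enc):
    """Subtract the projected global vector from every encoder frame."""
    return [[x - g for x, g in zip(frame, g_enc)] for frame in encoder_out]


def add_global(decoder_in, g_dec):
    """Add the projected global vector back to every decoder input frame."""
    return [[x + g for x, g in zip(frame, g_dec)] for frame in decoder_in]


rng = random.Random(0)
ref_start, ref_end = random_segment(1000, REF_FRAMES, rng)
enc_start, enc_end = random_segment(1000, ENC_FRAMES, rng)
print(ref_end - ref_start, enc_end - enc_start)  # 600 200
```

Subtracting on the encoder side and adding on the decoder side means the quantizer only has to represent what is left after the utterance-level information has been taken out.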

- BLSTM Module

Codecs are generally trained on large-scale speech data to ensure generalization, and the resulting diversity of speech content makes a single-codebook codec hard to use
- Therefore, unlike EnCodec, which introduces LSTM-based sequence modeling, the paper adds BLSTM modules before and after the quantizer to enhance contextual information
- As a result, the BLSTM modules improve speech content modeling and make it easier to form stable clustering centers
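
What "bidirectional" buys over a forward-only LSTM can be shown with a toy scalar LSTM run in both directions. This is a sketch, not the paper's module: hidden size 1, hand-picked weights, and the names `lstm_pass` and `blstm` are all assumptions made for illustration.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_pass(xs, w):
    """Run a scalar LSTM (hidden size 1) over xs and return the hidden state per step."""
    h, c, out = 0.0, 0.0, []
    for x in xs:
        i = sigmoid(w["wi"] * x + w["ui"] * h)    # input gate
        f = sigmoid(w["wf"] * x + w["uf"] * h)    # forget gate
        o = sigmoid(w["wo"] * x + w["uo"] * h)    # output gate
        g = math.tanh(w["wg"] * x + w["ug"] * h)  # candidate cell value
        c = f * c + i * g
        h = o * math.tanh(c)
        out.append(h)
    return out


def blstm(xs, w_fwd, w_bwd):
    """Pair a left-to-right pass with a right-to-left pass for every frame."""
    fwd = lstm_pass(xs, w_fwd)
    bwd = lstm_pass(xs[::-1], w_bwd)[::-1]
    return list(zip(fwd, bwd))


# Illustrative weights only; a trained model would learn these.
W = dict(wi=0.5, ui=0.1, wf=0.5, uf=0.1, wo=0.5, uo=0.1, wg=1.0, ug=0.1)
a = blstm([0.1, 0.2, 0.3], W, W)
b = blstm([0.1, 0.2, 0.9], W, W)  # only the last frame differs
print(a[0][0] == b[0][0])  # True: the forward state at frame 0 cannot see the future
print(a[0][1] == b[0][1])  # False: the backward state at frame 0 reflects the changed last frame
```

The backward pass is what lets the representation of an early frame depend on later frames, which is the contextual information the module is meant to add around the quantizer.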

- Hybrid Sampling Module

Neural codecs use sampling modules to shorten the sequence length of the discrete representation
- In existing codecs, up/downsampling is implemented with either convolution/transposed convolution or pooling/repeat
  - These sampling operations inevitably incur sampling loss, which degrades overall encoding/decoding performance
- Therefore, following MR-HuBERT, Single-Codec introduces a hybrid sampling module that downsamples with both convolution and pooling and upsamples with both transposed convolution and replication
  - By combining several sampling methods, the hybrid sampling module alleviates sampling distortion
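
A 1-D toy version of hybrid sampling might look like the sketch below. The review does not say how the two paths are merged, so summing them is an assumption here, as are the function names, the stride of 2, and the fixed kernel (a real module would learn its kernels).

```python
def conv_down(x, kernel, stride):
    """Strided 1-D convolution without padding."""
    k = len(kernel)
    return [
        sum(w * v for w, v in zip(kernel, x[i:i + k]))
        for i in range(0, len(x) - k + 1, stride)
    ]


def avg_pool(x, stride):
    """Average pooling with window size equal to the stride."""
    return [sum(x[i:i + stride]) / stride for i in range(0, len(x) - stride + 1, stride)]


def hybrid_down(x, kernel, stride=2):
    # Assumes len(kernel) == stride so both paths produce the same length.
    return [a + b for a, b in zip(conv_down(x, kernel, stride), avg_pool(x, stride))]


def transposed_conv_up(x, kernel, stride):
    """Transposed 1-D convolution; with len(kernel) == stride the windows do not overlap."""
    out = [0.0] * (len(x) * stride)
    for i, v in enumerate(x):
        for j, w in enumerate(kernel):
            out[i * stride + j] += w * v
    return out


def replicate_up(x, stride):
    """Repeat every frame `stride` times."""
    return [v for v in x for _ in range(stride)]


def hybrid_up(x, kernel, stride=2):
    return [a + b for a, b in zip(transposed_conv_up(x, kernel, stride), replicate_up(x, stride))]


kernel = [0.5, -0.5]  # illustrative fixed kernel
down = hybrid_down([1.0, 3.0, 2.0, 4.0], kernel)
print(down)                     # [1.0, 2.0]
print(hybrid_up(down, kernel))  # [1.5, 0.5, 3.0, 1.0]
```

In this sketch the pooling/replication path carries a plain, parameter-free copy of the signal, while the convolutional path adds a kernel-dependent correction on top of it.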

- Resampling Module

A single-codebook speech codec aims to extract short-term invariant speech units from the acoustic representation
- BUT, the diversity of acoustic representations makes it difficult to learn the codebook vectors
- Therefore, the paper adopts a resampling module that downsamples the input features for local modeling and applies a residual connection after upsampling
  - This creates a bottleneck along the time axis, so the encoder can extract more phonetics-relevant information, with lower short-time variance, from the acoustic sequence
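
The downsample, local modeling, upsample, residual pattern described above can be sketched on a 1-D sequence. The concrete choices are assumptions for illustration only: average pooling for downsampling, replication for upsampling, a 3-tap mean (`smooth3`) as a stand-in for the local model, and the name `resample_block`.

```python
def avg_pool(x, stride):
    """Downsample along time by averaging non-overlapping windows."""
    return [sum(x[i:i + stride]) / stride for i in range(0, len(x), stride)]


def smooth3(x):
    """Hypothetical local model: 3-tap mean with edge clamping."""
    n = len(x)
    return [(x[max(i - 1, 0)] + x[i] + x[min(i + 1, n - 1)]) / 3 for i in range(n)]


def replicate_up(x, stride):
    """Upsample along time by repeating every frame."""
    return [v for v in x for _ in range(stride)]


def resample_block(x, stride=2, local=smooth3):
    """Return x + upsample(local(downsample(x)))."""
    if len(x) % stride != 0:
        raise ValueError("sequence length must be divisible by stride")
    low = local(avg_pool(x, stride))  # time-axis bottleneck
    up = replicate_up(low, stride)    # back to the original length
    return [a + b for a, b in zip(x, up)]  # residual connection


# With an identity local model the low-rate branch is easy to read off.
print(resample_block([1.0, 3.0, 2.0, 4.0, 6.0, 2.0], local=lambda s: s))
# [3.0, 5.0, 5.0, 7.0, 10.0, 6.0]
```

The low-rate branch cannot represent frame-to-frame fluctuations inside a window, which is one way to read the "lower short-time variance" remark in the text.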

3. Experiments

- Settings

Dataset : LibriTTS, HiFi-TTS, VCTK, AISHELL-1, AISHELL-3
Comparisons : EnCodec, TiCodec

- Results

Ablation Study
- The following ablations of Single-Codec are compared
  - VQVAE : a basic VQVAE codec
  - Ref-short : VQVAE with a reference encoder that takes a short 200-frame segment as input
  - Ref-long : VQVAE with a reference encoder that takes a long 600-frame segment as input
  - Ref-BLSTM : Ref-long with the BLSTM module
  - Ref-HybSam : Ref-long with the hybrid sampling module
  - Ref-BLSTM-HybSam : Ref-long with both BLSTM and hybrid sampling
  - Ref-BLSTM-HybSam-Conf : Ref-BLSTM-HybSam with a Conformer-based encoder in place of the resampling module
- First, Ref-long/Ref-short perform better than the plain VQVAE
  - That is, decoupling global information is effective for a single-codebook codec
- Ref-long outperforms Ref-short in terms of reconstruction and speaker similarity
- Ref-BLSTM-HybSam, Ref-BLSTM, and Ref-HybSam all achieve higher reconstruction performance than Ref-long
- Ref-BLSTM-HybSam-Conf reaches the level of Ref-BLSTM-HybSam, but the best performance is obtained with the resampling module, as in the full Single-Codec
Speech Reconstruction Evaluation
- Overall, Single-Codec outperforms the existing EnCodec and TiCodec

Commitment Loss Analysis
- The commitment loss of the simple VQVAE tends to diverge
  - The entanglement of time-invariant global information and time-variant content information hinders the formation of content-related speech units
- In contrast, Ref-short and Ref-long, which model the time-invariant information separately, show a much smaller degree of divergence
- Applying BLSTM and hybrid sampling, as in Ref-BLSTM-HybSam, lets the model converge more stably
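
For reference, the commitment term in the original VQ-VAE formulation is $\beta \lVert z_e(x) - \mathrm{sg}[e] \rVert_2^2$, where $z_e(x)$ is the encoder output, $e$ is its nearest codebook vector, and $\mathrm{sg}$ is the stop-gradient operator. The sketch below computes its forward value on toy data. The weight `beta=0.25` is the value commonly used with VQ-VAE and is an assumption here, since this review does not state the weighting used in Single-Codec; the function name is also hypothetical.

```python
def commitment_loss(latents, codebook, beta=0.25):
    """Mean over frames of beta * ||z_e - e||^2, with e the nearest codebook vector.

    The stop-gradient on e only affects backpropagation, so it does not
    appear in this forward-value computation.
    """
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    total = 0.0
    for z in latents:
        nearest = min(codebook, key=lambda e: sq_dist(z, e))
        total += sq_dist(z, nearest)
    return beta * total / len(latents)


codebook = [[0.0, 0.0], [1.0, 1.0]]
latents = [[0.0, 0.0], [1.0, 2.0]]  # second frame is 1.0 away (squared) from [1, 1]
print(commitment_loss(latents, codebook))  # 0.125
```

A value that keeps growing during training means the encoder outputs are drifting away from the codebook entries they are assigned to, which is the divergence discussed above.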

Zero-shot TTS Evaluation
- Each codec is plugged into VALL-E to compare zero-shot TTS performance
- Here too, Single-Codec yields the best synthesis performance among the compared codecs
