Say More with Less: Variable-Frame-Rate Speech Tokenization via Adaptive Clustering and Implicit Duration Coding
Existing speech tokenizers assign a fixed number of tokens per second regardless of information density or temporal fluctuation, which mismatches the intrinsic structure of speech. VARSTok: introduces Temporal-Aware Density Peak Clustering to adaptively segment speech into variable-length units. Content and temporal span into a single token in..

Scaling Transformers for Low-Bitrate High-Quality Speech Coding
Existing speech tokenization models mostly focus on low parameter-count architectures built from components with strong inductive biases. TAAE: scales the tokenization model using a Transformer architecture with a large parameter count, and introduces a Finite Scalar Quantization-based bottleneck to improve speech quality at low bitrates. Paper (ICLR 2025): Paper Link. 1. Introduction SoundStre..

Variable Bitrate Residual Vector Quantization for Audio Coding
Neural audio codecs are suboptimal in terms of the rate-distortion trade-off. VRVQ: supports efficient coding by adapting the number of codebooks used per frame, and introduces a gradient estimation method for the non-differentiable masking operation that transforms the importance map into a binary importance mask. Paper (ICASSP 2025): Paper Link. 1. Introduction Recently, SoundStream, EnCodec, DAC, and other Residual Ve..

PURE Codec: Progressive Unfolding of Residual Entropy for Speech Codec Learning
Neural speech codecs are limited in reconstruction quality by Residual Vector Quantization. PURE Codec: uses a pre-trained speech enhancement model to guide multi-stage quantization. The first stage reconstructs a low-entropy, denoised speech embedding, and the second stage encodes the residual high-entropy components. Paper (ASRU 2025): Paper Link. 1. I..

Language-Codec: Bridging Discrete Codec Representations and Speech Language Models
Discrete acoustic codecs are used as intermediate representations in speech language models. Language-Codec: introduces Masked Channel Residual Vector Quantization to address the problem of excessive information in the initial codebook, and additionally applies a Fourier transform structure, attention blocks, and a refined discriminator. Paper (ACL 2025): Paper Link. 1. Introduction VALL-E..

SemantiCodec: An Ultra Low Bitrate Semantic Audio Codec for General Sound
Most neural codecs operate at high bitrates and cover only a narrow domain. SemantiCodec: compresses diverse domains such as speech, general sound, and music into 100 tokens per second or fewer, using a dual-encoder architecture that consists of a self-supervised pre-trained Audio Masked AutoEncoder, discretized via $k$-means clustering, and an acoustic encoder. Paper (JSTSP 2024): Paper Link. 1. Intro..
