FELLE: Autoregressive Speech Synthesis with Token-wise Coarse-to-Fine Flow Matching
Language modeling and flow matching can be integrated.
FELLE
- Predicts continuous-valued tokens by combining the autoregressive nature of language models with the generative efficacy of flow matching (see the sketch below)
- Additionally improves speech quality through a coarse-to-fine flow matching mechanism
Paper (MM 2025): Paper Link
1. Introduction
Large Language Models such as VALL-E and VALL-E2 ..
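For readers unfamiliar with flow matching, the sketch below shows a minimal conditional flow-matching training step for a single continuous token, conditioned on an autoregressive LM hidden state. The `TokenFlowHead` class, all sizes, and the stand-in tensors are illustrative assumptions; FELLE's actual architecture and its coarse-to-fine mechanism are not reproduced here.

```python
# Minimal sketch of a conditional flow-matching step for one continuous token.
# Module names and dimensions are assumptions, not the paper's architecture.
import torch
import torch.nn as nn

class TokenFlowHead(nn.Module):
    """Predicts the flow-matching velocity field v(x_t, t | h) for one token."""
    def __init__(self, token_dim: int, cond_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(token_dim + cond_dim + 1, hidden), nn.SiLU(),
            nn.Linear(hidden, token_dim),
        )

    def forward(self, x_t, t, h):
        return self.net(torch.cat([x_t, h, t], dim=-1))

def flow_matching_loss(head, x1, h):
    """Standard (linear-path) CFM loss: the field should match x1 - x0."""
    x0 = torch.randn_like(x1)            # noise sample
    t = torch.rand(x1.size(0), 1)        # random time in [0, 1]
    x_t = (1 - t) * x0 + t * x1          # point on the interpolation path
    target = x1 - x0                     # constant velocity target
    return ((head(x_t, t, h) - target) ** 2).mean()

head = TokenFlowHead(token_dim=80, cond_dim=512)
x1 = torch.randn(8, 80)   # ground-truth continuous tokens (e.g. mel frames)
h = torch.randn(8, 512)   # autoregressive LM hidden states (stand-in)
loss = flow_matching_loss(head, x1, h)
loss.backward()
```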
PAST: Phonetic-Acoustic Speech Tokenizer
Signal reconstruction and phonetic information can be modeled jointly.
PAST
- Integrates domain knowledge into the tokenization process through auxiliary tasks on supervised phonetic data, without any pre-trained self-supervised model (a sketch follows below)
- Additionally adopts a streamable architecture for real-time applications
Paper (INTERSPEECH 2025): Paper Link
1. Introduction
Speech language models generally use acoustic toke..
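A minimal sketch of the general training idea, assuming a toy convolutional codec: an auxiliary phoneme-prediction head on the encoder latents is trained jointly with waveform reconstruction, so supervised phonetic labels shape the tokens. `TinyCodec`, all layer sizes, and the frame-aligned labels are hypothetical stand-ins, not PAST's actual architecture or losses.

```python
# Joint reconstruction + auxiliary phoneme objective on a toy codec.
# Everything here is a simplified stand-in for illustration.
import torch
import torch.nn as nn

class TinyCodec(nn.Module):
    def __init__(self, n_phonemes: int = 40, dim: int = 64):
        super().__init__()
        self.encoder = nn.Conv1d(1, dim, kernel_size=320, stride=320)  # ~20ms frames
        self.decoder = nn.ConvTranspose1d(dim, 1, kernel_size=320, stride=320)
        self.phoneme_head = nn.Linear(dim, n_phonemes)  # auxiliary phonetic task

    def forward(self, wav):
        z = self.encoder(wav)                                 # (B, dim, T_frames)
        recon = self.decoder(z)                               # reconstruction branch
        phone_logits = self.phoneme_head(z.transpose(1, 2))   # phonetic branch
        return recon, phone_logits

codec = TinyCodec()
wav = torch.randn(4, 1, 16000)            # 1s of 16kHz audio
phonemes = torch.randint(0, 40, (4, 50))  # frame-aligned phoneme labels (assumed)
recon, logits = codec(wav)
loss = nn.functional.mse_loss(recon, wav) \
     + nn.functional.cross_entropy(logits.reshape(-1, 40), phonemes.reshape(-1))
loss.backward()
```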
REWIND: Speech Time Reversal for Enhancing Speaker Representations in Diffusion-based Voice Conversion
Time-reversed speech retains the tonal patterns used for speaker identification.
REWIND
- Introduces an augmentation strategy that exploits speaker representations learned from time-reversed speech (see the sketch below)
- Applied to a diffusion-based voice conversion model, preserving the speaker's unique vocal traits while minimizing interference from linguistic content
Paper (INTERSP..
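The core augmentation is easy to sketch: time-reverse the waveform, which scrambles linguistic content while (per the paper's premise) keeping speaker-discriminative tonal patterns, then encourage a speaker encoder to agree across the two views. The encoder below is a hypothetical stand-in, and the agreement objective is illustrative rather than the paper's exact loss.

```python
# Time-reversal augmentation for speaker representations (illustrative).
import torch
import torch.nn as nn

def time_reverse(wav: torch.Tensor) -> torch.Tensor:
    """Reverse the waveform along the time axis (last dimension)."""
    return torch.flip(wav, dims=[-1])

speaker_encoder = nn.Sequential(   # stand-in for a real speaker encoder
    nn.Conv1d(1, 32, 400, stride=160), nn.ReLU(),
    nn.AdaptiveAvgPool1d(1), nn.Flatten(), nn.Linear(32, 192),
)

wav = torch.randn(4, 1, 16000)
emb_fwd = speaker_encoder(wav)                 # embedding of normal speech
emb_rev = speaker_encoder(time_reverse(wav))   # embedding of reversed speech
# Training would push same-speaker embeddings of the two views together,
# e.g. via a cosine or contrastive objective:
agreement = nn.functional.cosine_similarity(emb_fwd, emb_rev).mean()
```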
Factorized RVQ-GAN for Disentangled Speech Tokenization
A neural codec that factorizes its bottleneck can be constructed.
HAC
- Builds a knowledge distillation objective from a pre-trained speech encoder for phoneme-level structure and a text-based encoder for lexical cues (a sketch follows below)
- Produces disentangled token sets for phoneme alignment and word-level semantics through the factorized bottleneck
Paper (INTERSPEECH 2025): Paper Link
1. Introduction
Neural Sp..
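A heavily simplified sketch of a factorized bottleneck with dual distillation: the codec latent is split through two heads, one distilled toward a frozen speech-SSL teacher (phoneme-level structure) and one toward a text-based teacher (lexical cues). Teachers are mocked with random tensors, quantization is omitted, and the cosine distillation loss is an assumption, not HAC's exact objective.

```python
# Factorized bottleneck with two distillation targets (illustrative).
import torch
import torch.nn as nn

dim, T = 128, 50
latent = torch.randn(4, T, dim)               # codec encoder output (stand-in)

# Factorized bottleneck: two heads carve the latent into separate streams.
phonetic_head = nn.Linear(dim, 256)           # matched to speech-SSL teacher dim
lexical_head = nn.Linear(dim, 512)            # matched to text-encoder teacher dim

speech_teacher_feats = torch.randn(4, T, 256)  # e.g. frozen SSL features (mocked)
text_teacher_feats = torch.randn(4, T, 512)    # e.g. aligned text features (mocked)

def distill(student, teacher):
    """Cosine-distance distillation against a frozen teacher."""
    return (1 - nn.functional.cosine_similarity(student, teacher, dim=-1)).mean()

loss = distill(phonetic_head(latent), speech_teacher_feats) \
     + distill(lexical_head(latent), text_teacher_feats)
loss.backward()
```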
LiteASR: Efficient Automatic Speech Recognition with Low-Rank Approximation
Automatic speech recognition models are computationally intensive due to their encoder-decoder architecture.
LiteASR
- Applies low-rank compression to the encoder, cutting inference cost while maintaining transcription accuracy
- Uses a small calibration dataset with Principal Component Analysis to approximate linear transformations as low-rank matrix multiplication chains (see the sketch below) ..
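The PCA step is concrete enough to sketch: collect calibration activations for a linear layer, take their top-k principal directions, and replace the single d_in→d_out matmul with two smaller matmuls through that subspace. Shapes, the rank, and the random calibration data are illustrative; real activations concentrate in a low-dimensional subspace, so the relative error would be far smaller than in this random demo.

```python
# PCA-based low-rank replacement of one linear layer (illustrative shapes).
import torch

d_in, d_out, rank, n_calib = 512, 512, 64, 1000
W = torch.randn(d_out, d_in) / d_in ** 0.5   # original weight (stand-in)
X = torch.randn(n_calib, d_in)               # calibration activations (stand-in)

# PCA on calibration inputs: top-k principal directions of X.
X_centered = X - X.mean(dim=0)
_, _, Vt = torch.linalg.svd(X_centered, full_matrices=False)
P = Vt[:rank]                                # (rank, d_in) projection basis

# Low-rank chain: x -> P x -> (W P^T)(P x), i.e. two small matmuls.
# Cheaper whenever rank * (d_in + d_out) < d_in * d_out.
A = P                                        # (rank, d_in)
B = W @ P.T                                  # (d_out, rank)

x = torch.randn(8, d_in)
y_full = x @ W.T                             # original layer
y_lowrank = (x @ A.T) @ B.T                  # approximated layer
print((y_full - y_lowrank).norm() / y_full.norm())  # relative error
```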
AxLSTMs: Learning Self-Supervised Audio Representations with xLSTMs
xLSTM achieves performance comparable to Transformers.
AxLSTM
- Learns general-purpose audio representations from masked spectrogram patches using xLSTM in a self-supervised setting (a sketch follows below)
- Pre-trains on the AudioSet dataset to cover a range of downstream tasks
Paper (INTERSPEECH 2025): Paper Link
1. Introduction
Transformers have strong generalization ability and a data-agnostic nature, but scaled dot-pr..
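A sketch of masked spectrogram-patch pre-training in this spirit, using a plain `nn.LSTM` as a stand-in for the xLSTM block (whose implementations vary): mask a random subset of patch embeddings and reconstruct the masked patches. All dimensions and the masking ratio are assumptions.

```python
# Masked spectrogram-patch modeling with a recurrent encoder (illustrative).
import torch
import torch.nn as nn

B, T, patch_dim, hidden = 4, 32, 256, 384   # batch, patches, dims (assumed)
patches = torch.randn(B, T, patch_dim)      # flattened spectrogram patches

embed = nn.Linear(patch_dim, hidden)
encoder = nn.LSTM(hidden, hidden, num_layers=2, batch_first=True)  # xLSTM stand-in
decoder = nn.Linear(hidden, patch_dim)
mask_token = nn.Parameter(torch.zeros(hidden))

mask = torch.rand(B, T) < 0.5               # mask ~50% of patches (assumed ratio)
tokens = embed(patches)
tokens = torch.where(mask.unsqueeze(-1),    # swap masked positions for the token
                     mask_token.expand(B, T, hidden), tokens)
out, _ = encoder(tokens)
recon = decoder(out)

# Reconstruction loss only on masked positions.
loss = ((recon - patches)[mask] ** 2).mean()
loss.backward()
```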
