E3-TTS: Easy End-to-End Diffusion-based Text to Speech
- High-fidelity speech can be obtained with an end-to-end diffusion-based text-to-speech model
- E3-TTS takes plain text as input and generates the waveform through an iterative refinement process
- In particular, it does not rely on intermediate representations such as spectrogram features or alignment information
- Paper (ASRU 2023): Paper Link
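The iterative refinement mentioned above is the standard diffusion sampling loop: start from Gaussian noise and repeatedly denoise toward a waveform. A minimal NumPy sketch with a placeholder denoiser (all names and the toy schedule are illustrative, not from the paper):

```python
import numpy as np

def refine_waveform(denoise_fn, text_cond, n_steps=50, length=16000, seed=0):
    """DDPM-style iterative refinement: start from noise, repeatedly denoise.

    denoise_fn(y, t, cond) stands in for the learned network that predicts
    the noise component at step t, conditioned on the text.
    """
    rng = np.random.default_rng(seed)
    betas = np.linspace(1e-4, 0.02, n_steps)           # toy noise schedule
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)

    y = rng.standard_normal(length)                    # y_T ~ N(0, I)
    for t in reversed(range(n_steps)):
        eps_hat = denoise_fn(y, t, text_cond)          # predicted noise
        coef = betas[t] / np.sqrt(1.0 - alpha_bars[t])
        y = (y - coef * eps_hat) / np.sqrt(alphas[t])  # posterior mean
        if t > 0:                                      # add noise except at t = 0
            y = y + np.sqrt(betas[t]) * rng.standard_normal(length)
    return y

# toy denoiser that predicts zeros, standing in for the trained network
wave = refine_waveform(lambda y, t, c: np.zeros_like(y),
                       text_cond=None, n_steps=10, length=256)
```

The point of the sketch is only the control flow: the model is applied once per step, so fewer steps trade quality for speed.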
E2-TTS: Embarrassingly Easy Fully Non-Autoregressive Zero-Shot TTS
- A zero-shot text-to-speech model with high speaker similarity and intelligibility is needed
- E2-TTS converts the text input into a character sequence with filler tokens
- It trains a flow-matching-based mel-spectrogram generator on an audio-infilling task, removing the dependency on additional components such as a duration model
- Paper (SLT 2024): Paper Link
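A flow-matching-based generator is trained by regressing a velocity field along a simple noise-to-data path. A minimal sketch of the conditional flow-matching training pair (linear path, constant-velocity target; variable names are illustrative):

```python
import numpy as np

def cfm_training_pair(x1, rng):
    """Sample one conditional flow-matching training example.

    Path: x_t = (1 - t) * x0 + t * x1 with x0 ~ N(0, I).
    The regression target for the network is the constant velocity x1 - x0.
    """
    x0 = rng.standard_normal(x1.shape)      # noise endpoint of the path
    t = rng.uniform()                       # random time in [0, 1]
    x_t = (1.0 - t) * x0 + t * x1           # point on the probability path
    v_target = x1 - x0                      # velocity the model should predict
    return t, x_t, v_target

rng = np.random.default_rng(0)
mel = rng.standard_normal((80, 100))        # toy mel-spectrogram target
t, x_t, v = cfm_training_pair(mel, rng)
loss = float(np.mean((v - np.zeros_like(v)) ** 2))  # MSE vs. a dummy prediction
```

In the infilling setup, the network additionally sees the masked audio and the filler-token character sequence as conditioning, which the sketch omits.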
Eta-WavLM: Efficient Speaker Identity Removal in Self-Supervised Speech Representations Using a Simple Linear Equation
- Existing self-supervised learning models do not fully disentangle speaker identity
- Eta-WavLM linearly decomposes the self-supervised learning representation into speaker-specific and speaker-independent components
- It then produces a speaker-disentangled representation from the linearly decomposed features
- Paper (ACL 2025)
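One way to realize such a linear decomposition is an ordinary least-squares fit: regress the SSL features on speaker embeddings and keep the residual as the speaker-independent part. A sketch under that assumption (not necessarily the paper's exact formulation):

```python
import numpy as np

def remove_speaker_component(E, S):
    """Least-squares decomposition E ≈ S @ A + b.

    E: (n, d) SSL features; S: (n, k) speaker embeddings for the same frames.
    The residual E - (S @ A + b) is the speaker-independent component.
    """
    S1 = np.hstack([S, np.ones((S.shape[0], 1))])  # append bias column
    W, *_ = np.linalg.lstsq(S1, E, rcond=None)     # solve for stacked [A; b]
    residual = E - S1 @ W                          # speaker-removed features
    return residual, W

rng = np.random.default_rng(0)
S = rng.standard_normal((200, 16))                 # toy speaker embeddings
E = S @ rng.standard_normal((16, 32)) + 0.1 * rng.standard_normal((200, 32))
residual, W = remove_speaker_component(E, S)
```

By the normal equations, the residual is exactly orthogonal to the speaker-embedding columns, which is the sense in which speaker information is "linearly removed".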
F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching
- A fully non-autoregressive text-to-speech system can be built on a Diffusion Transformer
- F5-TTS models the input with ConvNeXt to refine the text representation and ensure easier alignment
- It applies Sway Sampling to the flow-matching-based model to support effective training/inference
- Paper (ACL 2025): Paper Link
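Sway Sampling warps the otherwise uniform flow-step schedule. A sketch assuming the monotone warp f(u) = u + s(cos(πu/2) − 1 + u) as I read it from the paper, where s < 0 concentrates steps near t = 0:

```python
import numpy as np

def sway_schedule(n_steps, s=-1.0):
    """Warp uniform flow times u in [0, 1] with
    f(u) = u + s * (cos(pi/2 * u) - 1 + u).

    s < 0 pushes the sampling steps toward t = 0, spending more of the
    ODE-solver budget in the early, noise-dominated part of the trajectory.
    """
    u = np.linspace(0.0, 1.0, n_steps)
    return u + s * (np.cos(np.pi / 2.0 * u) - 1.0 + u)

t = sway_schedule(8, s=-1.0)   # denser near 0 than a uniform grid
```

The warp fixes the endpoints (f(0) = 0, f(1) = 1) and only redistributes the interior steps, so it plugs into any fixed-step ODE sampler unchanged.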
ALMTokenizer: A Low-Bitrate and Semantic-Rich Audio Codec Tokenizer for Audio Language Modeling
- Audio tokens play an important role in audio language modeling
- ALMTokenizer introduces a Query-based Compression Strategy that explicitly models inter-frame context information and captures holistic information through a set of learnable query tokens
- To enhance the semantic information, it adopts a Masked AutoEncoder, semantic prior-based Vector Quantization, Aut..
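The vector quantization step at the core of any codec tokenizer maps each continuous encoder frame to its nearest codebook entry. A minimal nearest-neighbor VQ sketch (plain VQ, not the paper's semantic prior-based variant; names are illustrative):

```python
import numpy as np

def vector_quantize(z, codebook):
    """Nearest-neighbor vector quantization.

    z: (n, d) encoder outputs; codebook: (K, d) learned code vectors.
    Returns the discrete token index per frame and the quantized vectors.
    """
    # squared distance between every frame and every code: (n, K)
    d2 = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    idx = d2.argmin(axis=1)          # token id per frame
    return idx, codebook[idx]        # ids and their quantized embeddings

rng = np.random.default_rng(0)
codebook = rng.standard_normal((64, 8))   # K = 64 codes of dimension 8
z = rng.standard_normal((10, 8))
idx, z_q = vector_quantize(z, codebook)
```

The returned `idx` sequence is what an audio language model actually consumes; the bitrate is log2(K) bits per frame, which is why low-bitrate codecs fight to keep K and the frame rate small.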
EmoReg: Directional Latent Vector Modeling for Emotional Intensity Regularization in Diffusion-based Voice Conversion
- Emotional voice conversion aims to convert the source emotion into a given target while preserving the linguistic content
- EmoReg leverages self-supervised learning-based feature representations to control the emotion intensity
- It additionally performs Unsupervised Directional Latent Vector Modeling in the emotional embedding space
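The basic idea behind directional latent vector modeling is that emotion intensity corresponds to how far an embedding is moved along a direction in the latent space. A toy sketch with class-mean directions (a simplification of, not the paper's exact unsupervised procedure):

```python
import numpy as np

def adjust_intensity(emb, neutral_mean, target_mean, alpha):
    """Move an embedding along the neutral-to-target emotion direction.

    alpha acts as an intensity knob: 0 leaves the embedding neutral,
    1 applies the full emotion shift. The direction is simply the
    difference of the two class means in the embedding space.
    """
    direction = target_mean - neutral_mean  # emotion direction in latent space
    return emb + alpha * direction

neutral = np.zeros(4)                       # toy class means
target = np.ones(4)
e = np.full(4, 0.5)
e_mild = adjust_intensity(e, neutral, target, 0.3)
e_full = adjust_intensity(e, neutral, target, 1.0)
```

Because the shift is linear, intermediate alpha values interpolate intensity smoothly, which is what makes the knob usable for regularization.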
TTS-Transducer: End-to-End Speech Synthesis with Neural Transducer
- A neural transducer can be leveraged for text-to-speech
- TTS-Transducer uses a transducer architecture to learn a monotonic alignment between the tokenized text and the first codebook of the speech codec tokens
- Based on a non-autoregressive Transformer, it predicts the remaining codes using the alignment extracted from the transducer loss
- Paper (ICASSP 2025): Paper Link
SD-HuBERT: Sentence-Level Self-Distillation Induces Syllabic Organization in HuBERT
- Syllabic organization can emerge by learning sentence-level representations of speech
- SD-HuBERT fine-tunes pre-trained HuBERT with an aggregator token that summarizes the entire speech
- It draws out salient syllabic structure using a self-distillation objective without supervision
- Additionally, it uses the Spoken Speech ABX benchmark for sentence-level representations
M2R-Whisper: Multi-Stage and Multi-Scale Retrieval Augmentation for Enhancing Whisper
- Whisper struggles to accurately recognize diverse subdialects
- M2R-Whisper introduces In-Context Learning and Retrieval-Augmented techniques into Whisper
- It applies sentence-level in-context learning at the pre-processing stage and token-level $k$-Nearest Neighbor retrieval at the post-processing stage
- Paper (ICASSP 2025): Paper Link
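Token-level $k$-nearest-neighbor post-processing is typically done in the spirit of kNN-LM: build a distribution from retrieved datastore neighbors and interpolate it with the model's own token distribution. A hedged sketch (the interpolation weight `lam`, temperature `tau`, and all names are illustrative):

```python
import numpy as np

def knn_interpolate(p_model, neighbor_tokens, neighbor_dists, vocab,
                    lam=0.3, tau=1.0):
    """Mix the ASR model's token distribution with a kNN distribution.

    neighbor_tokens / neighbor_dists: tokens and distances of the retrieved
    datastore entries for the current decoding step.
    """
    w = np.exp(-np.asarray(neighbor_dists, dtype=float) / tau)  # closer = heavier
    p_knn = np.zeros(vocab)
    for tok, wi in zip(neighbor_tokens, w):
        p_knn[tok] += wi                  # accumulate weight per token
    p_knn /= p_knn.sum()                  # normalize to a distribution
    return lam * p_knn + (1.0 - lam) * p_model   # kNN-LM style interpolation

p_model = np.array([0.1, 0.5, 0.2, 0.1, 0.1])    # toy 5-token vocabulary
p = knn_interpolate(p_model, neighbor_tokens=[2, 2, 1],
                    neighbor_dists=[0.1, 0.2, 0.5], vocab=5)
```

Because both mixed terms are proper distributions, the result still sums to one; `lam` trades the model's general knowledge against the retrieved subdialect evidence.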
