
MELLE: Autoregressive Speech Synthesis without Vector Quantization
Continuous-valued token based language modeling can be used for Text-to-Speech.
MELLE
- Models the continuous-valued token distribution using a Spectrogram Flux loss
- Incorporates variational inference to improve diversity and robustness
Paper (ACL 2025) : Paper Link
1. Introduction
Next-token prediction uses the previous tokens as a condition for the next discrete token..

UniCodec: Unified Audio Codec with Single Domain-Adaptive Codebook
An audio codec that supports multi-domain audio signals is needed.
UniCodec
- Uses a domain-adaptive codebook and a Mixture-of-Experts strategy to capture the distinct characteristics of each audio domain
- Applies a self-supervised mask prediction modeling approach to enrich the semantic density of the codec without auxiliary modules
Paper (ACL 2025) : Paper Link
1. Introduction
Speech Langua..

OZSpeech: One-Step Zero-Shot Speech Synthesis with Learned-Prior-Conditioned Flow Matching
Existing speech representations such as waveforms and spectrograms overlook speech attributes and incur high computational cost.
OZSpeech
- Uses one-step sampling with a learned prior as the condition to reduce the number of sampling steps
- Models speech attributes using disentangled, factorized components in token format
Paper (ACL 2025) : Paper Link
1. In..

DiTAR: Diffusion Transformer Autoregressive Modeling for Speech Generation
Combining diffusion models with autoregressive models leads to heavy computational load and suboptimal outcomes.
DiTAR
- Introduces a divide-and-conquer strategy for patch generation
- The language model processes aggregated patch embeddings, and a diffusion Transformer subsequently generates the next patch
- At inference, the time point at which noise is introduced during the reverse diffusion ODE, temperat..

BEATs: Audio Pre-Training with Acoustic Tokenizers
A Self-Supervised Learning framework for general audio representation pre-training is needed.
BEATs
- Uses a discrete label prediction task on labels obtained from a semantic-rich acoustic tokenizer
- Builds an iterative pipeline between the tokenizer and the pre-trained model
Paper (ICML 2023) : Paper Link
1. Introduction
Speech Self-Supervised Learning (SSL) such as Wav2Vec 2.0, HuBERT, WavLM, and Data2Vec..

TCSinger2: Customizable Multilingual Zero-Shot Singing Voice Synthesis
Existing Singing Voice Synthesis lacks multi-level style control through diverse prompts.
TCSinger2
- Predicts duration with a Blurred Boundary Content Encoder and extends the content embedding to support smooth transitions
- Extracts aligned representations from singing, speech, and textual prompts through a Custom Audio Encoder
- Additionally, with a Flow-based Custom Encoder, style modelin..

E3-TTS: Easy End-to-End Diffusion-based Text to Speech
An End-to-End diffusion-based Text-to-Speech model can produce high-fidelity speech.
E3-TTS
- Takes plain text as input and generates the waveform through an iterative refinement process
- In particular, does not rely on intermediate representations such as spectrogram features or alignment information
Paper (ASRU 2023) : Paper Link
1. Introduction
As in WaveGrad, DiffWave, and others, in Text-to-Speech (TTS) systems, Diffu..

E2-TTS: Embarrassingly Easy Fully Non-Autoregressive Zero-Shot TTS
A zero-shot Text-to-Speech model with high speaker similarity and intelligibility is needed.
E2-TTS
- Converts the text input into a character sequence with filler tokens
- Trains a Flow-Matching-based mel-spectrogram generator on the audio infilling task, removing the dependence on additional components such as a duration model
Paper (SLT 2024) : Paper Link
1. Introduction
VALL..
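
The filler-token conversion mentioned in the E2-TTS summary can be pictured with a small sketch. The excerpt does not say where the fillers go or how the target length is chosen, so the sketch assumes they are appended at the end until the sequence reaches a given length (for example, the number of mel-spectrogram frames). The symbol `<F>` and the function name `text_to_padded_chars` are hypothetical, chosen only for illustration.

```python
from typing import List

FILLER = "<F>"  # hypothetical filler symbol, not taken from the paper


def text_to_padded_chars(text: str, target_len: int, filler: str = FILLER) -> List[str]:
    """Split text into characters and append filler tokens up to target_len.

    Assumption: fillers are appended at the end, and target_len stands in
    for the length of the acoustic sequence (e.g. mel-spectrogram frames).
    """
    chars = list(text)
    if len(chars) > target_len:
        raise ValueError("text is longer than the target sequence length")
    return chars + [filler] * (target_len - len(chars))


padded = text_to_padded_chars("hi there", 12)
print(padded)
```

Because the padded sequence already has the same length as the acoustic sequence, no separate module is needed to decide how long each character lasts, which is consistent with the summary's point about dropping the duration model.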

Eta-WavLM: Efficient Speaker Identity Removal in Self-Supervised Speech Representations Using a Simple Linear Equation
Existing Self-Supervised Learning models do not fully disentangle speaker identity.
Eta-WavLM
- Linearly decomposes the Self-Supervised Learning representation into speaker-specific and speaker-independent components
- Then generates speaker-disentangled representations from the linearly decomposed features
Paper (ACL 2025)..
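
As a rough illustration of the linear decomposition described in the Eta-WavLM summary, the sketch below fits a linear map from speaker embeddings to frame-level features and keeps the residual as the speaker-independent part. The excerpt does not state how the linear equation is solved, so ordinary least squares with a bias term is an assumption here, and the data, dimensions, and variable names (`S`, `D`, `eta`) are synthetic placeholders rather than real WavLM features.

```python
import numpy as np

rng = np.random.default_rng(0)
N, Q, P = 200, 16, 4  # frames, feature dim, speaker-embedding dim (all illustrative)

S = rng.normal(size=(N, Q))  # stand-in for SSL frame features
D = rng.normal(size=(N, P))  # stand-in for the speaker embedding of each frame

# Append a column of ones so the fit includes a bias term.
D_aug = np.hstack([D, np.ones((N, 1))])

# Assumed solver: least squares for S ~= D_aug @ W.
W, *_ = np.linalg.lstsq(D_aug, S, rcond=None)

speaker_part = D_aug @ W  # component explained linearly by the speaker embedding
eta = S - speaker_part    # residual, treated as the speaker-independent component

print(eta.shape)
```

By construction of least squares, the residual `eta` is orthogonal to every column of `D_aug`, so no further linear function of the speaker embedding can explain it.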