
MELLE: Autoregressive Speech Synthesis without Vector Quantization
Continuous-valued token based language modeling can be used for Text-to-Speech.
MELLE
- Models the continuous-valued token distribution using a Spectrogram Flux loss
- Incorporates variational inference to improve diversity and robustness
Paper (ACL 2025): Paper Link
1. Introduction
Next-token prediction takes the previous tokens as the condition, and the next discrete token..

DiTAR: Diffusion Transformer Autoregressive Modeling for Speech Generation
Combining a diffusion model with an autoregressive model adds computational load and yields suboptimal outcomes.
DiTAR
- Introduces a divide-and-conquer strategy for patch generation
- The language model processes aggregated patch embeddings, and a diffusion Transformer subsequently generates the next patch
- At inference, the time point at which noise is introduced during the reverse diffusion ODE, temperat..

ELLA-V: Stable Neural Codec Language Modeling with Alignment-Guided Sequence Reordering
Language models conditioned on acoustic and linguistic prompts perform well in zero-shot audio synthesis.
ELLA-V
- Supports fine-grained control over synthesized audio at the phoneme level
- Interleaves the acoustic and phoneme token sequences so that phoneme tokens appear ahead of their corresponding acoustic tokens
Paper (AAAI 2025): Paper Link
1. Introduction
Zero-shot Text-to-Spe..

Make-A-Voice: Revisiting Voice Large Language Models as Scalable Multilingual and Multitask Learners
Most voice large language models are limited to a single task and a single language.
Make-A-Voice
- Builds a scalable learner with an end-to-end local/global multiscale transformer
- Improves in-context learning by sharing common knowledge and generalizing to unseen tasks
- Supports a multilingual learner that addresses data scarcity in low-resource languages
Paper..

Continuous Autoregressive Modeling with Stochastic Monotonic Alignment for Speech Synthesis
Autoregressive modeling can be used for speech synthesis.
CAM
- Uses a Variational AutoEncoder with a multi-modal latent space, together with an autoregressive model that uses a Gaussian Mixture Model as its conditional probability distribution
- In particular, through continuous speech representations in the Variational AutoEncoder's latent space, the training/inference pip..

CosyVoice: A Scalable Multilingual Zero-Shot Text-to-Speech Synthesizer based on Supervised Semantic Tokens
In Large Language Model-based Text-to-Speech, speech tokens are learned in an unsupervised manner.
- That is, they lack explicit semantic information and text alignment information
CosyVoice
- Uses supervised semantic tokens derived from a multilingual speech recognition model by inserting vector quantization into its encoder
- Based on these tokens..