
Make-A-Voice: Revisiting Voice Large Language Models as Scalable Multilingual and Multitask Learners
- Voice Large Language Models are mostly limited to single-task, monolingual settings
- Make-A-Voice
  - Builds a scalable learner on an end-to-end local/global multiscale transformer
  - Shares common knowledge and generalizes to unseen tasks, improving in-context learning
  - Supports a multilingual learner that addresses the data scarcity problem for low-resource languages
- Paper..

Continuous Autoregressive Modeling with Stochastic Monotonic Alignment for Speech Synthesis
- Autoregressive modeling can be leveraged for speech synthesis
- CAM
  - Uses a Variational AutoEncoder with a multi-modal latent space, together with an autoregressive model that adopts a Gaussian Mixture Model as the conditional probability distribution
  - In particular, by working with continuous speech representations in the Variational AutoEncoder's latent space, the training/inference pip..
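The CAM summary above mentions using a Gaussian Mixture Model as the conditional distribution of an autoregressive model over continuous latents. A minimal sketch of that idea, where the network predicts mixture weights, means, and log-stds for each next frame; `predict_params` is a hypothetical stand-in, not CAM's actual network:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_gmm_frame(logits, means, log_stds):
    """Sample one continuous latent frame from a predicted Gaussian mixture.

    logits: (K,) unnormalized mixture weights
    means, log_stds: (K, D) per-component parameters
    """
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()
    k = rng.choice(len(weights), p=weights)  # pick a mixture component
    return means[k] + np.exp(log_stds[k]) * rng.standard_normal(means.shape[1])

# Toy autoregressive loop: predict_params stands in for the network that
# maps the generated history to GMM parameters.
def predict_params(history, K=4, D=8):
    base = history[-1] if history else np.zeros(D)
    return np.zeros(K), np.tile(base, (K, 1)), np.full((K, D), -1.0)

frames = []
for _ in range(5):
    logits, means, log_stds = predict_params(frames)
    frames.append(sample_gmm_frame(logits, means, log_stds))

print(len(frames), frames[0].shape)  # 5 frames of dimension 8
```

Sampling from a mixture (rather than a single Gaussian) is what lets the conditional distribution stay multi-modal at each step.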

CosyVoice: A Scalable Multilingual Zero-Shot Text-to-Speech Synthesizer based on Supervised Semantic Tokens
- In Large Language Model-based Text-to-Speech, speech tokens are learned in an unsupervised manner
  - That is, they lack explicit semantic information and text alignment information
- CosyVoice
  - Inserts vector quantization into the encoder to obtain supervised semantic tokens derived from a multilingual speech recognition model
  - Based on these tokens, ..
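The core mechanism here, inserting vector quantization so that continuous encoder features become discrete token ids, can be sketched as a nearest-codebook lookup (toy shapes and a random codebook for illustration, not CosyVoice's actual layer):

```python
import numpy as np

def vector_quantize(features, codebook):
    """Map continuous encoder features to discrete token ids via the nearest
    codebook entry (the core of a VQ layer inserted into an encoder).

    features: (T, D) frame-level encoder outputs
    codebook: (V, D) learned code vectors
    returns (token_ids, quantized_features)
    """
    # Squared L2 distance between every frame and every code vector.
    d = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)  # (T, V)
    ids = d.argmin(axis=1)
    return ids, codebook[ids]

rng = np.random.default_rng(0)
feats = rng.standard_normal((10, 16))   # 10 frames, 16-dim features
codes = rng.standard_normal((32, 16))   # toy 32-entry codebook
ids, quantized = vector_quantize(feats, codes)
print(ids.shape, quantized.shape)  # (10,) (10, 16)
```

Because the encoder is trained with recognition supervision, the resulting ids carry semantic content rather than purely acoustic detail.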

HALL-E: Hierarchical Neural Codec Language Model for Minute-Long Zero-Shot Text-to-Speech Synthesis
- Large Language Model-based text-to-speech models that use discrete audio tokens struggle with long-form speech synthesis due to the high frame rate
- HALL-E
  - Introduces Multi-Resolution Requantization to reduce the frame rate of the neural audio codec
    - Here, discrete audio tokens are reorganized via teacher-student distillation using a Multi-Resolution Residual..
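Frame-rate reduction by requantizing at a coarser resolution can be illustrated with a toy sketch (average-pooling plus nearest-code lookup; a conceptual stand-in, not HALL-E's actual MReQ module):

```python
import numpy as np

def reduce_frame_rate(features, factor):
    """Coarsen a frame sequence by average-pooling consecutive frames,
    so the sequence can be requantized at a lower frame rate."""
    T, D = features.shape
    T_out = T // factor
    return features[: T_out * factor].reshape(T_out, factor, D).mean(axis=1)

def nearest_code(features, codebook):
    """Requantize pooled frames against a (toy) codebook."""
    d = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return d.argmin(axis=1)

rng = np.random.default_rng(0)
feats = rng.standard_normal((48, 8))    # e.g. 48 frames at a "high" rate
codes = rng.standard_normal((16, 8))    # toy codebook
low_rate = reduce_frame_rate(feats, factor=3)   # 16 frames: 3x fewer tokens
tokens = nearest_code(low_rate, codes)
print(low_rate.shape, tokens.shape)  # (16, 8) (16,)
```

Fewer tokens per second is what makes minute-long sequences tractable for the language model.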

UniAudio: Towards Universal Audio Generation with Large Language Models
- A universal audio generation model that can handle diverse tasks in a unified manner is needed
- UniAudio
  - Builds a Large Language Model-based audio generation model that generates speech, sound, music, singing voice, etc. from diverse input conditions such as phonemes, text descriptions, and audio
  - Audio tokenization and a language model architecture for improving model performance and efficiency ..
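Serializing heterogeneous conditions and targets into one token stream for a single language model can be sketched as follows (the offsets and special tokens are made up for illustration, not UniAudio's exact scheme):

```python
def build_sequence(segments, vocab_offsets, special):
    """Flatten multiple modalities (phoneme, text, audio tokens, ...) into a
    single LM sequence by offsetting each modality into a shared vocabulary
    and bracketing it with start/end specials."""
    seq = []
    for name, tokens in segments:
        seq.append(special[f"<{name}>"])
        seq.extend(t + vocab_offsets[name] for t in tokens)
        seq.append(special[f"</{name}>"])
    return seq

special = {"<phoneme>": 0, "</phoneme>": 1, "<audio>": 2, "</audio>": 3}
offsets = {"phoneme": 10, "audio": 100}   # toy sub-vocabulary offsets
seq = build_sequence([("phoneme", [1, 4, 2]), ("audio", [7, 7, 3])],
                     offsets, special)
print(seq)  # [0, 11, 14, 12, 1, 2, 107, 107, 103, 3]
```

With all tasks expressed as such sequences, one decoder-only model can be trained on them jointly.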

MaskGCT: Zero-Shot Text-to-Speech with Masked Generative Codec Transformer
- Large-scale text-to-speech systems can be divided into autoregressive and non-autoregressive approaches
  - Autoregressive approaches have limitations in robustness and duration controllability
  - Non-autoregressive approaches require explicit alignment information between text and speech during training
- MaskGCT
  - Explicit alignment information between text and speech supervision and phone-level duratio..
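Masked generative decoding in the MaskGIT style, which such non-autoregressive codec transformers build on, can be sketched as follows (the toy `predict` stands in for the conditional codec transformer):

```python
import numpy as np

MASK = -1

def masked_generative_decode(T, predict, steps=4):
    """Iterative parallel decoding: start fully masked, then at each step
    keep the most confident predictions and re-mask the rest.

    predict(tokens) -> (T, V) probabilities; here a toy stand-in for the
    conditional codec transformer.
    """
    tokens = np.full(T, MASK)
    for step in range(steps):
        probs = predict(tokens)                # (T, V)
        conf, guess = probs.max(1), probs.argmax(1)
        conf[tokens != MASK] = np.inf          # never re-mask fixed tokens
        # Cosine schedule: how many positions stay masked after this step.
        keep_masked = int(T * np.cos(np.pi / 2 * (step + 1) / steps))
        order = np.argsort(conf)               # least confident first
        tokens = guess.copy()
        tokens[order[:keep_masked]] = MASK
    return tokens

# Toy predictor: fixed random per-position distributions.
rng = np.random.default_rng(0)
table = rng.random((20, 50))
table /= table.sum(1, keepdims=True)
out = masked_generative_decode(20, lambda t: table, steps=4)
print((out == MASK).sum())  # 0: every position is filled after the last step
```

All positions are predicted in parallel at each step, so no token-by-token alignment between text and speech is needed at decode time.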