
Continuous Autoregressive Modeling with Stochastic Monotonic Alignment for Speech Synthesis
Autoregressive modeling can be leveraged for speech synthesis
CAM
- Uses a Variational AutoEncoder with a multi-modal latent space, together with an autoregressive model that adopts a Gaussian Mixture Model as its conditional probability distribution
- In particular, through continuous speech representations in the Variational AutoEncoder's latent space, the training/inference pipeline…
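The core idea above, an autoregressive model whose per-step conditional distribution is a Gaussian mixture over continuous latents, can be sketched as follows. The function names, the toy two-component mixture, and the diagonal-scale parameterization are all illustrative assumptions, not the paper's actual model.

```python
import numpy as np

def sample_gmm(weights, means, scales, rng):
    """Sample one continuous latent from a diagonal Gaussian mixture.
    weights: (K,), means/scales: (K, D). Parameterization is assumed."""
    k = rng.choice(len(weights), p=weights)   # pick a mixture component
    return means[k] + scales[k] * rng.standard_normal(means.shape[1])

def autoregressive_rollout(step_fn, z0, n_steps, rng):
    """Roll out latents z_1..z_n, each drawn from a GMM conditioned on z_{t-1}."""
    zs = [z0]
    for _ in range(n_steps):
        w, m, s = step_fn(zs[-1])             # predict mixture params from history
        zs.append(sample_gmm(w, m, s, rng))
    return np.stack(zs)

# Toy "model": a fixed two-component mixture centered near the previous latent.
def toy_step(z_prev):
    w = np.array([0.5, 0.5])
    m = np.stack([z_prev + 0.1, z_prev - 0.1])
    s = np.full((2, z_prev.shape[0]), 0.01)
    return w, m, s

rng = np.random.default_rng(0)
traj = autoregressive_rollout(toy_step, np.zeros(4), 10, rng)
print(traj.shape)  # (11, 4)
```

In a real system `toy_step` would be a neural network consuming the latent history; sampling continuous latents avoids the quantization step that discrete-token pipelines require.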

CosyVoice: A Scalable Multilingual Zero-Shot Text-to-Speech Synthesizer based on Supervised Semantic Tokens
In Large Language Model-based Text-to-Speech, speech tokens are learned in an unsupervised manner
- That is, they lack explicit semantic information and text alignment information
CosyVoice
- Inserts vector quantization into the encoder to leverage supervised semantic tokens derived from a multilingual speech recognition model
- Based on these tokens, …
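The vector-quantization step mentioned above amounts to mapping each encoder frame to its nearest codebook entry. A minimal sketch, assuming a random toy codebook (CosyVoice's quantizer sits inside a trained ASR encoder, which this does not model):

```python
import numpy as np

def vector_quantize(h, codebook):
    """Map each encoder frame to its nearest codebook entry, yielding one
    semantic token id per frame. Shapes: h (T, D), codebook (K, D)."""
    # squared Euclidean distances between every frame and every code
    d2 = ((h[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    ids = d2.argmin(axis=1)           # token index per frame
    return ids, codebook[ids]         # token ids and their quantized vectors

rng = np.random.default_rng(0)
codebook = rng.standard_normal((8, 4))   # toy codebook: 8 codes, dim 4
frames = rng.standard_normal((5, 4))     # 5 encoder output frames
ids, q = vector_quantize(frames, codebook)
print(ids.shape, q.shape)  # (5,) (5, 4)
```

Because the encoder here belongs to a speech recognition model trained with text supervision, the resulting token ids carry semantic and alignment information that purely unsupervised codec tokens lack.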

HALL-E: Hierarchical Neural Codec Language Model for Minute-Long Zero-Shot Text-to-Speech Synthesis
Large Language Model-based text-to-speech models that use discrete audio tokens struggle with long-form speech synthesis due to their high frame rate
HALL-E
- Introduces Multi-Resolution Requantization to reduce the frame rate of the neural audio codec
- Here, a Multi-Resolution Residual… that reorganizes the discrete audio tokens through teacher-student distillation
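One way to picture frame-rate reduction via requantization is a two-resolution scheme: a coarse codebook summarizes each window of frames at a lower rate, and a fine codebook quantizes the per-frame residual. The sketch below is a hypothetical illustration of that idea, not HALL-E's actual distillation-based quantizer.

```python
import numpy as np

def nearest(codebook, x):
    """Quantize rows of x to the nearest codebook rows."""
    d2 = ((x[:, None] - codebook[None]) ** 2).sum(-1)
    ids = d2.argmin(1)
    return ids, codebook[ids]

def two_stage_requantize(frames, cb_coarse, cb_fine, factor=4):
    """Hypothetical two-resolution requantization: one coarse token per
    window of `factor` frames (the reduced frame rate), plus a fine token
    per frame for the residual."""
    T, D = frames.shape
    windows = frames.reshape(T // factor, factor, D).mean(1)  # low-rate summary
    c_ids, c_vec = nearest(cb_coarse, windows)
    residual = frames - np.repeat(c_vec, factor, axis=0)      # upsample & subtract
    f_ids, _ = nearest(cb_fine, residual)
    return c_ids, f_ids

rng = np.random.default_rng(0)
frames = rng.standard_normal((16, 8))
c_ids, f_ids = two_stage_requantize(frames, rng.standard_normal((32, 8)),
                                    rng.standard_normal((32, 8)))
print(len(c_ids), len(f_ids))  # 4 16
```

The language model then only needs to attend over the short coarse sequence, which is what makes minute-long synthesis tractable.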

UniAudio: Towards Universal Audio Generation with Large Language Models
A universal audio generation model that can handle diverse tasks in a unified manner is needed
UniAudio
- Builds a Large Language Model-based audio generation model that generates speech, sound, music, singing voice, etc. conditioned on diverse inputs such as phonemes, text descriptions, and audio
- Audio tokenization and a language model architecture for improving model performance and efficiency are …
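Handling many tasks with one decoder-only language model comes down to serializing heterogeneous conditions and the audio target into a single token stream. A minimal sketch in that spirit; the special token ids and vocabulary offsets below are made up for illustration and are not UniAudio's actual scheme:

```python
# Illustrative special tokens and vocabulary offsets (assumed, not from the paper)
BOS, TASK_TTS, SEP, EOS = 0, 1, 2, 3
PHONEME_OFFSET, AUDIO_OFFSET = 10, 1000

def build_sequence(phonemes, audio_tokens):
    """Flatten task tag + phoneme condition + audio target into one sequence
    so a single next-token predictor can handle the whole task."""
    seq = [BOS, TASK_TTS]
    seq += [PHONEME_OFFSET + p for p in phonemes]    # condition sub-sequence
    seq.append(SEP)
    seq += [AUDIO_OFFSET + a for a in audio_tokens]  # target sub-sequence
    seq.append(EOS)
    return seq

seq = build_sequence([4, 7, 2], [15, 3, 99, 0])
print(seq)  # [0, 1, 14, 17, 12, 2, 1015, 1003, 1099, 1000, 3]
```

Swapping the task tag and condition sub-sequence is all it takes to express a different task (e.g. text-to-sound instead of TTS) over the same model.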

MaskGCT: Zero-Shot Text-to-Speech with Masked Generative Codec Transformer
Large-scale text-to-speech systems can be divided into autoregressive and non-autoregressive approaches
- Autoregressive approaches have limitations in robustness and duration controllability
- Non-autoregressive approaches require explicit alignment information between text and speech during training
MaskGCT
- Explicit alignment information between text and speech supervision and phone-level duration…
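Masked generative decoding, the non-autoregressive alternative named above, can be sketched as iterative mask-and-predict: start fully masked, let the model fill every slot, then keep a growing fraction of the most confident predictions each step. The reveal schedule and toy "model" below are assumptions for illustration.

```python
import random

MASK = -1

def iterative_unmask(tokens_len, predict_fn, n_steps=4, rng=None):
    """Mask-and-predict decoding sketch: reveal a linearly growing number of
    the most confident predictions per step until the sequence is complete."""
    seq = [MASK] * tokens_len
    for step in range(1, n_steps + 1):
        preds, confs = predict_fn(seq)            # model fills every position
        keep = int(tokens_len * step / n_steps)   # linear reveal schedule
        order = sorted(range(tokens_len), key=lambda i: -confs[i])
        seq = [MASK] * tokens_len
        for i in order[:keep]:
            seq[i] = preds[i]
    return seq

# Toy "model": always predicts token 7, with pseudo-random confidences.
def toy_predict(seq):
    rng = random.Random(sum(1 for t in seq if t != MASK))
    return [7] * len(seq), [rng.random() for _ in seq]

out = iterative_unmask(8, toy_predict)
print(out)  # [7, 7, 7, 7, 7, 7, 7, 7]
```

Because every position is generated in parallel, no phone-level duration predictor or explicit text-speech alignment is needed at decoding time.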

Generative Pre-trained Speech Language Model with Efficient Hierarchical Transformer
Speech language models still have limitations in modeling the long acoustic sequences of neural audio codecs
Generative Pre-trained Speech Transformer (GPST)
- Quantizes the audio waveform into two kinds of discrete speech representations and integrates them into a hierarchical transformer architecture
- Trained in an end-to-end unsupervised manner, so that diverse speaker ident…
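A back-of-the-envelope calculation shows why a hierarchical transformer helps with long codec sequences: self-attention cost grows quadratically with sequence length, so handling chunks locally and only their summaries globally is far cheaper. The numbers below are illustrative, not from the paper.

```python
def flat_cost(T):
    """O(T^2) attention pairs for one full-sequence transformer."""
    return T * T

def hierarchical_cost(T, w):
    """Local attention inside each chunk of width w, plus global attention
    over the T/w chunk summaries."""
    n_chunks = T // w
    local = n_chunks * w * w        # per-chunk quadratic cost
    global_ = n_chunks * n_chunks   # quadratic only in the number of chunks
    return local + global_

T, w = 6000, 30   # e.g. ~2 minutes of 50 Hz codec tokens, chunks of 30
print(flat_cost(T))             # 36000000
print(hierarchical_cost(T, w))  # 180000 + 40000 = 220000
```

Here the hierarchical split is over 160x cheaper in attention pairs, which is what makes modeling long acoustic sequences feasible.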