
UniAudio: Towards Universal Audio Generation with Large Language Models
A universal audio generation model that can handle diverse tasks in a unified manner is needed
UniAudio
- Builds a Large Language Model-based audio generation model that generates speech, sound, music, singing voices, and more from diverse input conditions such as phonemes, text descriptions, and audio
- Designs the audio tokenization and the language model architecture to improve model performance and efficiency…

MaskGCT: Zero-Shot Text-to-Speech with Masked Generative Codec Transformer
Large-scale text-to-speech systems can be divided into autoregressive and non-autoregressive approaches
- Autoregressive approaches are limited in robustness and duration controllability
- Non-autoregressive approaches require explicit alignment information between text and speech during training
MaskGCT
- Removes the need for explicit alignment information between text and speech supervision, as well as phone-level duration…
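The non-autoregressive decoding behind masked generative codec transformers such as MaskGCT can be sketched as iterative mask-predict: start from a fully masked sequence, fill every position, then re-mask the least confident positions for the next pass. The `predict` stub below stands in for the real transformer (it returns fixed tokens and synthetic confidences), so only the decoding loop itself is illustrated.

```python
# Minimal sketch of iterative mask-predict decoding; `predict` is a
# deterministic stand-in for the actual masked generative transformer.
import math

MASK = -1

def predict(tokens):
    target = [10, 20, 30, 40]    # pretend model predictions
    conf = [0.9, 0.4, 0.8, 0.3]  # pretend per-position confidences
    return target, conf

def mask_predict(length=4, steps=2):
    tokens = [MASK] * length
    for s in range(steps, 0, -1):
        target, conf = predict(tokens)
        # fill every masked position with the current prediction
        tokens = [t if t != MASK else p for t, p in zip(tokens, target)]
        if s > 1:
            # re-mask the lowest-confidence positions for the next pass
            n_mask = math.ceil(length * (s - 1) / steps)
            order = sorted(range(length), key=lambda i: conf[i])
            for i in order[:n_mask]:
                tokens[i] = MASK
    return tokens

print(mask_predict())  # [10, 20, 30, 40]
```

Because every pass predicts all positions in parallel, the number of forward passes is fixed by `steps` rather than by sequence length, which is where the speed advantage over autoregressive decoding comes from.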

Generative Pre-trained Speech Language Model with Efficient Hierarchical Transformer
Speech language models still struggle to model the long acoustic sequences produced by neural audio codecs
Generative Pre-trained Speech Transformer (GPST)
- Quantizes the audio waveform into two kinds of discrete speech representations and integrates them into a hierarchical transformer architecture
- Trained end-to-end in an unsupervised manner, enabling diverse speaker identities…
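The hierarchical factorization used by models like GPST can be sketched as a reshape: a flat codec stream of T frames with K codes each becomes T frame slots for a "global" transformer, with a small "local" transformer handling the K codes inside each frame. The helper below shows only the grouping; the function name and shapes are illustrative, not from the paper.

```python
# Sketch of grouping a flat codec stream into frames so a global model
# attends over T slots instead of K*T tokens. Illustrative only.
def to_hierarchy(flat, K):
    """Group a flat code stream into consecutive frames of K codes each."""
    assert len(flat) % K == 0, "stream length must be a multiple of K"
    return [flat[i:i + K] for i in range(0, len(flat), K)]

frames = to_hierarchy([1, 2, 3, 4, 5, 6], K=3)
# global model sees 2 frame slots; local model sees 3 codes per frame
```

The design point is that self-attention cost is quadratic in sequence length, so shrinking the global sequence from K*T to T is what makes long codec sequences tractable.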

SpeechX: Neural Codec Language Model as a Versatile Speech Transformer
Audio-text prompt-based speech models are limited in handling tasks beyond text-to-speech
SpeechX
- A speech model supporting diverse tasks such as zero-shot Text-to-Speech, Speech Editing, Noise Suppression, and Target Speaker Extraction
- Introduces multi-task learning based on neural codec language modeling and task-dependent prompting
Paper (TASLP 2024): Paper Link…
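Task-dependent prompting, as described for SpeechX, amounts to prepending a task token so a single codec language model can route between TTS, noise suppression, editing, and so on. The token ids, task names, and function below are illustrative assumptions, not SpeechX's actual vocabulary.

```python
# Sketch of task-dependent prompting in a multi-task codec LM:
# the sequence fed to the model is [task token | text prompt | acoustic prompt].
# All ids here are hypothetical.
TASK_TOKENS = {"tts": 0, "noise_suppression": 1, "speech_editing": 2}

def build_prompt(task, text_tokens, acoustic_tokens):
    """Concatenate a task token, text prompt, and acoustic prompt."""
    return [TASK_TOKENS[task]] + text_tokens + acoustic_tokens

seq = build_prompt("tts", [101, 102], [7, 8, 9])
# -> [0, 101, 102, 7, 8, 9]
```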

Speak, Read and Prompt: High-Fidelity Text-to-Speech with Minimal Supervision
A multi-speaker text-to-speech model that can be trained with minimal supervision is needed
SPEAR-TTS
- Casts text-to-speech as a sequence-to-sequence task by combining two discrete speech representations: text to high-level semantic tokens (Reading) and semantic tokens to low-level acoustic tokens (Speaking)
- In particular, leverages abundant audio-only data for the Speak…
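The two-stage cascade described for SPEAR-TTS can be sketched as a simple function composition: "Reading" maps text to semantic tokens and "Speaking" maps semantic to acoustic tokens. Both stage functions below are toy stubs (one token per character, two codes per token), not the real models; they only show why the Speaking stage can be trained on audio-only data: its inputs and outputs are both derived from audio.

```python
# Sketch of the SPEAR-TTS-style two-stage pipeline with stub stages.
def reading(text):
    """Text -> high-level semantic tokens (stub: one token per character)."""
    return [ord(c) for c in text]

def speaking(semantic_tokens):
    """Semantic -> low-level acoustic tokens (stub: two codes per token)."""
    return [(t, t + 1) for t in semantic_tokens]

def tts(text):
    return speaking(reading(text))

codes = tts("hi")  # [(104, 105), (105, 106)]
```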

VoiceCraft: Zero-Shot Speech Editing and Text-to-Speech in the Wild
A token infilling neural codec language model can be built for speech editing and zero-shot text-to-speech
VoiceCraft
- Introduces token rearrangement that combines a Transformer decoder architecture with causal masking and delayed stacking to perform generation within an existing sequence
- Additionally provides the RealEdit dataset for speech editing evaluation
Paper (ACL 2024): Paper Link…
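The delayed stacking mentioned above can be sketched as shifting each residual codebook one step further right, so that at any position a causal model conditions on earlier frames of the coarser codebooks. The function name and padding token below are illustrative, not VoiceCraft's actual implementation.

```python
# Hedged sketch of delayed stacking for K codebooks over T frames:
# codebook k is shifted right by k steps and padded. Illustrative only.
from typing import List

PAD = -1  # hypothetical padding token filling the delay offsets

def delay_stack(codes: List[List[int]]) -> List[List[int]]:
    """Shift codebook k right by k steps, padding both ends to equal length."""
    K = len(codes)
    return [[PAD] * k + row + [PAD] * (K - 1 - k)
            for k, row in enumerate(codes)]

stacked = delay_stack([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
# row 0: [1, 2, 3, PAD, PAD]
# row 1: [PAD, 4, 5, 6, PAD]
# row 2: [PAD, PAD, 7, 8, 9]
```

With this layout all K codebooks at one output position can be predicted in a single autoregressive step while the causal dependency between coarse and fine codes is preserved, at the cost of K - 1 extra positions.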