DualCodec: A Low-Frame-Rate, Semantically-Enhanced Neural Audio Codec for Speech Generation
Neural audio codecs face a trade-off between frame rate and audio quality.
DualCodec
- Integrates Self-Supervised Learning representations with waveform representations
- Enhances the semantic information of the first-layer codec and operates at a low frame rate
Paper (INTERSPEECH 2025): Paper Link
1. Introduction
Neural audio codecs represent audio signals as discrete code..
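Below is a minimal sketch of the idea summarized above: the first quantizer layer codes an SSL-derived semantic feature, and residual layers quantize the remaining acoustic detail. The dimensions, module names, and the simple nearest-neighbor VQ are illustrative assumptions, not the official DualCodec implementation.

```python
import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    def __init__(self, codebook_size: int, dim: int):
        super().__init__()
        self.codebook = nn.Embedding(codebook_size, dim)

    def forward(self, x):  # x: (batch, frames, dim)
        # nearest codebook entry per frame (a real codec adds straight-through gradients)
        dist = torch.cdist(x, self.codebook.weight.unsqueeze(0).expand(x.size(0), -1, -1))
        idx = dist.argmin(dim=-1)
        return self.codebook(idx), idx

class DualCodecSketch(nn.Module):
    def __init__(self, ssl_dim=1024, ac_dim=256, codebook_size=1024, n_residual=3):
        super().__init__()
        self.semantic_proj = nn.Linear(ssl_dim, ac_dim)             # SSL feature -> codec space
        self.semantic_vq = VectorQuantizer(codebook_size, ac_dim)   # layer 1: semantic codes
        self.residual_vqs = nn.ModuleList(
            VectorQuantizer(codebook_size, ac_dim) for _ in range(n_residual)
        )

    def forward(self, ssl_feat, acoustic_feat):
        # first-layer codes carry semantic content from the SSL representation
        sem_q, sem_idx = self.semantic_vq(self.semantic_proj(ssl_feat))
        residual = acoustic_feat - sem_q           # remaining acoustic detail
        quantized = sem_q
        for vq in self.residual_vqs:               # residual layers refine audio quality
            q, _ = vq(residual)
            quantized = quantized + q
            residual = residual - q
        return quantized, sem_idx

codec = DualCodecSketch()
ssl = torch.randn(2, 50, 1024)     # e.g. 50 low-frame-rate SSL frames
ac = torch.randn(2, 50, 256)
out, codes = codec(ssl, ac)
print(out.shape, codes.shape)      # torch.Size([2, 50, 256]) torch.Size([2, 50])
```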
EmoMix: Emotion Mixing via Diffusion Models for Emotional Speech Synthesis
Emotional Text-to-Speech still has limitations in intensity control.
EmoMix
- Leverages a pre-trained Speech Emotion Recognition model to extract emotion embeddings
- Performs mixed emotion synthesis based on a diffusion model at run time
Paper (INTERSPEECH 2023): Paper Link
1. Introduction
Emotional Text-to-Speech (TTS) models such as GenerSpeech use reference-based style..
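A minimal sketch of run-time mixing follows, assuming (as the summary states) that emotion embeddings come from a pre-trained SER model and condition a diffusion denoiser. The linear interpolation of the two embeddings, the toy denoiser, and the Euler-style update are illustrative assumptions, not the paper's exact scheme.

```python
import torch
import torch.nn as nn

class TinyDenoiser(nn.Module):
    """Stand-in denoiser: predicts noise from (mel, timestep, emotion condition)."""
    def __init__(self, mel_bins=80, cond_dim=128):
        super().__init__()
        self.cond_proj = nn.Linear(cond_dim, mel_bins)
        self.net = nn.Conv1d(mel_bins, mel_bins, kernel_size=3, padding=1)

    def forward(self, x, t, cond):
        # inject the emotion condition as a per-bin bias (toy conditioning)
        return self.net(x + self.cond_proj(cond).unsqueeze(-1) + t.view(-1, 1, 1))

@torch.no_grad()
def mix_emotions(denoiser, emb_a, emb_b, alpha=0.5, steps=50):
    cond = alpha * emb_a + (1 - alpha) * emb_b   # mixed-emotion condition
    x = torch.randn(1, 80, 200)                  # start from Gaussian noise, mel-shaped
    for i in reversed(range(1, steps + 1)):
        t = torch.full((1,), i / steps)
        eps = denoiser(x, t, cond)               # noise prediction under the mix
        x = x - eps / steps                      # toy Euler-style reverse update
    return x

emb_a = torch.randn(1, 128)   # would come from a pre-trained SER model
emb_b = torch.randn(1, 128)
mel = mix_emotions(TinyDenoiser(), emb_a, emb_b, alpha=0.7)
print(mel.shape)  # torch.Size([1, 80, 200])
```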
StarVC: A Unified Auto-Regressive Framework for Joint Text and Speech Generation in Voice Conversion
Existing voice conversion models neglect the explicit utilization of linguistic content.
StarVC
- Integrates explicit text modeling into voice conversion
- Uses an autoregressive framework that first predicts text tokens and then synthesizes acoustic features
Paper (INTERSPEECH 2025): Paper Link
1. Introduction
Voice Conversion (VC) converts an ut..
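The two-stage ordering described above can be sketched as a single autoregressive decoder whose output vocabulary is masked: text tokens are emitted first, acoustic tokens afterwards. The shared decoder, vocabulary sizes, and greedy decoding are illustrative assumptions.

```python
import torch
import torch.nn as nn

class JointARDecoder(nn.Module):
    def __init__(self, n_text=100, n_acoustic=1024, dim=256):
        super().__init__()
        self.n_text = n_text
        self.embed = nn.Embedding(n_text + n_acoustic, dim)
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.decoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, n_text + n_acoustic)

    def forward(self, tokens):
        h = self.embed(tokens)
        mask = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
        return self.head(self.decoder(h, mask=mask))

@torch.no_grad()
def convert(model, src_tokens, text_len=5, ac_len=10):
    seq = src_tokens
    for step in range(text_len + ac_len):
        logits = model(seq)[:, -1]
        if step < text_len:                        # stage 1: predict text tokens
            logits[:, model.n_text:] = float("-inf")
        else:                                      # stage 2: synthesize acoustic tokens
            logits[:, :model.n_text] = float("-inf")
        seq = torch.cat([seq, logits.argmax(-1, keepdim=True)], dim=1)
    return seq

model = JointARDecoder()
out = convert(model, torch.randint(0, 100, (1, 3)))
print(out.shape)  # torch.Size([1, 18])
```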
MELLE: Autoregressive Speech Synthesis without Vector Quantization
Continuous-valued token based language modeling can be leveraged for Text-to-Speech.
MELLE
- Models the continuous-valued token distribution using a Spectrogram Flux loss
- Incorporates variational inference to improve diversity and robustness
Paper (ACL 2025): Paper Link
1. Introduction
Next-token prediction predicts the next discrete token conditioned on previous token..
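A minimal sketch of the two ingredients named above, not MELLE's full architecture: a Gaussian head samples each continuous mel frame via the reparameterization trick (variational inference), and a flux-style loss regularizes frame-to-frame spectral change. The exact flux formulation, shapes, and loss weights here are assumptions.

```python
import torch
import torch.nn as nn

class GaussianMelHead(nn.Module):
    def __init__(self, dim=256, mel_bins=80):
        super().__init__()
        self.mu = nn.Linear(dim, mel_bins)
        self.logvar = nn.Linear(dim, mel_bins)

    def forward(self, h):
        mu, logvar = self.mu(h), self.logvar(h)
        eps = torch.randn_like(mu)
        frame = mu + (0.5 * logvar).exp() * eps   # reparameterized sample
        return frame, mu, logvar

def spectrogram_flux_loss(pred, target):
    """Match the frame-to-frame change (flux) of prediction and target."""
    pred_flux = pred[:, 1:] - pred[:, :-1]
    target_flux = target[:, 1:] - target[:, :-1]
    return (pred_flux - target_flux).abs().mean()

head = GaussianMelHead()
h = torch.randn(2, 100, 256)                      # hidden states from an AR LM
frames, mu, logvar = head(h)
target = torch.randn(2, 100, 80)
loss = (frames - target).pow(2).mean() + 0.5 * spectrogram_flux_loss(frames, target)
print(frames.shape, loss.item())
```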
UniCodec: Unified Audio Codec with Single Domain-Adaptive Codebook
An audio codec that supports multi-domain audio signals is needed.
UniCodec
- Uses a domain-adaptive codebook and a Mixture-of-Experts strategy to capture the distinct characteristics of each audio domain
- Applies a self-supervised mask prediction modeling approach to enrich the codec's semantic density without auxiliary modules
Paper (ACL 2025): Paper Link
1. Introduction
Speech Langua..
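The domain-adaptive codebook with Mixture-of-Experts routing can be sketched as below: a router assigns each utterance to one domain expert, and that expert's codebook quantizes the frames. The three domains, hard top-1 routing, and per-domain codebooks are illustrative assumptions.

```python
import torch
import torch.nn as nn

class DomainAdaptiveVQ(nn.Module):
    def __init__(self, dim=256, codebook_size=1024, domains=("speech", "music", "sound")):
        super().__init__()
        self.codebooks = nn.ModuleDict(
            {d: nn.Embedding(codebook_size, dim) for d in domains}
        )
        self.router = nn.Linear(dim, len(domains))   # picks a domain expert
        self.domains = domains

    def forward(self, x):  # x: (batch, frames, dim)
        # route each utterance to one domain codebook (hard top-1 routing)
        domain_idx = self.router(x.mean(dim=1)).argmax(dim=-1)
        out = torch.empty_like(x)
        for b in range(x.size(0)):
            cb = self.codebooks[self.domains[int(domain_idx[b])]].weight
            idx = torch.cdist(x[b], cb).argmin(dim=-1)   # nearest code per frame
            out[b] = cb[idx]
        return out, domain_idx

vq = DomainAdaptiveVQ()
x = torch.randn(4, 50, 256)
q, dom = vq(x)
print(q.shape, dom)   # torch.Size([4, 50, 256]) and chosen domain per item
```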
OZSpeech: One-Step Zero-Shot Speech Synthesis with Learned-Prior-Conditioned Flow Matching
Existing speech representations such as waveforms and spectrograms overlook speech attributes and incur high computational cost.
OZSpeech
- Reduces the number of sampling steps through one-step sampling with a learned prior as the condition
- Models speech attributes with disentangled, factorized components in token format
Paper (ACL 2025): Paper Link
1. In..
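A minimal sketch of one-step sampling with a learned prior as the flow-matching source: instead of integrating an ODE from Gaussian noise, sampling starts from a prior predicted from the conditioning and applies a single Euler step along the learned velocity field. The prior network and velocity field here are illustrative assumptions, not OZSpeech's exact modules.

```python
import torch
import torch.nn as nn

class LearnedPrior(nn.Module):
    """Maps conditioning tokens to a prior sample close to the target manifold."""
    def __init__(self, cond_dim=256, mel_bins=80):
        super().__init__()
        self.proj = nn.Linear(cond_dim, mel_bins)

    def forward(self, cond):          # cond: (batch, frames, cond_dim)
        return self.proj(cond)        # x0: starting point instead of pure noise

class VelocityField(nn.Module):
    def __init__(self, mel_bins=80):
        super().__init__()
        self.net = nn.Linear(mel_bins + 1, mel_bins)

    def forward(self, x, t):
        t_feat = t.expand(*x.shape[:-1]).unsqueeze(-1)   # broadcast time feature
        return self.net(torch.cat([x, t_feat], dim=-1))

@torch.no_grad()
def one_step_sample(prior, velocity, cond):
    x0 = prior(cond)                          # learned prior, not N(0, I)
    v = velocity(x0, torch.zeros(1))          # velocity at t = 0
    return x0 + v                             # single Euler step: x1 = x0 + 1 * v

cond = torch.randn(2, 120, 256)
mel = one_step_sample(LearnedPrior(), VelocityField(), cond)
print(mel.shape)   # torch.Size([2, 120, 80])
```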
DiTAR: Diffusion Transformer Autoregressive Modeling for Speech Generation
Combining diffusion models with autoregressive models incurs a heavy computational load and suboptimal outcomes.
DiTAR
- Introduces a divide-and-conquer strategy for patch generation
- The language model processes aggregated patch embeddings, then a diffusion Transformer subsequently generates the next patch
- At inference, the noise-introducing time point in the reverse diffusion ODE is used as the temperat..
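A minimal sketch of the divide-and-conquer idea: an AR language model runs over aggregated patch embeddings, and a small diffusion-style decoder generates the next patch's frames from the LM state, with temperature read as the time at which noise enters the reverse trajectory. Patch size, mean-style aggregation via a linear layer, and the toy reverse-ODE update are illustrative assumptions.

```python
import torch
import torch.nn as nn

PATCH, DIM, MEL = 4, 256, 80   # frames per patch, model width, mel bins

class PatchLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.patch_in = nn.Linear(PATCH * MEL, DIM)   # aggregate a patch into one embedding
        layer = nn.TransformerEncoderLayer(DIM, nhead=4, batch_first=True)
        self.lm = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, patches):   # patches: (batch, n_patches, PATCH * MEL)
        h = self.patch_in(patches)
        mask = nn.Transformer.generate_square_subsequent_mask(h.size(1))
        return self.lm(h, mask=mask)[:, -1]           # state summarizing the history

class PatchDenoiser(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Linear(PATCH * MEL + DIM + 1, PATCH * MEL)

    def forward(self, x, t, state):
        return self.net(torch.cat([x, state, t.expand(x.size(0), 1)], dim=-1))

@torch.no_grad()
def generate_next_patch(lm, denoiser, history, temperature=1.0, steps=10):
    state = lm(history)
    # temperature controls where noise enters the reverse trajectory:
    # start integrating from t = temperature instead of t = 1
    x = temperature * torch.randn(history.size(0), PATCH * MEL)
    for t in torch.linspace(temperature, 0.0, steps):
        x = x - denoiser(x, t.view(1), state) / steps   # toy reverse-ODE step
    return x.view(-1, PATCH, MEL)

history = torch.randn(2, 6, PATCH * MEL)   # six previous patches
patch = generate_next_patch(PatchLM(), PatchDenoiser(), history, temperature=0.8)
print(patch.shape)   # torch.Size([2, 4, 80])
```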
BEATs: Audio Pre-Training with Acoustic Tokenizers
A Self-Supervised Learning framework for general audio representation pre-training is needed.
BEATs
- Uses a discrete label prediction task over labels obtained from a semantic-rich acoustic tokenizer
- Builds an iterative pipeline over the tokenizer and the pre-trained model
Paper (ICML 2023): Paper Link
1. Introduction
Speech Self-Supervised Learning (SSL) models such as Wav2Vec 2.0, HuBERT, WavLM, and Data2Vec..
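The training objective above can be sketched as masked prediction of tokenizer-assigned discrete labels. A frozen random codebook stands in for the semantic-rich tokenizer, and the iterative tokenizer/model refresh is only noted in a comment; all sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FrozenTokenizer(nn.Module):
    """Assigns each frame the index of its nearest codebook entry."""
    def __init__(self, dim=128, codebook_size=512):
        super().__init__()
        self.register_buffer("codebook", torch.randn(codebook_size, dim))

    def forward(self, feats):   # feats: (batch, frames, dim)
        return torch.cdist(feats, self.codebook.expand(feats.size(0), -1, -1)).argmin(-1)

class MaskedLabelPredictor(nn.Module):
    def __init__(self, dim=128, codebook_size=512):
        super().__init__()
        self.mask_embed = nn.Parameter(torch.zeros(dim))
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, codebook_size)

    def forward(self, feats, mask):   # mask: (batch, frames) bool
        x = torch.where(mask.unsqueeze(-1), self.mask_embed, feats)
        return self.head(self.encoder(x))

tokenizer, model = FrozenTokenizer(), MaskedLabelPredictor()
feats = torch.randn(2, 100, 128)
labels = tokenizer(feats)                           # discrete targets
mask = torch.rand(2, 100) < 0.5
logits = model(feats, mask)
loss = F.cross_entropy(logits[mask], labels[mask])  # predict labels at masked frames
# Iteration (per the summary): retrain the tokenizer to distill the improved
# model's representations, then repeat this masked label prediction step.
print(loss.item())
```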
TCSinger2: Customizable Multilingual Zero-Shot Singing Voice Synthesis
Existing singing voice synthesis lacks multi-level style control through diverse prompts.
TCSinger2
- Predicts durations with a Blurred Boundary Content Encoder and extends content embeddings to support smooth transitions
- Extracts aligned representations from singing, speech, and textual prompts with a Custom Audio Encoder
- Additionally leverages a Flow-based Custom Encoder for style modelin..
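A minimal sketch of the "blurred boundary" idea: after expanding content embeddings by predicted durations, boundaries between adjacent units are smoothed so transitions are gradual rather than hard. The duration values and the simple moving-average blur are illustrative assumptions, not TCSinger2's actual encoder.

```python
import torch
import torch.nn.functional as F

def expand_with_blurred_boundaries(content, durations, blur_kernel=5):
    """content: (units, dim); durations: (units,) predicted frames per unit."""
    frames = content.repeat_interleave(durations, dim=0)   # hard length regulation
    # blur along time with a moving average so unit boundaries transition smoothly
    x = frames.t().unsqueeze(0)                            # (1, dim, frames)
    kernel = torch.ones(x.size(1), 1, blur_kernel) / blur_kernel
    x = F.conv1d(F.pad(x, (blur_kernel // 2, blur_kernel // 2), mode="replicate"),
                 kernel, groups=x.size(1))
    return x.squeeze(0).t()                                # (frames, dim)

content = torch.randn(4, 8)                # four content units (e.g. phonemes)
durations = torch.tensor([3, 5, 2, 4])     # predicted frame counts per unit
frames = expand_with_blurred_boundaries(content, durations)
print(frames.shape)                        # torch.Size([14, 8])
```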
