DMOSpeech2: Reinforcement Learning for Duration Prediction in Metric-Optimized Speech Synthesis
Optimizing the components of diffusion-based Text-to-Speech for perceptual metrics is difficult.
DMOSpeech2 applies Group Relative Preference Optimization, using speaker similarity and Word Error Rate as rewards.
It additionally improves output diversity through teacher-guided sampling.
Paper (AAAI 2026): Paper Link
1. Introduction
NaturalSpeech, StyleTTS2, and..

IndexTTS2: A Breakthrough in Emotionally Expressive and Duration-Controlled Auto-Regressive Zero-Shot Text-to-Speech
Large-scale autoregressive Text-to-Speech models generate token by token, which makes the duration of the synthesized speech difficult to control.
IndexTTS2 controls duration either by explicitly specifying the number of tokens or by generating freely in an autoregressive manner.
The disentanglement between emotional expression and speaker identity..

SACodec: Asymmetric Quantization with Semantic Anchoring for Low-Bitrate High-Fidelity Neural Speech Codecs
Neural speech codecs face a fundamental trade-off at low bitrates.
SACodec introduces an asymmetric dual quantizer built on a Semantic Anchoring mechanism.
It decouples the quantization of semantic and acoustic detail to ensure codebook utilization and the reconstruction of fine-grained information.
Paper (AAAI 2026): Paper Link
1. Introduction
Ne..

KALL-E: Autoregressive Speech Synthesis with Next-Distribution Prediction
An autoregressive language model can be used for Text-to-Speech.
KALL-E uses a Flow-VAE to extract continuous latent speech representations from the waveform.
A single AR Transformer then predicts the corresponding continuous speech distribution from text.
Paper (AAAI 2026): Paper Link
1. Introduction
As in VALL-E, a Large Language Model (LLM) can be used for Text-to-Speech (TTS)..

DegVoC: Revisiting Neural Vocoder from a Degradation Perspective
Existing neural vocoders face a performance-cost trade-off.
DegVoC treats the mel-spectrogram as the result of a signal degradation process applied to the target spectrum.
It uses the degradation prior to retrieve the initial spectral structure through a simple linear transformation, and introduces a deep prior solver that accounts for the heterogeneous distribution in the time-frequency domain.
Paper (AAAI 2026): Paper Link
1. Intro..

MELA-TTS: Joint Transformer-Diffusion Model with Representation Alignment for Speech Synthesis
A joint Transformer-Diffusion framework can be used for End-to-End Text-to-Speech.
MELA-TTS autoregressively generates continuous mel-spectrograms from linguistic and speaker conditions.
A representation alignment module that aligns the output representation of the Transformer decoder with the semantic embedding of a pre-trained ASR encoder..
