Cross-Lingual F5-TTS: Towards Language-Agnostic Voice Cloning and Speech Synthesis
Flow-matching-based text-to-speech models are difficult to apply to cross-lingual tasks.
Cross-Lingual F5-TTS:
- Pre-processes the audio prompt with forced alignment to obtain word boundaries, enabling direct synthesis from the audio prompt
- Introduces a speaking rate predictor with multiple levels of linguistic granularity for duration modeling
Paper (ICASSP 2026): Paper Link
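The forced-alignment pre-processing can be pictured with a small sketch. Assuming a word-level alignment given as (word, phone count, start, end) tuples, and a simple phones-per-second rate definition (both illustrative assumptions, not the paper's actual pipeline or predictor), word boundaries and a coarse speaking rate fall out directly:

```python
# Hypothetical sketch: word boundaries and a global speaking-rate feature
# derived from a forced alignment. Tuple format and rate definition are
# illustrative assumptions, not Cross-Lingual F5-TTS's exact formulation.

def word_boundaries(alignment):
    """alignment: list of (word, n_phones, start_sec, end_sec) tuples."""
    return [(start, end) for _, _, start, end in alignment]

def speaking_rate(alignment):
    """Phones per second over the aligned prompt (a coarse global rate)."""
    n_phones = sum(n for _, n, _, _ in alignment)
    duration = alignment[-1][3] - alignment[0][2]
    return n_phones / duration

align = [("hello", 4, 0.00, 0.45), ("world", 4, 0.50, 1.00)]
print(word_boundaries(align))   # [(0.0, 0.45), (0.5, 1.0)]
print(speaking_rate(align))     # 8 phones over 1.0 s -> 8.0
```

A finer-grained predictor would estimate such a rate per word or per phone rather than globally, which is the kind of multi-granularity signal the summary refers to.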
SUNAC: Source-Aware Unified Neural Audio Codec
Neural audio codecs encode multi-source mixtures in an entangled manner, which can make them unsuitable for downstream processing that needs access to a specific subset of sources.
SUNAC:
- Encodes individual sources from the mixture, conditioned on a source-type prompt
- Supports user-driven source selection and separate encoding through the source-aware codec
Paper (ICASSP 2026): Paper Link
VoXtream: Full-Stream Text-to-Speech with Extremely Low Latency
A real-time zero-shot streaming text-to-speech model is needed.
VoXtream:
- Directly maps incoming phonemes to audio tokens using a limited look-ahead
- Architecturally combines an incremental phoneme transformer, a temporal transformer, and a depth transformer
Paper (ICASSP 2026): Paper Link
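The limited look-ahead idea can be sketched as a streaming loop: audio for phoneme i is committed as soon as a fixed number of future phonemes have arrived, instead of waiting for the whole sentence. This is an illustrative control-flow sketch, not VoXtream's actual transformers; `step_fn` stands in for the model call:

```python
# Minimal sketch of full-stream synthesis with a limited phoneme look-ahead
# (illustrative, not VoXtream's code). step_fn(current_phoneme, context)
# stands in for one model step producing an audio-token chunk.

def stream_synthesize(phoneme_stream, lookahead, step_fn):
    """Yield one audio chunk per phoneme, committing as soon as
    `lookahead` future phonemes are buffered."""
    buf = []
    for ph in phoneme_stream:
        buf.append(ph)
        if len(buf) > lookahead:            # enough future context to commit
            cur, ctx = buf.pop(0), buf[:lookahead]
            yield step_fn(cur, ctx)
    while buf:                              # end of text: flush with
        cur = buf.pop(0)                    # shrinking look-ahead
        yield step_fn(cur, buf[:lookahead])

for chunk in stream_synthesize(iter("abcd"), lookahead=2,
                               step_fn=lambda cur, ctx: (cur, tuple(ctx))):
    print(chunk)
# ('a', ('b', 'c'))  ('b', ('c', 'd'))  ('c', ('d',))  ('d', ())
```

The key latency property is visible in the loop: the first chunk is emitted after only `lookahead` extra phonemes, not after the full utterance.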
MeanVoiceFlow: One-Step Nonparallel Voice Conversion with Mean Flows
In voice conversion, flow-matching models are limited by their iterative inference.
MeanVoiceFlow:
- Supports one-step non-parallel conversion based on Mean Flow, without pre-training or distillation
- Additionally introduces a structural margin reconstruction loss and a zero-input constraint to regularize the model's input-output behavior
Paper (ICASSP 2026): Paper Link
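The one-step property comes from the Mean Flow idea: instead of the instantaneous velocity $v(z_t, t)$ that an ODE solver must integrate over many steps, the network learns an average velocity over an interval, so a single evaluation spans the whole trajectory. A sketch of the standard Mean Flow formulation from the general literature (the paper's exact conditioning and conventions may differ):

```latex
% Average velocity over [r, t] (Mean Flow definition)
u(z_t, r, t) = \frac{1}{t - r} \int_{r}^{t} v(z_s, s)\, \mathrm{d}s

% One-step sampling: with z_1 drawn from noise, a single network
% call traverses the full interval [0, 1]
z_0 = z_1 - u(z_1, 0, 1)
```

Since $u$ is trained directly (rather than distilled from a pre-trained flow-matching teacher), this is consistent with the summary's claim of one-step conversion without pre-training or distillation.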
CodecSlime: Temporal Redundancy Compression of Neural Speech Codec via Dynamic Frame Rate
Most neural codecs operate at a fixed frame rate, which creates a temporal mismatch with the signal.
CodecSlime:
- Compresses temporal redundancy in neural codecs using a schedulable dynamic frame rate
- Introduces Melt-and-Cool training to improve adaptation
Paper (ICASSP 2026): Paper Link
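What "temporal redundancy compression" means can be illustrated with a toy frame-merging pass: adjacent feature frames that are nearly identical are collapsed into one, yielding a variable (dynamic) frame rate. The cosine-similarity test and running average below are assumptions for illustration, not CodecSlime's Melt-and-Cool procedure:

```python
# Illustrative sketch of dynamic-frame-rate compression by merging
# near-duplicate adjacent frames (not CodecSlime's actual method).

def merge_redundant_frames(frames, threshold=0.99):
    """frames: list of feature vectors (lists of floats).
    Returns (merged frames, how many originals each merged frame spans)."""
    def cos(a, b):
        num = sum(x * y for x, y in zip(a, b))
        den = (sum(x * x for x in a) ** 0.5) * (sum(y * y for y in b) ** 0.5)
        return num / den if den else 0.0

    merged, counts = [list(frames[0])], [1]
    for f in frames[1:]:
        if cos(merged[-1], f) >= threshold:
            n = counts[-1]                   # running average keeps the
            merged[-1] = [(m * n + x) / (n + 1)   # merged frame representative
                          for m, x in zip(merged[-1], f)]
            counts[-1] = n + 1
        else:
            merged.append(list(f))
            counts.append(1)
    return merged, counts

frames = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
print(merge_redundant_frames(frames))  # ([[1.0, 0.0], [0.0, 1.0]], [2, 1])
```

Steady segments (long vowels, silence) compress heavily while transients keep their frames, which is exactly the mismatch a fixed frame rate cannot exploit.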
DAIEN-TTS: Disentangled Audio Infilling for Environment-Aware Text-to-Speech Synthesis
A model for environment-aware text-to-speech is needed.
DAIEN-TTS:
- Uses a pre-trained speech-environment separation module to extract mel-spectrograms of clean speech and environment audio, and applies a random span mask to each mel-spectrogram to support the infilling process
- Applies dual classifier-free guidance to the speech and environment components for controllability
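Dual classifier-free guidance means two independent guidance scales, one for the speech condition and one for the environment condition. The combination rule below mirrors standard multi-condition CFG and is an assumption, not necessarily DAIEN-TTS's exact formula:

```python
# Hedged sketch of dual classifier-free guidance: the unconditional output is
# pushed independently toward the speech-conditioned and the environment-
# conditioned outputs, each with its own scale (standard multi-condition CFG).

def dual_cfg(out_uncond, out_speech, out_env, w_speech, w_env):
    return [u + w_speech * (s - u) + w_env * (e - u)
            for u, s, e in zip(out_uncond, out_speech, out_env)]

out = dual_cfg([0.0], [1.0], [2.0], w_speech=2.0, w_env=0.5)
print(out)  # 0 + 2*(1 - 0) + 0.5*(2 - 0) = [3.0]
```

Because the two scales are separate, speech fidelity and environment strength can be traded off independently at inference time, which is where the controllability comes from.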
