
InstantSpeech: Instant Synchronous Text-to-Speech Synthesis for LLM-driven Voice Chatbots
A text-to-speech model paired with a Large Language Model cannot start synthesis until the entire sentence has been generated, which increases response latency.
InstantSpeech
- Uses a fully-parallel architecture that combines a causal Transformer-based acoustic model with a causal convolution-based vocoder
- Applies knowledge distillation to improve speech quality within a limited lookahead
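A minimal sketch of the "limited lookahead" idea behind streaming synthesis: a 1D convolution whose receptive field extends only a few frames into the future, so generation can begin before the full sentence exists. The class and parameter names are hypothetical, not taken from the paper.

```python
import torch
import torch.nn as nn

class LimitedLookaheadConv1d(nn.Module):
    """1D convolution that sees at most `lookahead` future frames.
    Illustrative sketch of limited-lookahead causal processing."""

    def __init__(self, channels: int, kernel_size: int, lookahead: int):
        super().__init__()
        assert 0 <= lookahead < kernel_size
        # Pad asymmetrically: most context comes from the past,
        # at most `lookahead` frames from the future.
        self.left_pad = kernel_size - 1 - lookahead
        self.right_pad = lookahead
        self.conv = nn.Conv1d(channels, channels, kernel_size)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time)
        x = nn.functional.pad(x, (self.left_pad, self.right_pad))
        return self.conv(x)

if __name__ == "__main__":
    layer = LimitedLookaheadConv1d(channels=80, kernel_size=5, lookahead=1)
    mel = torch.randn(2, 80, 120)   # dummy acoustic features
    print(layer(mel).shape)         # time length preserved: (2, 80, 120)
```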

CASC-XVC: Zero-Shot Cross-Lingual Voice Conversion with Content Accordant and Speaker Contrastive Losses
Cross-Lingual Voice Conversion is limited by language mismatch and train-test inconsistency.
CASC-XVC
- Incorporates a content accordant loss and a speaker contrastive loss, and introduces a shared self-supervised learning representation with information perturbation for content disentanglement
- Uses utterance pairs in different languages ...
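To make the "speaker contrastive loss" concrete, here is a generic InfoNCE-style formulation that pulls embeddings of the same speaker together and pushes other speakers away. This is an assumption about the general shape of such a loss, not the paper's exact formulation; the function name and temperature are illustrative.

```python
import torch
import torch.nn.functional as F

def speaker_contrastive_loss(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style contrastive loss over speaker embeddings.
    anchor, positive: (batch, dim) embeddings of the same speaker;
    negatives: (batch, n_neg, dim) embeddings of other speakers."""
    anchor = F.normalize(anchor, dim=-1)
    positive = F.normalize(positive, dim=-1)
    negatives = F.normalize(negatives, dim=-1)

    pos_logit = (anchor * positive).sum(-1, keepdim=True) / temperature        # (batch, 1)
    neg_logits = torch.einsum("bd,bnd->bn", anchor, negatives) / temperature   # (batch, n_neg)
    logits = torch.cat([pos_logit, neg_logits], dim=1)
    labels = torch.zeros(anchor.size(0), dtype=torch.long)  # positive sits at index 0
    return F.cross_entropy(logits, labels)

if __name__ == "__main__":
    b, d, n = 4, 192, 8
    loss = speaker_contrastive_loss(torch.randn(b, d), torch.randn(b, d), torch.randn(b, n, d))
    print(loss.item())
```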

ContentVec: An Improved Self-Supervised Speech Representation by Disentangling Speakers
A speech representation should be able to disentangle unwanted variation.
ContentVec
- Performs speaker disentanglement without loss of content
- Introduces a disentangling method on top of HuBERT that regularizes both the teacher and the student
Paper (ICML 2022) : Paper Link
1. Introduction
Speech Self-Supervised Learning (SSL) models such as HuBERT are trained on large-scale unannotated corpora ...
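One way to picture student-side disentanglement is a consistency regularizer between features of an utterance and a speaker/pitch-perturbed copy of it: content is shared across the two views, so penalizing disagreement pushes speaker information out. This is a simplified sketch of that idea only; ContentVec's full objective combines several terms and is not reproduced here.

```python
import torch
import torch.nn.functional as F

def speaker_invariance_loss(feats_orig, feats_perturbed):
    """Frame-wise consistency between SSL features of a clean utterance and a
    speaker/pitch-perturbed copy of the same utterance (both (batch, time, dim))."""
    feats_orig = F.normalize(feats_orig, dim=-1)
    feats_perturbed = F.normalize(feats_perturbed, dim=-1)
    # Maximize frame-wise cosine similarity between the two views.
    return 1.0 - (feats_orig * feats_perturbed).sum(-1).mean()

if __name__ == "__main__":
    t, d = 200, 768
    # Stand-ins for encoder outputs of the clean and perturbed waveforms.
    loss = speaker_invariance_loss(torch.randn(1, t, d), torch.randn(1, t, d))
    print(loss.item())
```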

Multi-Resolution HuBERT: Multi-Resolution Speech Self-Supervised Learning with Masked Unit Prediction
Existing self-supervised learning models process the speech signal at a fixed 20 ms resolution, overlooking informational content at different resolutions.
Multi-Resolution HuBERT
- Incorporates multi-resolution information into speech self-supervised learning
- Uses a hierarchical Transformer that improves on the HuBERT-style masked prediction objective ...
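A minimal sketch of what prediction at multiple frame rates could look like: a 40 ms branch obtained by pooling 20 ms features, with a separate unit-prediction head per resolution. The module and dimensions are hypothetical and do not reproduce the paper's hierarchical Transformer.

```python
import torch
import torch.nn as nn

class MultiResolutionMaskedPrediction(nn.Module):
    """Unit-prediction heads at two frame rates (20 ms and pooled 40 ms)."""

    def __init__(self, dim: int, units_20ms: int, units_40ms: int):
        super().__init__()
        self.head_20ms = nn.Linear(dim, units_20ms)
        self.pool = nn.AvgPool1d(kernel_size=2, stride=2)
        self.head_40ms = nn.Linear(dim, units_40ms)

    def forward(self, feats: torch.Tensor):
        # feats: (batch, time, dim) encoder output at 20 ms resolution
        logits_20 = self.head_20ms(feats)
        feats_40 = self.pool(feats.transpose(1, 2)).transpose(1, 2)  # halve the frame rate
        logits_40 = self.head_40ms(feats_40)
        return logits_20, logits_40

if __name__ == "__main__":
    model = MultiResolutionMaskedPrediction(dim=768, units_20ms=500, units_40ms=500)
    l20, l40 = model(torch.randn(2, 100, 768))
    print(l20.shape, l40.shape)  # (2, 100, 500) (2, 50, 500)
```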

Everyone-Can-Sing: Zero-Shot Singing Voice Synthesis and Conversion with Speech Reference
A unified framework is needed that can support cross-domain singing voice synthesis.
Everyone-Can-Sing
- Supports control over multiple aspects: language content from the lyrics, performance attributes from the musical score, singing style, and vocal technique
- Uses a pre-trained content embedding and a diffusion-based generator
Paper (ICASSP 2025) : Paper Link
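As a rough illustration of multi-aspect control, the sketch below fuses a content embedding, frame-level score features, and an utterance-level style/technique embedding into one conditioning sequence that a diffusion-based generator could consume. All dimensions and the concatenation-based fusion are assumptions for illustration, not the paper's design.

```python
import torch
import torch.nn as nn

class MultiAspectCondition(nn.Module):
    """Fuses content, score, and style conditions into one sequence."""

    def __init__(self, content_dim=256, score_dim=64, style_dim=128, out_dim=256):
        super().__init__()
        self.proj = nn.Linear(content_dim + score_dim + style_dim, out_dim)

    def forward(self, content, score, style):
        # content: (batch, time, content_dim) from a pre-trained content encoder
        # score:   (batch, time, score_dim)   frame-level pitch/duration features
        # style:   (batch, style_dim)         utterance-level style/technique embedding
        style = style.unsqueeze(1).expand(-1, content.size(1), -1)
        return self.proj(torch.cat([content, score, style], dim=-1))

if __name__ == "__main__":
    cond = MultiAspectCondition()
    out = cond(torch.randn(2, 300, 256), torch.randn(2, 300, 64), torch.randn(2, 128))
    print(out.shape)  # (2, 300, 256)
```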

DPI-TTS: Directional Patch Interaction for Fast-Converging and Style Temporal Modeling in Text-to-Speech
Diffusion Transformer-based speech models treat the mel-spectrogram as a general image.
DPI-TTS
- Builds on the Diffusion Transformer and applies a low-to-high frequency, frame-by-frame progressive inference approach to improve naturalness
- Introduces fine-grained style temporal modeling to improve speaker style similarity
Paper (ICASSP 2025) : Paper Link
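One way to encode a low-to-high frequency, frame-by-frame direction over mel-spectrogram patches is an attention mask in which each patch may attend only to earlier frames, or to lower-frequency patches within the same frame. The function below is an illustrative sketch under that assumption, not the paper's exact interaction scheme.

```python
import torch

def directional_patch_mask(n_frames: int, n_freq_patches: int) -> torch.Tensor:
    """Boolean attention mask over patches on a (time, frequency) grid.
    Patch (t, f) may attend to patches at earlier frames, or to patches
    at the same frame with lower or equal frequency index."""
    t_idx = torch.arange(n_frames).repeat_interleave(n_freq_patches)
    f_idx = torch.arange(n_freq_patches).repeat(n_frames)
    earlier_frame = t_idx.unsqueeze(1) > t_idx.unsqueeze(0)
    same_frame_lower_freq = (t_idx.unsqueeze(1) == t_idx.unsqueeze(0)) & (
        f_idx.unsqueeze(1) >= f_idx.unsqueeze(0)
    )
    return earlier_frame | same_frame_lower_freq  # (n_patches, n_patches), True = may attend

if __name__ == "__main__":
    mask = directional_patch_mask(n_frames=4, n_freq_patches=3)
    print(mask.shape)   # (12, 12)
    print(mask.int())
```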