UniSpeech-SAT: Universal Speech Representation Learning with Speaker Aware Pre-Training
Self-supervised learning can be used to model speaker characteristics.
UniSpeech-SAT
- Introduces multi-task learning that integrates an utterance-wise contrastive loss with the self-supervised learning objective
- Performs data augmentation based on an utterance mixing strategy
Paper (ICASSP 2022): Paper Link
1. Introduction
Self-Supervised Learning (SSL..

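The utterance mixing augmentation named in the UniSpeech-SAT summary can be illustrated roughly as follows. This is a minimal sketch, not the paper's implementation: the helper name `mix_utterances`, the `max_ratio` and `gain` parameters, and the rule of overlaying one random chunk are all assumptions made for illustration, since the excerpt does not give those details.

```python
import random


def mix_utterances(primary, secondary, rng, max_ratio=0.5, gain=0.5):
    """Overlay a random chunk of `secondary` onto `primary`.

    Hypothetical helper: `max_ratio` caps the chunk length relative to the
    primary utterance and `gain` scales the interfering speech. Both values
    are assumptions, not settings taken from the paper.
    """
    max_len = min(len(secondary), int(max_ratio * len(primary)))
    if max_len < 1:
        # Nothing to mix in; return an unchanged copy.
        return list(primary)

    # Pick the chunk length and where it is read from and written to.
    chunk_len = rng.randint(1, max_len)
    src = rng.randint(0, len(secondary) - chunk_len)
    dst = rng.randint(0, len(primary) - chunk_len)

    mixed = list(primary)
    for i in range(chunk_len):
        mixed[dst + i] += gain * secondary[src + i]
    return mixed


rng = random.Random(0)
primary = [0.0] * 100      # stand-in for the main speaker's waveform
secondary = [1.0] * 40     # stand-in for another utterance in the batch
mixed = mix_utterances(primary, secondary, rng)
```

The mixed signal keeps the length of the primary utterance, so frame-level targets derived from the primary speaker still line up with it.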
UniSpeech: Unified Speech Representation Learning with Labeled and Unlabeled Data
Both labeled and unlabeled data can be used for speech representation learning.
UniSpeech
- Uses supervised phonetic CTC learning together with phonetically-aware contrastive self-supervised learning
- The resulting representation captures phonetic structure, which improves generalization across languages and domains
Paper (ICML 2021): Paper Link
1. Introduction
Automa..

DQ-Data2Vec: Decoupling Quantization for Multilingual Speech Recognition
Masked representation generation in Data2Vec depends on multi-layer averaging.
DQ-Data2Vec
- Uses a $K$-means quantizer to decouple language and phoneme information for masked prediction
- In particular, applies quantization to both the shallow and middle layers to explicitly decouple irrelevant features
Paper (TASLP 2025): Paper Link
1. Introduction
XLSR and other Self-Supervise..

ParaStyleTTS: Toward Efficient and Robust Paralinguistic Style Control for Expressive Text-to-Speech Generation
Speaking style control in text-to-speech systems is still limited.
ParaStyleTTS
- Introduces a two-level style adaptation architecture that separates prosodic and paralinguistic speech style modeling
- Additionally maintains a consistent style under low-resource deployment and across varied prompt formulations
Paper (CIKM 2025): Paper Link
1. Introduc..

Variable Bitrate Residual Vector Quantization for Audio Coding
Neural audio codecs are suboptimal in terms of the rate-distortion trade-off.
VRVQ
- Supports efficient coding by adapting the number of codebooks used per frame
- Introduces a gradient estimation method for the non-differentiable masking operation that transforms the importance map into a binary importance mask
Paper (ICASSP 2025): Paper Link
1. Introduction
Recently, SoundStream, EnCodec, DAC, and other Residual Ve..

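To make the per-frame codebook selection in the VRVQ summary concrete, here is a minimal sketch of turning an importance map into a binary importance mask. The thresholding rule (codebook `k` is kept when `k < importance * num_codebooks`) and the function names are assumptions for illustration; the excerpt does not specify the exact mapping. The sketch covers only the forward masking step. Because a hard threshold like this has zero gradient almost everywhere, training through it needs a gradient estimator such as the one the paper introduces, which is not reproduced here.

```python
def importance_to_mask(importance, num_codebooks):
    """Map per-frame importance values in [0, 1] to binary codebook masks.

    Hypothetical rule: frame t keeps codebook k (0-indexed) when
    k < importance[t] * num_codebooks, so a higher importance value keeps
    more residual codebooks for that frame.
    """
    mask = []
    for p in importance:
        # Clamp to [0, 1], then scale to the number of codebooks.
        level = min(max(p, 0.0), 1.0) * num_codebooks
        mask.append([1 if k < level else 0 for k in range(num_codebooks)])
    return mask


def codebooks_per_frame(mask):
    """Count how many codebooks each frame actually uses."""
    return [sum(row) for row in mask]


mask = importance_to_mask([0.0, 0.5, 1.0, 0.3], num_codebooks=4)
counts = codebooks_per_frame(mask)
```

Summing `counts` over all frames gives a quantity proportional to the bitrate, which is what makes the coding variable-rate under this assumed rule.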
DMOSpeech: Direct Metric Optimization via Distilled Diffusion Model in Zero-Shot Speech Synthesis
Diffusion models are computationally intensive because of their iterative denoising process.
DMOSpeech
- Uses a distilled diffusion-based model to achieve faster inference than the teacher
- Supports end-to-end optimization of Connectionist Temporal Classification and Speaker Verification losses
Paper (ICML 2025): Paper Link
1. Introduction
SpeechX, MaskGC..
