GST-BERT-TTS: Prosody Prediction without Accentual Labels for Multi-Speaker TTS using BERT with Global Style Tokens
- In Text-to-Speech, prosody prediction for pitch-accent languages is important
- GST-BERT-TTS
  - Integrates the speaker-specific style embedding from Global Style Tokens into BERT's token embeddings (see the sketch below)
  - Predicts speaker-aware fundamental frequency even in an accent label-free setting and extends $f_{0}$-BERT for speech expressiveness…
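
A minimal sketch of the fusion step described above, assuming a PyTorch setting: a bank of learnable style tokens is attended with a reference/speaker embedding, and the resulting style embedding is added to BERT's token embeddings before a per-token $f_{0}$ head. The class names (`StyleTokenLayer`, `GSTBertF0Predictor`) and dimensions are illustrative, not the paper's configuration.

```python
# Illustrative sketch: GST-style speaker embedding added to BERT token embeddings
# before an f0 prediction head. Names and sizes are assumptions, not the paper's.
import torch
import torch.nn as nn

class StyleTokenLayer(nn.Module):
    """Attend over a bank of learnable style tokens with a reference/speaker embedding."""
    def __init__(self, d_model: int = 768, num_tokens: int = 10, num_heads: int = 4):
        super().__init__()
        self.tokens = nn.Parameter(torch.randn(num_tokens, d_model) * 0.02)
        self.attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)

    def forward(self, ref_embed: torch.Tensor) -> torch.Tensor:
        # ref_embed: (B, d_model) reference embedding used as the attention query
        query = ref_embed.unsqueeze(1)                                  # (B, 1, d_model)
        keys = self.tokens.unsqueeze(0).expand(ref_embed.size(0), -1, -1)
        style, _ = self.attn(query, keys, keys)                         # (B, 1, d_model)
        return style                                                    # speaker-specific style embedding

class GSTBertF0Predictor(nn.Module):
    """Add the GST style embedding to BERT token embeddings, then predict f0 per token."""
    def __init__(self, d_model: int = 768):
        super().__init__()
        self.gst = StyleTokenLayer(d_model)
        self.f0_head = nn.Sequential(nn.Linear(d_model, d_model), nn.ReLU(), nn.Linear(d_model, 1))

    def forward(self, bert_tokens: torch.Tensor, ref_embed: torch.Tensor) -> torch.Tensor:
        # bert_tokens: (B, T, d_model) contextual token embeddings from BERT
        fused = bert_tokens + self.gst(ref_embed)                       # broadcast style over all tokens
        return self.f0_head(fused).squeeze(-1)                          # (B, T) predicted f0 trajectory

bert_tokens = torch.randn(2, 12, 768)   # stand-in for BERT outputs
ref_embed = torch.randn(2, 768)         # stand-in for a speaker/reference embedding
print(GSTBertF0Predictor()(bert_tokens, ref_embed).shape)               # torch.Size([2, 12])
```
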
DinoSR: Self-Distillation and Online Clustering for Self-Supervised Speech Representation Learning
- A strong representation learning model for speech is needed
- DinoSR
  - Combines masked language modeling, self-distillation, and online clustering (sketched below)
  - Uses a teacher network to extract contextualized embeddings from the input audio, applies online clustering to those embeddings, and guides the student network with the resulting discretized tokens
- Paper (NeurIPS 2023): Paper Link
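
The sketch below illustrates the teacher/online-clustering/student loop in simplified form: the teacher (conceptually an EMA copy of the student) embeds the unmasked input, an EMA-updated codebook assigns each masked frame a cluster id, and the student is trained to predict those ids from the masked input. `OnlineKMeans`, `dinosr_step`, and the linear stand-in encoders are illustrative; DinoSR itself clusters several Transformer layers with per-layer codebooks.

```python
# Simplified teacher -> online clustering -> student loop; stand-in encoders only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class OnlineKMeans(nn.Module):
    """Codebook updated by an exponential moving average of assigned teacher embeddings."""
    def __init__(self, num_codes: int = 256, dim: int = 128, decay: float = 0.99):
        super().__init__()
        self.register_buffer("codebook", torch.randn(num_codes, dim))
        self.decay = decay

    @torch.no_grad()
    def assign_and_update(self, z: torch.Tensor) -> torch.Tensor:
        # z: (N, dim) teacher embeddings at masked positions -> cluster ids (N,)
        ids = torch.cdist(z, self.codebook).argmin(dim=-1)
        for c in ids.unique():
            mean = z[ids == c].mean(dim=0)
            self.codebook[c] = self.decay * self.codebook[c] + (1 - self.decay) * mean
        return ids

def dinosr_step(student, teacher, head, clusterer, feats, mask):
    # the teacher (an EMA copy of the student in DinoSR) sees the unmasked input
    with torch.no_grad():
        targets = clusterer.assign_and_update(teacher(feats)[mask])
    # the student sees the masked input and predicts the teacher-side cluster ids
    logits = head(student(feats * (~mask).unsqueeze(-1)))[mask]
    return F.cross_entropy(logits, targets)

dim, codes = 128, 256
student, teacher = nn.Linear(dim, dim), nn.Linear(dim, dim)   # stand-in encoders
head = nn.Linear(dim, codes)
clusterer = OnlineKMeans(codes, dim)
feats = torch.randn(2, 50, dim)
mask = torch.rand(2, 50) < 0.5
print(dinosr_step(student, teacher, head, clusterer, feats, mask))
```
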
Beyond Hard Sharing: Efficient Multi-Task Speech-to-Text Modeling with Supervised Mixture of Experts
- Hard parameter sharing degrades model performance due to task interference
- S-MoE
  - Eliminates the gating function by using special guiding tokens that route each task to its designated expert (see the sketch below)
  - Applies S-MoE to a Speech-to-Text model to handle mixed-bandwidth input
- Paper (INTERSPEECH 2025): Paper Link
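
A minimal sketch of gate-free routing, assuming each utterance in a batch carries a task id (the guiding token): the task id directly indexes the designated expert, so no gating network is trained. `SMoELayer` and the task names are illustrative assumptions.

```python
# Illustrative sketch: experts selected by a task id instead of a learned gate.
import torch
import torch.nn as nn

class SMoELayer(nn.Module):
    """Feed-forward experts selected by a task id (guiding token); no gating function."""
    def __init__(self, d_model: int = 256, num_tasks: int = 2):
        super().__init__()
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model))
            for _ in range(num_tasks)
        ])

    def forward(self, x: torch.Tensor, task_id: int) -> torch.Tensor:
        # x: (B, T, d_model); all tokens of this task go to its designated expert
        return x + self.experts[task_id](x)

# e.g. task 0 = ASR on narrowband input, task 1 = speech translation on wideband input
TASKS = {"asr_8khz": 0, "ast_16khz": 1}
layer = SMoELayer()
x = torch.randn(4, 100, 256)
print(layer(x, TASKS["asr_8khz"]).shape)   # torch.Size([4, 100, 256])
```
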
SPCodec: Split and Prediction for Neural Speech Codec
- Existing neural codecs do not fully exploit the correlation between different frequency bands
- SPCodec
  - Introduces a group residual vector quantization module built on a latent split-and-prediction scheme (sketched below)
  - Disentangles low-/high-frequency representations to reduce feature redundancy
- Paper (INTERSPEECH 2025): Paper Link
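
A minimal sketch of the split-and-prediction idea: the latent is split into a low- and a high-frequency group, the low group is quantized, the high group is predicted from the quantized low group, and only the prediction residual is quantized. The single-codebook `SimpleVQ` stands in for the paper's group residual vector quantization, and all names and sizes are assumptions.

```python
# Illustrative split-and-prediction quantizer; single-codebook VQ as an RVQ stand-in.
import torch
import torch.nn as nn

class SimpleVQ(nn.Module):
    def __init__(self, dim: int, num_codes: int = 512):
        super().__init__()
        self.codebook = nn.Parameter(torch.randn(num_codes, dim) * 0.05)

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        # z: (B, T, dim) -> nearest codeword, passed through a straight-through estimator
        flat = z.reshape(-1, z.size(-1))
        ids = torch.cdist(flat, self.codebook).argmin(dim=-1)
        q = self.codebook[ids].view_as(z)
        return z + (q - z).detach()

class SplitPredictQuantizer(nn.Module):
    def __init__(self, dim: int = 128):
        super().__init__()
        half = dim // 2
        self.vq_low, self.vq_high = SimpleVQ(half), SimpleVQ(half)
        self.predictor = nn.Sequential(nn.Linear(half, half), nn.GELU(), nn.Linear(half, half))

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        z_low, z_high = z.chunk(2, dim=-1)       # split the latent into two groups
        q_low = self.vq_low(z_low)               # quantize the low-frequency group
        pred_high = self.predictor(q_low)        # predict the high group from the low group
        residual = z_high - pred_high            # only the unpredictable part is quantized
        q_high = pred_high + self.vq_high(residual)
        return torch.cat([q_low, q_high], dim=-1)

z = torch.randn(2, 75, 128)                      # stand-in encoder latents
print(SplitPredictQuantizer()(z).shape)          # torch.Size([2, 75, 128])
```
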
DiceHuBERT: Distilling HuBERT with a Self-Supervised Learning Objective
- A compact Self-Supervised Learning-based speech foundation model is needed
- DiceHuBERT
  - Leverages HuBERT's iterative self-distillation mechanism to directly replace the original model with a student model (see the sketch below)
  - Uses the same objective as HuBERT pre-training, eliminating additional modules and architectural constraints
- Paper (INTERSPEECH 2025): Paper Link
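
A minimal sketch of distilling with the HuBERT objective itself: pseudo-labels come from clustering the frozen teacher's features (a k-means stand-in here), and the student is trained with ordinary masked prediction of those cluster ids, without feature-matching modules or architectural constraints. Function names (`make_targets`, `masked_prediction_loss`) and the linear stand-in encoders are illustrative.

```python
# Illustrative sketch: HuBERT-style masked cluster prediction used as the distillation loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

@torch.no_grad()
def make_targets(teacher: nn.Module, feats: torch.Tensor, centroids: torch.Tensor) -> torch.Tensor:
    """Pseudo-labels = nearest k-means centroid of the frozen teacher's features."""
    rep = teacher(feats)                                              # (B, T, D)
    flat = rep.reshape(-1, rep.size(-1))
    return torch.cdist(flat, centroids).argmin(-1).view(rep.shape[:2])  # (B, T)

def masked_prediction_loss(student, head, feats, targets, mask):
    """Same objective as HuBERT pre-training: predict cluster ids at masked frames."""
    masked_in = feats * (~mask).unsqueeze(-1)                         # zero out masked frames
    logits = head(student(masked_in))                                 # (B, T, num_clusters)
    return F.cross_entropy(logits[mask], targets[mask])

D, K = 128, 100
teacher = nn.Linear(D, D)     # stand-in for the frozen HuBERT teacher
student = nn.Linear(D, D)     # stand-in for the compact student
head = nn.Linear(D, K)
centroids = torch.randn(K, D)
feats = torch.randn(2, 60, D)
mask = torch.rand(2, 60) < 0.65
targets = make_targets(teacher, feats, centroids)
print(masked_prediction_loss(student, head, feats, targets, mask))
```
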
STaR: Distilling Speech Temporal Relation for Lightweight Speech Self-Supervised Learning Models
- Transformer-based Speech Self-Supervised Learning models have large parameter sizes and computational costs
- STaR
  - Compresses Speech Self-Supervised Learning models by distilling the speech temporal relation (sketched below)
  - In particular, transfers the temporal relation between speech frames to obtain a lightweight student
- Paper (ICASSP 2024): Paper Link
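
A minimal sketch of temporal-relation distillation: rather than matching hidden features directly, each model's frame-to-frame relation matrix is matched, which is independent of hidden size, so the lightweight student needs no projection layer. The cosine Gram matrix used here is one plausible instantiation of "temporal relation", not necessarily the paper's exact definition.

```python
# Illustrative sketch: match T x T frame-relation matrices instead of raw features.
import torch
import torch.nn.functional as F

def temporal_relation(x: torch.Tensor) -> torch.Tensor:
    # x: (B, T, D) frame representations -> (B, T, T) cosine similarity across frames
    x = F.normalize(x, dim=-1)
    return x @ x.transpose(1, 2)

def star_distill_loss(teacher_rep: torch.Tensor, student_rep: torch.Tensor) -> torch.Tensor:
    # hidden sizes may differ; only the T x T relation matrices are compared
    return F.mse_loss(temporal_relation(student_rep), temporal_relation(teacher_rep))

teacher_rep = torch.randn(2, 80, 768)   # stand-in for a large teacher's layer output
student_rep = torch.randn(2, 80, 256)   # lightweight student with a smaller hidden size
print(star_distill_loss(teacher_rep, student_rep))
```
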
Recycle-and-Distill: Universal Compression Strategy for Transformer-based Speech SSL Models with Attention Map Reusing and Masking Distillation
- Speech Self-Supervised Learning models such as HuBERT have a substantial number of parameters
- ARMHuBERT
  - Compresses the model by reusing attention maps across Transformer layers (see the sketch below)
  - Introduces a masking distillation strategy to improve the student model's representation quality
- Paper (INTERSPEECH 2023): Paper Link
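
A minimal sketch of attention-map reuse: a layer either computes its own attention probabilities or applies a map handed down from an earlier layer, so reusing layers can skip the query/key computation. `ReusableAttention` and the even/odd reuse pattern are illustrative assumptions, and the masking-distillation objective is omitted.

```python
# Illustrative sketch: attention layers that can reuse an earlier layer's attention map.
from typing import Optional
import torch
import torch.nn as nn
import torch.nn.functional as F

class ReusableAttention(nn.Module):
    def __init__(self, d_model: int = 256):
        super().__init__()
        self.q, self.k = nn.Linear(d_model, d_model), nn.Linear(d_model, d_model)
        self.v, self.out = nn.Linear(d_model, d_model), nn.Linear(d_model, d_model)
        self.scale = d_model ** -0.5

    def forward(self, x: torch.Tensor, reused_attn: Optional[torch.Tensor] = None):
        # x: (B, T, d_model); reused_attn: (B, T, T) attention map from an earlier layer
        if reused_attn is None:
            attn = F.softmax(self.q(x) @ self.k(x).transpose(1, 2) * self.scale, dim=-1)
        else:
            attn = reused_attn                  # reuse: skip the query/key computation
        return self.out(attn @ self.v(x)), attn

layers = nn.ModuleList([ReusableAttention() for _ in range(4)])
x = torch.randn(2, 50, 256)
attn = None
for i, layer in enumerate(layers):
    # e.g. recompute the map on even layers and reuse it on odd layers
    x, attn = layer(x, reused_attn=attn if i % 2 == 1 else None)
print(x.shape)   # torch.Size([2, 50, 256])
```
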
EATS-Speech: Emotion-Adaptive Transformation and Priority Synthesis for Zero-Shot Text-to-Speech
- Existing zero-shot Text-to-Speech does not reflect emotion effectively
- EATS-Speech
  - Uses a parallel pipeline that decomposes speech into non-emotion style, emotion, and content (sketched below)
  - Learns a text-emotion mapping from reference speech through an LLM-based converter
- Paper (INTERSPEECH 2025): Paper Link
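
A heavily simplified sketch of the decomposition stage only, assuming mel-spectrogram input: three parallel encoders factor a reference utterance into non-emotion style, emotion, and content representations. All module names and sizes are illustrative, and the LLM-based text-emotion converter is not modeled here.

```python
# Illustrative sketch: parallel encoders for non-emotion style, emotion, and content.
import torch
import torch.nn as nn

class ParallelDecomposer(nn.Module):
    def __init__(self, n_mels: int = 80, d: int = 256):
        super().__init__()
        def utt_encoder():   # utterance-level encoder -> one embedding per utterance
            return nn.Sequential(nn.Conv1d(n_mels, d, 3, padding=1), nn.GELU(),
                                 nn.AdaptiveAvgPool1d(1), nn.Flatten())
        self.style_enc, self.emotion_enc = utt_encoder(), utt_encoder()
        # content stays frame-level so a decoder could align it with text/audio frames
        self.content_enc = nn.Sequential(nn.Conv1d(n_mels, d, 3, padding=1), nn.GELU(),
                                         nn.Conv1d(d, d, 3, padding=1))

    def forward(self, mel: torch.Tensor):
        # mel: (B, n_mels, T) -> style (B, d), emotion (B, d), content (B, d, T)
        return self.style_enc(mel), self.emotion_enc(mel), self.content_enc(mel)

mel = torch.randn(2, 80, 200)                    # stand-in reference spectrogram
style, emotion, content = ParallelDecomposer()(mel)
print(style.shape, emotion.shape, content.shape)
```
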
FasterVoiceGrad: Faster One-Step Diffusion-based Voice Conversion with Adversarial Diffusion Conversion Distillation
- Diffusion-based Voice Conversion models are considerably slow due to iterative sampling
- FasterVoiceGrad
  - Distills the diffusion model and content encoder through Adversarial Diffusion Conversion Distillation (see the sketch below)
  - In particular, uses adversarial distillation and score distillation training for effective distillation
- Paper (INTERSPEECH 2025): Paper Link
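
A schematic sketch combining the two ingredients named above, an adversarial term and a distillation term toward a frozen multi-step diffusion teacher, to train a one-step converter; the score-distillation component and the content-encoder distillation are omitted. The teacher stand-in, shapes, and loss weights are placeholders, not the paper's formulation.

```python
# Schematic sketch: one-step student trained with a distillation term toward a frozen
# multi-step teacher plus an adversarial term; all modules are simplified stand-ins.
import torch
import torch.nn as nn
import torch.nn.functional as F

D_MEL = 80
one_step_student = nn.Conv1d(2 * D_MEL, D_MEL, 3, padding=1)   # (source mel, speaker cond) -> converted mel
discriminator = nn.Sequential(nn.Conv1d(D_MEL, 64, 3, padding=1), nn.GELU(),
                              nn.AdaptiveAvgPool1d(1), nn.Flatten(), nn.Linear(64, 1))

@torch.no_grad()
def teacher_convert(src_mel: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
    """Stand-in for the slow multi-step diffusion teacher's converted output."""
    return src_mel + 0.1 * cond   # placeholder for the teacher's sampling trajectory endpoint

def student_step(src_mel, cond, opt):
    converted = one_step_student(torch.cat([src_mel, cond], dim=1))   # single forward pass
    distill = F.l1_loss(converted, teacher_convert(src_mel, cond))    # match the teacher's result
    adv = F.softplus(-discriminator(converted)).mean()                # non-saturating GAN generator loss
    loss = distill + 0.1 * adv
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

src_mel = torch.randn(2, D_MEL, 120)
cond = torch.randn(2, D_MEL, 120)   # stand-in for target-speaker conditioning
opt = torch.optim.Adam(one_step_student.parameters(), lr=1e-4)
print(student_step(src_mel, cond, opt))
```
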
