Listen Like a Teacher: Mitigating Whisper Hallucinations using Adaptive Layer Attention and Knowledge Distillation
Whisper suffers from hallucinations under noisy acoustic conditions.
ALA & MOKD
- Improves the robustness of the Whisper encoder using Adaptive Layer Attention (ALA) (see the sketch below)
- Suppresses hallucinations with a Multi-Objective Knowledge Distillation (MOKD) framework
Paper (AAAI 2026): Paper Link
1. Introduction
Recently, Transformer-based models such as Whisper..
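A minimal PyTorch sketch of the layer-attention idea follows, assuming ALA fuses the hidden states of all encoder layers with learned attention weights instead of using only the final layer; the class name `AdaptiveLayerAttention`, the single score per layer, and the shapes are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn as nn

class AdaptiveLayerAttention(nn.Module):
    """Hypothetical sketch: fuse the hidden states of all encoder layers
    with learned attention weights instead of taking only the last layer."""

    def __init__(self, num_layers: int, d_model: int):
        super().__init__()
        # One learnable score per encoder layer (shared across time steps).
        self.layer_scores = nn.Parameter(torch.zeros(num_layers))
        self.proj = nn.Linear(d_model, d_model)

    def forward(self, layer_outputs: list) -> torch.Tensor:
        # layer_outputs: list of (batch, time, d_model), one per encoder layer.
        stacked = torch.stack(layer_outputs, dim=0)           # (L, B, T, D)
        weights = torch.softmax(self.layer_scores, dim=0)     # (L,)
        fused = (weights.view(-1, 1, 1, 1) * stacked).sum(0)  # (B, T, D)
        return self.proj(fused)

# Usage: fuse four dummy layer outputs.
layers = [torch.randn(2, 50, 256) for _ in range(4)]
ala = AdaptiveLayerAttention(num_layers=4, d_model=256)
print(ala(layers).shape)  # torch.Size([2, 50, 256])
```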
BlockDecoder: Boosting ASR Decoders with Context and Merger Modules
In Attention-based Encoder-Decoder models, the decoder autoregressively generates the Automatic Speech Recognition output.
- In particular, the initial layers build textual context, while the later layers merge acoustic and textual information.
BlockDecoder
- Introduces a purely text-based text encoder and a merger that combines the information (see the sketch below)
- Reuses the encoder representation and the text encoder..
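Below is a minimal PyTorch sketch of one way such a merger could combine textual states from a text-only encoder with the reused acoustic encoder output via cross-attention; the class name `Merger`, the layer layout, and the shapes are assumptions for illustration, not the paper's design.

```python
import torch
import torch.nn as nn

class Merger(nn.Module):
    """Hypothetical sketch of a merger block: textual states attend to
    the (reused) acoustic encoder representation via cross-attention."""

    def __init__(self, d_model: int, n_heads: int = 4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model),
                                 nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, text_states, acoustic_states):
        # text_states: (B, U, D) from a purely text-based encoder
        # acoustic_states: (B, T, D) from the speech encoder (computed once, reused)
        attn_out, _ = self.cross_attn(text_states, acoustic_states, acoustic_states)
        x = self.norm1(text_states + attn_out)
        return self.norm2(x + self.ffn(x))

text = torch.randn(2, 10, 256)       # embedded token history
acoustic = torch.randn(2, 120, 256)  # encoder output
print(Merger(256)(text, acoustic).shape)  # torch.Size([2, 10, 256])
```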
LiteASR: Efficient Automatic Speech Recognition with Low-Rank Approximation
Automatic Speech Recognition models are computationally intensive due to their encoder-decoder architecture.
LiteASR
- Applies low-rank compression to the encoder, reducing inference cost while maintaining transcription accuracy
- Uses a small calibration dataset and Principal Component Analysis to approximate linear transformations with a chain of low-rank matrix multiplications (see the sketch below)..
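Below is a minimal PyTorch sketch of the general recipe described above: estimate principal directions from a small set of calibration activations, then replace a linear layer with two smaller layers whose product approximates the original weight. The helper `low_rank_factorize` and its interface are hypothetical illustrations, not LiteASR's actual implementation.

```python
import torch
import torch.nn as nn

def low_rank_factorize(linear: nn.Linear, calib_inputs: torch.Tensor, rank: int) -> nn.Sequential:
    """Hypothetical sketch: approximate a linear layer with a low-rank chain
    using PCA directions estimated from a small calibration set.
    calib_inputs: (N, in_features) activations collected at this layer."""
    X = calib_inputs - calib_inputs.mean(dim=0, keepdim=True)
    # Principal directions of the calibration activations: Vk is (in_features, rank).
    _, _, Vh = torch.linalg.svd(X, full_matrices=False)
    Vk = Vh[:rank].T
    down = nn.Linear(linear.in_features, rank, bias=False)
    up = nn.Linear(rank, linear.out_features, bias=linear.bias is not None)
    with torch.no_grad():
        down.weight.copy_(Vk.T)              # projects x onto the top-k directions
        up.weight.copy_(linear.weight @ Vk)  # maps back through the original weight
        if linear.bias is not None:
            up.bias.copy_(linear.bias)
    return nn.Sequential(down, up)

# Usage: compress a 512->512 layer to rank 64 with 1,000 calibration vectors.
layer = nn.Linear(512, 512)
calib = torch.randn(1000, 512)
approx = low_rank_factorize(layer, calib, rank=64)
x = torch.randn(4, 512)
print((layer(x) - approx(x)).abs().mean())  # approximation error
```

At inference only the two small projections are kept, so the cost per token drops from `in_features * out_features` to `rank * (in_features + out_features)` multiply-adds.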
Beyond Hard Sharing: Efficient Multi-Task Speech-to-Text Modeling with Supervised Mixture of Experts
Hard parameter sharing degrades model performance due to task interference.
S-MoE
- Eliminates the gating function by using special guiding tokens that route each task to a designated expert (see the sketch below)
- Applies S-MoE to a Speech-to-Text model to handle mixed-bandwidth input
Paper (INTERSPEECH 2025): Paper Link
1. Introduction
Speech-to-Text (STT) models..
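Below is a minimal PyTorch sketch of gate-free routing by guiding token: the task's guiding token deterministically selects the designated expert, so no learned gating network is needed. The class `SMoELayer`, the token table, and the expert layout are illustrative assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

# Hypothetical guiding-token ids, one per task (e.g., narrowband vs. wideband ASR).
GUIDE_TOKENS = {"<task_a>": 0, "<task_b>": 1}

class SMoELayer(nn.Module):
    """Sketch of a supervised MoE feed-forward layer: the expert is chosen
    by the task's guiding token, so no learned gating function is used."""

    def __init__(self, d_model: int, num_experts: int):
        super().__init__()
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, 4 * d_model),
                          nn.ReLU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(num_experts)
        ])

    def forward(self, x: torch.Tensor, guide_id: int) -> torch.Tensor:
        # x: (batch, time, d_model); guide_id picks the designated expert.
        return self.experts[guide_id](x)

layer = SMoELayer(d_model=256, num_experts=len(GUIDE_TOKENS))
x = torch.randn(2, 50, 256)
print(layer(x, GUIDE_TOKENS["<task_a>"]).shape)  # torch.Size([2, 50, 256])
```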
M2R-Whisper: Multi-Stage and Multi-Scale Retrieval Augmentation for Enhancing Whisper
Whisper has limitations in accurately recognizing various subdialects.
M2R-Whisper
- Introduces In-Context Learning and Retrieval-Augmented techniques into Whisper
- Applies sentence-level in-context learning in the pre-processing stage and token-level $k$-Nearest Neighbor retrieval in the post-processing stage (see the sketch below)
Paper (ICASSP 2025): Paper Link
1. Introduction
Whisper is an Automatic..
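Below is a minimal sketch of token-level $k$-Nearest Neighbor augmentation in the kNN-LM style: the decoder's output distribution is interpolated with a distribution built from the $k$ nearest datastore entries. The function `knn_augmented_probs`, the datastore layout, and the interpolation weight are assumptions for illustration; the paper's exact scoring may differ.

```python
import torch

def knn_augmented_probs(hidden, keys, values, model_probs, k=8, lam=0.3, temperature=10.0):
    """Sketch of token-level kNN augmentation (kNN-LM style).
    All tensors are assumptions for illustration:
      hidden      (d,)   current decoder hidden state
      keys        (N, d) datastore of cached hidden states
      values      (N,)   the token that followed each cached state
      model_probs (V,)   Whisper's softmax distribution for this step."""
    dists = ((keys - hidden) ** 2).sum(dim=-1)             # squared L2 distances
    knn_dists, idx = dists.topk(k, largest=False)          # k nearest neighbors
    weights = torch.softmax(-knn_dists / temperature, 0)   # closer -> larger weight
    knn_probs = torch.zeros_like(model_probs)
    knn_probs.index_add_(0, values[idx], weights)          # scatter weights onto tokens
    return (1 - lam) * model_probs + lam * knn_probs

# Usage with random stand-ins for a 100-token vocabulary.
V, d, N = 100, 64, 1000
probs = knn_augmented_probs(torch.randn(d), torch.randn(N, d),
                            torch.randint(V, (N,)), torch.softmax(torch.randn(V), 0))
print(probs.sum())  # ~1.0
```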
Whisper in Medusa's Ear: Multi-head Efficient Decoding for Transformer-based ASR
Large Transformer-based models are computationally intensive due to the self-attention mechanism.
Whisper-Medusa
- Extends the Whisper architecture to predict multiple tokens per iteration (see the sketch below)
- Reduces latency by 50% while minimizing the impact on Word Error Rate
Paper (ICASSP 2025): Paper Link
1. Introduction
Transformer-based supervised models such as Whisper ... Automatic ...
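Below is a minimal PyTorch sketch of Medusa-style multi-token prediction: lightweight extra heads on the decoder's last hidden state each propose one additional future token, so a single forward pass drafts several tokens that can then be verified. The class `MedusaHeads`, the head layout, and the shapes are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class MedusaHeads(nn.Module):
    """Sketch of Medusa-style multi-token prediction: extra lightweight heads
    on the decoder's last hidden state each guess one additional future token."""

    def __init__(self, d_model: int, vocab_size: int, num_heads: int = 4):
        super().__init__()
        self.heads = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_model), nn.SiLU(),
                          nn.Linear(d_model, vocab_size))
            for _ in range(num_heads)
        ])

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, d_model) = decoder state at the current position.
        # Returns (batch, num_heads, vocab) logits for positions t+1 .. t+num_heads.
        return torch.stack([head(hidden) for head in self.heads], dim=1)

heads = MedusaHeads(d_model=384, vocab_size=1000, num_heads=4)  # toy vocabulary size
hidden = torch.randn(2, 384)
draft = heads(hidden).argmax(-1)  # (2, 4) candidate tokens, to be verified
print(draft.shape)
```

Accepted draft tokens skip full decoder iterations, which is where the latency reduction comes from; rejected drafts fall back to normal autoregressive decoding, limiting the impact on Word Error Rate.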