AI Summary
This work addresses the vulnerability of isolated-word speech recognition to noise, pronunciation variability, and channel distortion in low-resource, context-scarce critical scenarios such as healthcare and emergency communications. To mitigate these challenges, the authors propose a modular framework that integrates deep learning-based denoising with a hybrid ASR front-end combining Whisper and Vosk, augmented by a lightweight context-aware verification layer. This layer leverages large language model-guided matching, embedding similarity, and edit distance to effectively handle out-of-vocabulary terms and degraded audio quality. Experimental results demonstrate that the proposed approach significantly enhances recognition robustness on both the Google Speech Commands dataset and real-world telephone/message data, achieving substantial accuracy gains under noisy and compressed-channel conditions while maintaining low latency suitable for real-time communication.
Abstract
Single-word Automatic Speech Recognition (ASR) is a challenging task due to the lack of linguistic context and sensitivity to noise, pronunciation variation, and channel artifacts, especially in low-resource, communication-critical domains such as healthcare and emergency response. This paper reviews recent deep learning approaches and proposes a modular framework for robust single-word detection. The system combines denoising and normalization with a hybrid ASR front end (Whisper + Vosk) and a verification layer designed to handle out-of-vocabulary words and degraded audio. The verification layer supports multiple matching strategies, including embedding similarity, edit distance, and LLM-based matching with optional contextual guidance. We evaluate the framework on the Google Speech Commands dataset and a curated real-world dataset collected from telephony and messaging platforms under bandwidth-limited conditions. Results show that while the hybrid ASR front end performs well on clean audio, the verification layer significantly improves accuracy on noisy and compressed channels. Context-guided and LLM-based matching yield the largest gains, demonstrating that lightweight verification and context mechanisms can substantially improve single-word ASR robustness without sacrificing the low latency required for real-time telephony applications.
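
As an illustration of the edit-distance strategy named in the abstract, the following is a minimal sketch of how a verification step might map a noisy ASR hypothesis onto a closed vocabulary. The abstract does not specify the matching rule, so everything here is an assumption made for illustration: the function names `edit_distance` and `verify_word`, the length normalization, and the `max_normalized_distance` threshold of 0.4 are all hypothetical and are not taken from the paper.

```python
from typing import Optional, Sequence


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed with two rolling rows."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete a character from a
                            curr[j - 1] + 1,      # insert a character into a
                            prev[j - 1] + cost))  # substitute (or match)
        prev = curr
    return prev[-1]


def verify_word(hypothesis: str,
                vocabulary: Sequence[str],
                max_normalized_distance: float = 0.4) -> Optional[str]:
    """Return the closest vocabulary word, or None if nothing is close enough.

    The threshold and the normalization by the longer string's length are
    illustrative assumptions, not values from the paper.
    """
    hyp = hypothesis.strip().lower()
    if not hyp or not vocabulary:
        return None
    best_word: Optional[str] = None
    best_score = float("inf")
    for word in vocabulary:
        cand = word.lower()
        score = edit_distance(hyp, cand) / max(len(hyp), len(cand))
        if score < best_score:
            best_word, best_score = word, score
    return best_word if best_score <= max_normalized_distance else None


if __name__ == "__main__":
    commands = ["stop", "go", "left", "right"]
    print(verify_word("stob", commands))    # one substitution away from "stop"
    print(verify_word("banana", commands))  # no command is close enough
```

In this sketch, returning `None` for a rejected hypothesis leaves room for a fallback such as the embedding-similarity or LLM-based matching that the abstract also mentions. How the paper actually orders or combines these strategies is not stated here.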