SW-ASR: A Context-Aware Hybrid ASR Pipeline for Robust Single Word Speech Recognition

📅 2026-01-28
🤖 AI Summary
This work addresses the vulnerability of isolated-word speech recognition to noise, pronunciation variability, and channel distortion in low-resource, context-scarce settings such as healthcare and emergency communications. The authors propose a modular framework that pairs deep learning-based denoising with a hybrid ASR front end combining Whisper and Vosk, followed by a lightweight context-aware verification layer. This layer uses LLM-guided matching, embedding similarity, and edit distance to handle out-of-vocabulary terms and degraded audio. On both the Google Speech Commands dataset and real-world telephony and messaging data, the approach improves recognition robustness, with substantial accuracy gains under noisy and compressed-channel conditions, while keeping latency low enough for real-time communication.
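To make the edit-distance strategy concrete, the following is a minimal sketch of how a raw ASR hypothesis could be snapped to the closest entry in a closed vocabulary. It is an illustration under stated assumptions, not the authors' implementation: the function names `edit_distance` and `verify_word`, the `max_ratio` threshold, and the example vocabulary are all hypothetical.

```python
from typing import Optional, Sequence


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def verify_word(hypothesis: str,
                vocabulary: Sequence[str],
                max_ratio: float = 0.5) -> Optional[str]:
    """Return the closest vocabulary word, or None if nothing is close enough.

    max_ratio is an assumed threshold: the distance divided by the longer
    word's length must not exceed it.
    """
    hyp = hypothesis.strip().lower()
    if not hyp or not vocabulary:
        return None
    best = min(vocabulary, key=lambda w: edit_distance(hyp, w.lower()))
    dist = edit_distance(hyp, best.lower())
    if dist / max(len(hyp), len(best)) <= max_ratio:
        return best
    return None


# Hypothetical vocabulary for illustration only.
VOCAB = ["ambulance", "police", "fire"]
print(verify_word("ambulanse", VOCAB))
print(verify_word("xyzzy", VOCAB))
```

Normalizing the distance by word length keeps the threshold comparable across short and long words; a real system would likely tune it per vocabulary.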

📝 Abstract
Single-word Automatic Speech Recognition (ASR) is a challenging task due to the lack of linguistic context and sensitivity to noise, pronunciation variation, and channel artifacts, especially in low-resource, communication-critical domains such as healthcare and emergency response. This paper reviews recent deep learning approaches and proposes a modular framework for robust single-word detection. The system combines denoising and normalization with a hybrid ASR front end (Whisper + Vosk) and a verification layer designed to handle out-of-vocabulary words and degraded audio. The verification layer supports multiple matching strategies, including embedding similarity, edit distance, and LLM-based matching with optional contextual guidance. We evaluate the framework on the Google Speech Commands dataset and a curated real-world dataset collected from telephony and messaging platforms under bandwidth-limited conditions. Results show that while the hybrid ASR front end performs well on clean audio, the verification layer significantly improves accuracy on noisy and compressed channels. Context-guided and LLM-based matching yield the largest gains, demonstrating that lightweight verification and context mechanisms can substantially improve single-word ASR robustness without sacrificing the low latency required for real-time telephony applications.
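The idea of optional contextual guidance can be illustrated with a small sketch in which a known dialogue context narrows the candidate vocabulary before fuzzy matching. This is only one plausible reading of the abstract, not the paper's method: the `CONTEXT_VOCAB` table, the context labels, the `match_with_context` function, and the `cutoff` value are hypothetical, and the standard-library `difflib` matcher stands in for the embedding or LLM-based matchers the abstract mentions.

```python
from difflib import get_close_matches
from typing import Optional

# Hypothetical per-context vocabularies, for illustration only.
CONTEXT_VOCAB = {
    "triage": ["yes", "no", "pain", "dizzy", "bleeding"],
    "dispatch": ["fire", "police", "ambulance", "rescue"],
}


def match_with_context(hypothesis: str,
                       context: Optional[str] = None,
                       cutoff: float = 0.6) -> Optional[str]:
    """Match an ASR hypothesis against a context-restricted vocabulary.

    If the context is unknown or missing, fall back to the union of all
    vocabularies. Returns None when no candidate reaches the cutoff.
    """
    hyp = hypothesis.strip().lower()
    if context in CONTEXT_VOCAB:
        candidates = CONTEXT_VOCAB[context]
    else:
        candidates = sorted({w for words in CONTEXT_VOCAB.values() for w in words})
    matches = get_close_matches(hyp, candidates, n=1, cutoff=cutoff)
    return matches[0] if matches else None


print(match_with_context("polise", context="dispatch"))
print(match_with_context("polise", context="triage"))
print(match_with_context("ambulanse"))
```

Restricting candidates by context shrinks the search space, so a degraded hypothesis is less likely to be matched to an acoustically similar but irrelevant word.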

Problem

Research questions and friction points this paper is trying to address.

single-word ASR
noise robustness
context-aware recognition
low-resource domains
degraded audio

Innovation

Methods, ideas, or system contributions that make the work stand out.

single-word ASR
hybrid ASR
context-aware verification
LLM-based matching
robust speech recognition