SW-ASR: A Context-Aware Hybrid ASR Pipeline for Robust Single Word Speech Recognition

📅 2026-01-28

🤖 AI Summary

This work addresses the vulnerability of isolated-word speech recognition to noise, pronunciation variability, and channel distortion in low-resource, context-scarce critical scenarios such as healthcare and emergency communications. To mitigate these challenges, the authors propose a modular framework that integrates deep learning-based denoising with a hybrid ASR front-end combining Whisper and Vosk, augmented by a lightweight context-aware verification layer. This layer leverages large language model–guided matching, embedding similarity, and edit distance to effectively handle out-of-vocabulary terms and degraded audio quality. Experimental results demonstrate that the proposed approach significantly enhances recognition robustness on both the Google Speech Commands dataset and real-world telephone/message data, achieving substantial accuracy gains under noisy and compressed-channel conditions while maintaining low latency suitable for real-time communication.

📝 Abstract

Single-word Automatic Speech Recognition (ASR) is a challenging task due to the lack of linguistic context and sensitivity to noise, pronunciation variation, and channel artifacts, especially in low-resource, communication-critical domains such as healthcare and emergency response. This paper reviews recent deep learning approaches and proposes a modular framework for robust single-word detection. The system combines denoising and normalization with a hybrid ASR front end (Whisper + Vosk) and a verification layer designed to handle out-of-vocabulary words and degraded audio. The verification layer supports multiple matching strategies, including embedding similarity, edit distance, and LLM-based matching with optional contextual guidance. We evaluate the framework on the Google Speech Commands dataset and a curated real-world dataset collected from telephony and messaging platforms under bandwidth-limited conditions. Results show that while the hybrid ASR front end performs well on clean audio, the verification layer significantly improves accuracy on noisy and compressed channels. Context-guided and LLM-based matching yield the largest gains, demonstrating that lightweight verification and context mechanisms can substantially improve single-word ASR robustness without sacrificing the low latency required for real-time telephony applications.
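
To make the edit-distance matching strategy mentioned in the abstract more concrete, the following is a minimal sketch and not the authors' implementation. It assumes the hybrid front end yields one text hypothesis per recognizer (plain strings here, standing in for Whisper and Vosk output) and that the target words form a small closed vocabulary. The function names `levenshtein` and `verify_word`, the length normalization, and the 0.34 threshold are all hypothetical choices made for illustration.

```python
from typing import Iterable, Optional


def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) via dynamic programming."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string on the inner loop to save memory
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def verify_word(hypotheses: Iterable[str],
                vocabulary: Iterable[str],
                max_normalized_distance: float = 0.34) -> Optional[str]:
    """Return the vocabulary word closest to any hypothesis, or None.

    The threshold is an assumed value for illustration, not one from the paper.
    """
    vocab = list(vocabulary)
    best_word, best_score = None, float("inf")
    for hyp in hypotheses:
        hyp = hyp.strip().lower()
        if not hyp:
            continue  # a recognizer may return nothing on degraded audio
        for word in vocab:
            # Normalize by the longer string so short and long words are comparable.
            score = levenshtein(hyp, word.lower()) / max(len(hyp), len(word))
            if score < best_score:
                best_word, best_score = word, score
    return best_word if best_score <= max_normalized_distance else None


if __name__ == "__main__":
    commands = ["forward", "backward", "stop"]
    # Two hypothetical, slightly corrupted transcripts of the same utterance.
    print(verify_word(["forwad", "for word"], commands))
    # A transcript far from every command is rejected rather than forced to match.
    print(verify_word(["xyz"], commands))
```

Returning `None` instead of the nearest word lets a caller fall back to another strategy, such as embedding similarity or LLM-based matching, when string distance alone is inconclusive.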

Problem

Research questions and friction points this paper is trying to address.

single-word ASR

noise robustness

context-aware recognition

low-resource domains

degraded audio

Innovation

Methods, ideas, or system contributions that make the work stand out.

single-word ASR

hybrid ASR

context-aware verification