AI Summary
To address the challenges of poor recognition of elderly users' disfluent speech and the inability of conventional ASR-LLM cascaded systems to detect non-speech events (e.g., falls, cries for help), this paper proposes DESAMO, the first edge-deployed, embedded Audio Large Language Model (Audio LLM) system tailored for elderly-friendly smart homes. DESAMO eliminates reliance on ASR by performing multi-granularity audio understanding directly on-device from raw waveforms, jointly modeling natural speech and critical non-speech events while ensuring real-time responsiveness, robustness, and on-device privacy preservation. Experiments demonstrate significant improvements in both elderly speech recognition accuracy and emergency event detection reliability, with average inference latency under 200 ms and fully local, end-to-end data processing. Its core contributions are: (1) the first efficient deployment of an Audio LLM on resource-constrained embedded hardware, and (2) a novel end-to-end audio semantic understanding paradigm designed specifically for elderly users.
Abstract
We present DESAMO, an on-device, elder-friendly smart home system powered by an Audio LLM that supports natural and private interactions. Conventional voice assistants rely on ASR-based pipelines or ASR-LLM cascades, which often struggle with the unclear speech common among elderly users and cannot handle non-speech audio at all. DESAMO instead leverages an Audio LLM to process raw audio input directly, enabling robust understanding of both user intent and critical events, such as falls or calls for help.
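The architectural contrast above can be sketched schematically. The toy below is purely illustrative and assumes nothing about DESAMO's actual models or APIs: every function is a hypothetical stand-in, and the "waveform" is a plain dict standing in for real audio features. It shows why a cascade that first transcribes to text is structurally blind to non-speech events, while an end-to-end audio model can surface them.

```python
# Illustrative toy only: all names here are hypothetical stand-ins,
# not the paper's implementation. A dict stands in for raw audio.

def asr_transcribe(waveform):
    # Stand-in ASR front-end: it can only emit a text transcript,
    # so purely acoustic events (a fall, a crash) produce nothing.
    return waveform.get("speech_text", "")

def llm_interpret(text):
    # Stand-in text-only LLM: it reasons over the transcript alone.
    return {"intent": text or "unknown", "event": None}

def cascade_pipeline(waveform):
    # ASR-LLM cascade: audio -> text -> LLM. Non-speech audio is
    # discarded at the ASR stage and never reaches the LLM.
    return llm_interpret(asr_transcribe(waveform))

def audio_llm(waveform):
    # Stand-in end-to-end Audio LLM: consumes the audio directly,
    # so speech intent and non-speech events are modeled jointly.
    return {
        "intent": waveform.get("speech_text") or "unknown",
        "event": waveform.get("acoustic_event"),
    }

fall = {"acoustic_event": "fall"}        # non-speech input
print(cascade_pipeline(fall)["event"])   # None: the cascade is blind to it
print(audio_llm(fall)["event"])          # fall: detected end-to-end
```

The point of the sketch is the information bottleneck: in the cascade, anything the ASR cannot transcribe is lost before the LLM ever sees it, whereas the direct path keeps the full audio signal available for interpretation.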