🤖 AI Summary
Existing zero-shot human activity recognition (HAR) methods rely heavily on large language models (LLMs) via prompting, raising critical concerns regarding privacy leakage, dependency on external services, and inconsistency across model versions. To address these issues, this paper proposes an LLM-free zero-shot HAR framework. Our approach unifies sensor time-series data and activity semantics into natural language representations, leveraging a custom-designed semantic encoder to learn cross-modal embeddings and establishing a cross-domain alignment mechanism between activity texts and sensor features for end-to-end zero-shot classification. We present the first systematic evaluation of language-based embedding transferability across six real-world HAR datasets under multi-source, cross-scenario settings. Experimental results demonstrate substantial improvements in cross-environment generalization. The proposed method offers a deployable, robust, and LLM-free zero-shot HAR paradigm, particularly suitable for privacy-sensitive applications such as smart homes.
📝 Abstract
Developing zero-shot human activity recognition (HAR) methods is a critical direction in smart home research, given their potential to make HAR systems work across smart homes with diverse sensing modalities, layouts, and activities of interest. State-of-the-art solutions along this direction generate natural language descriptions of the sensor data and feed them, via a carefully crafted prompt, to an LLM that performs the classification. Despite their strong performance, such "prompt-the-LLM" approaches carry several risks, including privacy invasion, reliance on an external service, and inconsistent predictions across model versions, making a case for alternative zero-shot HAR methods that do not require prompting an LLM. In this paper, we propose one such solution: we model both sensor data and activities in natural language and use the resulting embeddings to perform zero-shot classification, thereby bypassing the need to prompt an LLM for activity predictions. The impact of our work lies in presenting a detailed case study on six datasets, highlighting how language modeling can bolster HAR systems in zero-shot recognition.
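The core idea, embedding both the textual sensor description and the candidate activity labels in a shared language space and classifying by similarity, can be sketched as below. This is a minimal illustration, not the paper's method: the bag-of-words `embed` function is a stand-in for the paper's learned semantic encoder, and the sensor narrative and activity labels are invented examples.

```python
import math
from collections import Counter

def embed(text):
    # Stand-in encoder: bag-of-words term counts. In the paper's framework,
    # a learned cross-modal semantic encoder would produce these embeddings.
    return Counter(text.lower().split())

def cosine(u, v):
    # Cosine similarity between two sparse count vectors.
    dot = sum(count * v[term] for term, count in u.items())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def zero_shot_classify(sensor_description, activity_labels):
    # Embed the sensor-event narrative and every candidate activity text,
    # then pick the activity whose embedding is closest. No LLM is prompted
    # at prediction time; classification is a nearest-neighbor lookup.
    query = embed(sensor_description)
    return max(activity_labels, key=lambda a: cosine(query, embed(a)))

# Hypothetical sensor-event narrative and unseen activity labels.
desc = "kitchen motion sensor fired then the stove burner turned on"
labels = ["cooking in the kitchen", "sleeping in the bedroom", "watching tv"]
print(zero_shot_classify(desc, labels))  # → cooking in the kitchen
```

With a strong text encoder, the same lookup generalizes to activity labels never seen during training, which is what makes the approach zero-shot.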