🤖 AI Summary
Traditional human activity recognition (HAR) methods generalize poorly to unseen activities, which makes them ineffective in zero-shot settings. To address this, we propose Open-Vocabulary HAR (OV-HAR), a shift from fixed-label classification to natural language–based activity modeling. OV-HAR establishes a bidirectional mapping between activities and textual descriptions via text-embedding regression and a pretrained embedding inversion model, without relying on autoregressive large language models. This enables zero-shot HAR across modalities (e.g., pose, IMU, pressure sensing) and produces interpretable text output. Experiments show that OV-HAR improves generalization to novel activities, avoids the zero-probability problem inherent in conventional classifiers, and works across modalities while remaining semantically interpretable.
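The zero-probability problem mentioned above can be illustrated with a small sketch. It is not code from the paper: the label set, the logits, and the helper `class_probability` are made up for illustration. A softmax classifier spreads all of its probability mass over the labels it was trained on, so any activity outside that set receives exactly zero.

```python
import math

# Hypothetical closed label set seen during training (an assumption for this sketch).
TRAIN_LABELS = ["walking", "sitting", "jumping"]


def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def class_probability(logits, label):
    """Probability that a closed-set classifier assigns to `label`."""
    probs = dict(zip(TRAIN_LABELS, softmax(logits)))
    # A label outside the training set has no output unit, so it gets 0.
    return probs.get(label, 0.0)


logits = [1.2, 0.3, 2.5]  # made-up scores for one sensor window
seen = class_probability(logits, "jumping")
unseen = class_probability(logits, "lunge")
```

Here `seen` is a proper probability, while `unseen` is zero no matter what the sensor window contains, which is the failure mode an open-vocabulary formulation is meant to avoid.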
📝 Abstract
Conventional human activity recognition (HAR) relies on classifiers trained to predict discrete activity classes, which inherently limits recognition to activities present in the training set. Such classifiers invariably fail on unseen activities, assigning them zero likelihood. We propose Open Vocabulary HAR (OV-HAR), a framework that overcomes this limitation by first converting each activity into natural language, breaking it into a sequence of elementary motions. This descriptive text is then encoded into a fixed-size embedding. The model is trained to regress this embedding from the sensor input, and the predicted embedding is decoded back into natural language using a pre-trained embedding inversion model. Unlike other works that rely on auto-regressive large language models (LLMs) at their core, OV-HAR achieves open-vocabulary recognition without the computational overhead of such models. The generated text can be transformed into a single activity class using LLM prompt engineering. We evaluate our approach on different modalities, including vision (pose), IMU, and pressure sensors, and demonstrate robust generalization to unseen activities across modalities, offering a fundamentally different paradigm from contemporary classifiers.
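The pipeline described in the abstract (describe the activity as elementary motions, embed the text, regress the embedding from sensor data, invert the embedding back to text) can be sketched on synthetic data. Everything below is an assumption made for illustration, not the authors' implementation: the motion vocabulary, the count-vector encoder `embed` (standing in for a real text encoder), the threshold-based `invert` (standing in for a pre-trained embedding inversion model), the simulated sensor features, and the ridge regressor.

```python
import numpy as np

# Hypothetical vocabulary of elementary motions (not taken from the paper).
MOTIONS = ["bend_knees", "lower_hips", "raise_arms", "jump", "step_forward", "twist_torso"]
N_SAMPLES = 30  # synthetic sensor windows per training activity


def embed(description):
    """Toy text encoder: count vector over MOTIONS (stand-in for a sentence encoder)."""
    words = description.split()
    return np.array([float(words.count(m)) for m in MOTIONS])


def invert(embedding, threshold=0.5):
    """Toy embedding inversion: keep the motion words whose score exceeds the threshold."""
    return " ".join(m for m, v in zip(MOTIONS, embedding) if v > threshold)


# Activities written as text built from elementary motions.
TRAIN = {
    "squat": "bend_knees lower_hips",
    "knee bend": "bend_knees",
    "overhead reach": "raise_arms",
    "jumping jack": "raise_arms jump",
    "forward step": "step_forward",
    "torso twist": "twist_torso",
}
UNSEEN = {"lunge": "bend_knees lower_hips step_forward"}  # never used for training

rng = np.random.default_rng(0)
SIGNATURES = rng.normal(size=(len(MOTIONS), 16))  # synthetic per-motion sensor signature


def simulate_sensor(description, n=N_SAMPLES, noise=0.02):
    """Synthetic sensor features: sum of motion signatures plus Gaussian noise."""
    clean = embed(description) @ SIGNATURES
    return clean + noise * rng.normal(size=(n, clean.size))


# Regression dataset: sensor features -> text embedding.
X = np.vstack([simulate_sensor(d) for d in TRAIN.values()])
Y = np.vstack([np.tile(embed(d), (N_SAMPLES, 1)) for d in TRAIN.values()])

# Ridge regression in closed form: W = (X^T X + lam * I)^-1 X^T Y.
lam = 1e-3
W = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ Y)

# Zero-shot step: regress the embedding of an unseen activity, then invert it to text.
window = simulate_sensor(UNSEEN["lunge"], n=1)
predicted_text = invert((window @ W)[0])
```

In this toy setup the regressor never sees a lunge, yet the predicted embedding can still decode to the lunge's motions because each of them appears in some training description. Mapping `predicted_text` to a single class label, which the abstract does with LLM prompting, is left out of the sketch.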