🤖 AI Summary
To address the privacy-sensitive, few-shot, and continual learning challenges of personalized human action recognition on XR devices, where new action classes emerge incrementally, we propose a privacy-preserving few-shot continual learning framework. Methodologically, we are the first to introduce learnable spatiotemporal prompt-offset tuning into graph neural networks (GNNs), establishing a lightweight prompt-tuning paradigm on compact GNN backbones that eliminates the reliance on large-scale pretrained Transformers. Crucially, raw skeletal or video data are never collected, stored, or replayed, preserving end-to-end user privacy. Evaluated on continual benchmarks built from NTU RGB+D and SHREC-2017, the approach outperforms state-of-the-art methods: it reduces model parameters by 87% and inference latency by 63% while achieving superior accuracy in incremental action-class adaptation. The framework thus offers high efficiency, low-resource deployability, and strong generalization across evolving action vocabularies.
📝 Abstract
As extended reality (XR) redefines how users interact with computing devices, research in human action recognition is gaining prominence. Typically, models deployed on immersive computing devices are static and limited to their default set of classes. The goal of our research is to give users and developers the capability to personalize their experience by continually adding new action classes to their device models. Importantly, a user should be able to add new classes in a low-shot and efficient manner, and this process should not require storing or replaying any of the user's sensitive training data. We formalize this problem as privacy-aware few-shot continual action recognition. Towards this end, we propose POET: Prompt-Offset Tuning. While existing prompt tuning approaches have shown great promise for continual learning of image, text, and video modalities, they demand access to extensively pretrained transformers. Breaking away from this assumption, POET demonstrates the efficacy of prompt tuning a significantly lighter-weight backbone, pretrained exclusively on the base class data. We propose a novel spatio-temporal learnable prompt offset tuning approach, and are the first to apply such prompt tuning to Graph Neural Networks. We contribute two new benchmarks for our new problem setting in human action recognition: (i) the NTU RGB+D dataset for activity recognition, and (ii) the SHREC-2017 dataset for hand gesture recognition. We find that POET consistently outperforms a comprehensive set of baselines. Source code at https://github.com/humansensinglab/POET-continual-action-recognition.
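The core idea behind prompt-offset tuning can be illustrated in miniature: learnable offsets are added to the inputs of a backbone whose own weights stay frozen, so adapting to a new class touches only the small prompt parameters. The sketch below is a deliberately simplified, dependency-free illustration of that principle; the toy linear "backbone", its weights, and the gradient-descent loop are assumptions for demonstration, not the paper's actual spatio-temporal GNN method.

```python
# Toy illustration of prompt-offset tuning: only the input offsets are
# learned; the "backbone" stays frozen throughout. (Illustrative sketch;
# POET's real backbone is a graph neural network over skeleton joints.)

def frozen_backbone(features, weights):
    """Stand-in for a pretrained, frozen model: a fixed linear map."""
    return sum(w * x for w, x in zip(weights, features))

def tune_prompt_offsets(features, weights, target, lr=0.01, steps=200):
    """Learn additive offsets on the input features so the frozen
    backbone's output moves toward `target`. Backbone weights are
    never updated, mirroring the key idea of prompt tuning."""
    offsets = [0.0] * len(features)
    for _ in range(steps):
        shifted = [x + p for x, p in zip(features, offsets)]
        error = frozen_backbone(shifted, weights) - target
        # d(error^2)/d(offset_i) = 2 * error * weights[i]
        offsets = [p - lr * 2 * error * w
                   for p, w in zip(offsets, weights)]
    return offsets
```

With, say, `weights = [1.0, 2.0]` and `features = [1.0, 1.0]`, the frozen model initially outputs 3.0; after tuning toward `target = 10.0`, feeding the offset-shifted features back through the unchanged backbone yields an output close to the target, with zero backbone parameters modified.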