🤖 AI Summary
On portable XR devices, egocentric action recognition (EAR) faces a fundamental trade-off among computational overhead, power consumption, and accuracy. Method: This paper proposes an asynchronous multimodal sampling and lightweight fusion framework. It first identifies the complementary temporal sampling characteristics of RGB video and high-frequency 3D hand pose sequences, then designs a cross-modal joint sampling-rate optimization strategy that pairs RGB frame-rate reduction with high-frequency hand-pose capture. It further introduces a low-complexity multimodal temporal modeling and feature fusion network. Results: Experiments show that the method maintains near-lossless recognition accuracy (degradation <0.5%), reduces CPU usage by up to 3x, and substantially improves on-device real-time performance and energy efficiency, pointing to an efficient EAR paradigm for wearable XR platforms.
📝 Abstract
The increasing availability of wearable XR devices opens new perspectives for Egocentric Action Recognition (EAR) systems, which can provide deeper human understanding and situation awareness. However, deploying real-time algorithms on these devices can be challenging due to the inherent trade-offs between portability, battery life, and computational resources. In this work, we systematically analyze the impact of sampling frequency across different input modalities - RGB video and 3D hand pose - on egocentric action recognition performance and CPU usage. By exploring a range of configurations, we provide a comprehensive characterization of the trade-offs between accuracy and computational efficiency. Our findings reveal that reducing the sampling rate of RGB frames, when complemented with higher-frequency 3D hand pose input, can preserve high accuracy while significantly lowering CPU demands. Notably, we observe up to a 3x reduction in CPU usage with minimal to no loss in recognition performance. This highlights the potential of multimodal input strategies as a viable approach to achieving efficient, real-time EAR on XR devices.
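To make the asynchronous-sampling idea concrete, here is a minimal sketch of what a multimodal clip looks like when the RGB stream is downsampled while the 3D hand pose stream stays high-frequency, followed by a trivially lightweight late fusion (temporal averaging plus concatenation). All specifics are assumptions for illustration, not the paper's actual pipeline: the sampling rates (6 FPS RGB, 60 Hz pose), the 512-d RGB frame embeddings, the 21-joint hand pose layout, and the averaging fusion are placeholders standing in for the learned temporal modeling and fusion network.

```python
import numpy as np

# Hypothetical rates for illustration (not taken from the paper):
RGB_FPS = 6     # downsampled RGB stream
POSE_HZ = 60    # high-frequency 3D hand pose stream
WINDOW_S = 2.0  # length of one action clip, in seconds

def sample_clip(window_s=WINDOW_S, rgb_fps=RGB_FPS, pose_hz=POSE_HZ, seed=0):
    """Simulate one asynchronously sampled multimodal clip.

    The two modalities cover the same time window but at different rates,
    so they yield different numbers of samples.
    """
    rng = np.random.default_rng(seed)
    n_rgb = int(window_s * rgb_fps)    # 12 RGB frames in this example
    n_pose = int(window_s * pose_hz)   # 120 pose samples in this example
    rgb = rng.standard_normal((n_rgb, 512))       # stand-in per-frame embeddings
    pose = rng.standard_normal((n_pose, 21 * 3))  # 21 joints x 3D coordinates
    return rgb, pose

def fuse(rgb, pose):
    """Toy late fusion: average each modality over time, then concatenate."""
    rgb_feat = rgb.mean(axis=0)
    pose_feat = pose.mean(axis=0)
    return np.concatenate([rgb_feat, pose_feat])

rgb, pose = sample_clip()
fused = fuse(rgb, pose)
print(rgb.shape, pose.shape, fused.shape)  # (12, 512) (120, 63) (575,)
```

The point of the sketch is the shape asymmetry: lowering the RGB rate shrinks the expensive image-processing workload (here, 12 frames instead of 120), while the cheap pose stream keeps its full temporal resolution, which is the mechanism behind the reported CPU savings.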