Efficient Egocentric Action Recognition with Multimodal Data

📅 2025-06-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
On portable XR devices, egocentric action recognition (EAR) faces a fundamental trade-off among computational overhead, power consumption, and accuracy. Method: This paper proposes an asynchronous multimodal sampling and lightweight fusion framework. It first identifies the complementary temporal sampling characteristics between RGB video and high-frequency 3D hand pose sequences, then designs a cross-modal joint sampling-rate optimization strategy that synergistically leverages RGB frame-rate reduction and high-frequency hand-pose capture. Further, it introduces a low-complexity multimodal temporal modeling and feature fusion network. Results: Experiments demonstrate that the method maintains near-lossless recognition accuracy (degradation <0.5%), reduces CPU utilization by 3×, and significantly improves edge-side real-time performance and energy efficiency—establishing a novel paradigm for efficient EAR on wearable XR platforms.

📝 Abstract
The increasing availability of wearable XR devices opens new perspectives for Egocentric Action Recognition (EAR) systems, which can provide deeper human understanding and situation awareness. However, deploying real-time algorithms on these devices can be challenging due to the inherent trade-offs between portability, battery life, and computational resources. In this work, we systematically analyze the impact of sampling frequency across different input modalities (RGB video and 3D hand pose) on egocentric action recognition performance and CPU usage. By exploring a range of configurations, we provide a comprehensive characterization of the trade-offs between accuracy and computational efficiency. Our findings reveal that reducing the sampling rate of RGB frames, when complemented with higher-frequency 3D hand pose input, can preserve high accuracy while significantly lowering CPU demands. Notably, we observe up to a 3× reduction in CPU usage with minimal to no loss in recognition performance. This highlights the potential of multimodal input strategies as a viable approach to achieving efficient, real-time EAR on XR devices.
Problem

Research questions and friction points this paper is trying to address.

Analyze sampling frequency impact on egocentric action recognition
Explore accuracy-efficiency trade-offs in multimodal input strategies
Reduce CPU usage while maintaining recognition performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal data fusion for action recognition
Optimized sampling rates for efficiency
Reduced CPU usage with maintained accuracy
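The sampling strategy above can be illustrated with a minimal sketch: the RGB stream is downsampled by a stride while the 3D hand-pose stream keeps its native frequency. All names, rates, and the stride value here are illustrative assumptions, not taken from the paper.

```python
# Hypothetical sketch of asynchronous multimodal sampling:
# RGB frames are kept at a reduced rate, while high-frequency
# 3D hand-pose readings are retained in full. Rates and the
# stride are assumptions for illustration only.

def sample_streams(duration_s, rgb_fps=30, pose_hz=60, rgb_stride=4):
    """Return the timestamps kept from each modality stream.

    rgb_stride downsamples RGB (e.g. 30 fps -> 7.5 fps effective);
    the pose stream is left at its native frequency.
    """
    rgb_ts = [i / rgb_fps for i in range(int(duration_s * rgb_fps))]
    pose_ts = [i / pose_hz for i in range(int(duration_s * pose_hz))]
    sampled_rgb = rgb_ts[::rgb_stride]  # drop intermediate RGB frames
    return sampled_rgb, pose_ts

rgb, pose = sample_streams(2.0)
print(len(rgb), len(pose))  # → 15 120
```

A recognition model would then fuse the sparse RGB features with the dense pose sequence; the paper's point is that the dense, cheap modality compensates for the sparsified, expensive one.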
Authors
Marco Calzavara (ETH Zurich)
Ard Kastrati (ETH Zurich)
Matteo Macchini (Magic Leap)
R. Wattenhofer (ETH Zurich)