A Real-Time System for Egocentric Hand-Object Interaction Detection in Industrial Domains

πŸ“… 2025-07-17
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
To address the challenge of real-time hand-object interaction detection from a first-person industrial perspective, this paper proposes a cascaded end-to-end architecture: a lightweight Mamba-EfficientNetV2 model performs action-triggered interaction recognition, which subsequently drives fine-tuned YOLOWorld for joint hand and object detection. This work is the first to introduce the Mamba architecture into this task, achieving a balanced trade-off between accuracy and efficiency while maintaining real-time inference at 30 FPS. On the ENIGMA-51 dataset, the method achieves a part-level average precision (p-AP) of 38.52% for action recognition and an average precision (AP) of 85.13% for hand-object detection. The approach integrates practical deployability with state-of-the-art performance, providing a low-latency, intuitive solution for human-robot collaborative scenarios.

Technology Category

Application Category

πŸ“ Abstract
Hand-object interaction detection remains an open challenge in real-time applications, where intuitive user experiences depend on fast and accurate detection of interactions with surrounding objects. We propose an efficient approach for detecting hand-objects interactions from streaming egocentric vision that operates in real time. Our approach consists of an action recognition module and an object detection module for identifying active objects upon confirmed interaction. Our Mamba model with EfficientNetV2 as backbone for action recognition achieves 38.52% p-AP on the ENIGMA-51 benchmark at 30fps, while our fine-tuned YOLOWorld reaches 85.13% AP for hand and object. We implement our models in a cascaded architecture where the action recognition and object detection modules operate sequentially. When the action recognition predicts a contact state, it activates the object detection module, which in turn performs inference on the relevant frame to detect and classify the active object.
Problem

Research questions and friction points this paper is trying to address.

Detect hand-object interactions in real-time industrial applications
Improve accuracy and speed of egocentric vision interaction detection
Identify active objects during confirmed hand-object interactions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Real-time egocentric hand-object interaction detection
Cascaded Mamba and YOLOWorld model architecture
EfficientNetV2 backbone for action recognition
πŸ”Ž Similar Papers
No similar papers found.