A Multimodal RGB and Events Dataset for Hand Detection in First-Person View

📅 2026-06-09

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses two problems. RGB-based hand detection degrades in dynamic scenes because of frame-rate limits, motion blur, and low light, and event cameras, which could mitigate these issues, lack the annotated data needed for object detection. To bridge this gap, the authors propose a methodology and an exemplary synthetic multimodal hand detection dataset from a first-person perspective. Built on the EgoHands RGB dataset, it uses the v2e toolbox to synthesize high-temporal-resolution event streams. A fine-tuned YOLOv8 model produces bounding boxes on the RGB frames, and these boxes are temporally interpolated to label the events. Varying the v2e parameters yields versions of the dataset with different lighting conditions and scales. Existing multimodal (event plus RGB) detection methods applied to this dataset achieve performance comparable to the state of the art, which supports the use of synthetic event data for this task.
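As a rough illustration of how a frame-to-event simulator can turn RGB-derived intensity frames into an event stream, the sketch below uses the idealized event-camera model: a pixel emits an event each time its log intensity changes by a fixed contrast threshold. This is a simplified assumption for illustration, not v2e's actual implementation; the function name, the `threshold` and `eps` values, and the linear-in-time change between frames are all hypothetical choices.

```python
import math


def events_between_frames(prev, curr, t0, t1, threshold=0.2, eps=1e-3):
    """Idealized event model (an assumption, not v2e's code).

    prev, curr: 2D lists of intensities in [0, 1] at times t0 and t1.
    Returns a time-sorted list of (t, x, y, polarity) tuples.
    """
    events = []
    for y, (row_prev, row_curr) in enumerate(zip(prev, curr)):
        for x, (a, b) in enumerate(zip(row_prev, row_curr)):
            # Change in log intensity; eps avoids log(0).
            delta = math.log(b + eps) - math.log(a + eps)
            n = int(abs(delta) / threshold)  # number of threshold crossings
            polarity = 1 if delta > 0 else -1
            for k in range(1, n + 1):
                # Assume log intensity changes linearly from t0 to t1,
                # so the k-th crossing happens at this fraction of the interval.
                frac = k * threshold / abs(delta)
                events.append((t0 + frac * (t1 - t0), x, y, polarity))
    events.sort()
    return events


# One static pixel and one pixel that doubles in brightness.
demo = events_between_frames([[0.5, 0.5]], [[0.5, 1.0]], 0.0, 1.0 / 30.0)
for event in demo:
    print(event)
```

Only the pixel that brightens produces events, and their timestamps fall between the two frame times, which is what gives the synthetic stream a finer temporal resolution than the source video. Real simulators additionally model noise, bandwidth limits, and frame upsampling, none of which appear in this sketch.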

📝 Abstract

Existing hand detection algorithms work on images, so the detection rate is restricted by the frame rate of the camera. In hand detection applications for moving robotic systems, conventional cameras suffer from motion blur, especially in darker lighting conditions. Event-based cameras, which offer high dynamic range, high temporal resolution, and low power consumption, can address these limitations. Recent work has shown that using a stereo setup of an event-based and a frame-based camera improves detection accuracy and the bandwidth-latency tradeoff. The main bottleneck in using event-based cameras for object detection and recognition tasks is the relatively small amount of training data. In this work, we propose a methodology and an exemplary synthetic event-based hand dataset from an egocentric, first-person view. The data is synthesized from the existing RGB EgoHands dataset with the v2e toolbox. Parameters of the v2e toolbox are varied to provide versions of the dataset with different lighting conditions and scales. Ground truth detections are generated with a fine-tuned YOLOv8 model, which is applied to the RGB images in the EgoHands dataset, and are interpolated onto the high-temporal-resolution events. We use the multimodal dataset to perform hand detection with existing object detection algorithms that use a multimodal setup of event and RGB cameras, and we demonstrate performance comparable to the state of the art.
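The label interpolation step described above can be sketched as follows. The abstract does not state which interpolation scheme is used, so this example assumes simple linear interpolation of box corners between two consecutive RGB frames, assumes boxes in `(x1, y1, x2, y2)` pixel format, and assumes the same hand has already been matched across the two frames. The function names are hypothetical.

```python
def interpolate_box(box0, box1, t0, t1, t):
    """Linearly interpolate an (x1, y1, x2, y2) box between frame times t0 and t1."""
    if t1 <= t0 or not (t0 <= t <= t1):
        raise ValueError("need t0 < t1 and t0 <= t <= t1")
    alpha = (t - t0) / (t1 - t0)
    return tuple((1 - alpha) * a + alpha * b for a, b in zip(box0, box1))


def labels_for_event_windows(box0, box1, t0, t1, n_windows):
    """Split [t0, t1] into equal event windows and label each at its midpoint.

    Returns a list of (timestamp, box) pairs, one per window.
    """
    dt = (t1 - t0) / n_windows
    labels = []
    for i in range(n_windows):
        t_mid = t0 + (i + 0.5) * dt
        labels.append((t_mid, interpolate_box(box0, box1, t0, t1, t_mid)))
    return labels


# Two RGB detections of the same hand, one second apart (illustrative values),
# expanded into four event-window labels.
for t, box in labels_for_event_windows((0, 0, 10, 10), (10, 20, 30, 40), 0.0, 1.0, 4):
    print(t, box)
```

Linear interpolation is the simplest choice and is reasonable when the inter-frame interval is short. Fast or nonlinear hand motion between frames would call for a different scheme, and hands that appear or disappear between frames need separate handling that this sketch omits.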

Problem

Research questions and friction points this paper is trying to address.

hand detection

event-based cameras

first-person view

multimodal dataset

training data scarcity

Innovation

Methods, ideas, or system contributions that make the work stand out.

event-based vision

multimodal dataset

hand detection