Human Activity Recognition using RGB-Event based Sensors: A Multi-modal Heat Conduction Model and A Benchmark Dataset

📅 2025-04-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the severe performance degradation of RGB cameras in human activity recognition (HAR) under low-light and high-speed-motion conditions, this paper proposes MMHCO-HAR, a novel multi-modal RGB-Event fusion framework. Methodologically, it introduces a physics-inspired multi-modal Heat Conduction Operation layer, incorporating learnable thermal conductivity coefficients (via FVEs) and a policy-based routing mechanism for adaptive fusion, along with a dual-stream stem network and a DCT-IDCT feature coupling module. Contributions include: (1) the release of HARDVS 2.0, a large-scale, high-quality RGB-Event paired HAR benchmark dataset comprising 300 classes and 107,646 samples, filling a critical gap in the field; and (2) comprehensive experiments demonstrating significant improvements in robustness, accuracy, and generalization across challenging scenarios. The code and dataset are publicly available.
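The summary describes the policy-based routing mechanism only at a high level. A minimal NumPy sketch of one plausible interpretation, in which a learned router weights a few candidate fusion strategies per sample, might look like this; the function names, the router parameterization, and the three candidate strategies (RGB-only, event-only, additive) are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def softmax(z, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def policy_routing_fusion(rgb, evt, W, hard=False):
    """Route each sample to a fusion strategy.
    rgb, evt: (B, D) modality embeddings; W: (2D, 3) router weights.
    Candidate strategies (an assumption): RGB-only, event-only, additive."""
    logits = np.concatenate([rgb, evt], axis=-1) @ W      # (B, 3) scores
    if hard:
        weights = np.eye(3)[logits.argmax(axis=-1)]       # discrete routing
    else:
        weights = softmax(logits)                         # soft routing
    candidates = np.stack([rgb, evt, rgb + evt], axis=1)  # (B, 3, D)
    return (weights[:, :, None] * candidates).sum(axis=1)
```

Soft routing keeps the choice differentiable for training, while hard routing commits each sample to a single strategy at inference; trainable relaxations such as Gumbel-softmax are a common way to bridge the two.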

📝 Abstract
Human Activity Recognition (HAR) has primarily relied on traditional RGB cameras to achieve high-performance activity recognition. However, challenging factors in real-world scenarios, such as insufficient lighting and rapid movements, inevitably degrade the performance of RGB cameras. To address these challenges, biologically inspired event cameras offer a promising solution for overcoming the limitations of traditional RGB cameras. In this work, we rethink human activity recognition by combining RGB and event cameras. The first contribution is the proposed large-scale multi-modal RGB-Event human activity recognition benchmark dataset, termed HARDVS 2.0, which bridges the dataset gap. It contains 300 categories of everyday real-world actions with a total of 107,646 paired videos covering various challenging scenarios. Inspired by the physics-informed heat conduction model, we propose a novel multi-modal heat conduction operation framework for effective activity recognition, termed MMHCO-HAR. In more detail, given the RGB frames and event streams, we first extract the feature embeddings using a stem network. Then, multi-modal Heat Conduction blocks are designed to fuse the dual features; their key module is the multi-modal Heat Conduction Operation layer. We integrate RGB and event embeddings through a multi-modal DCT-IDCT layer while adaptively incorporating the thermal conductivity coefficient via FVEs into this module. After that, we propose an adaptive fusion module based on a policy routing strategy for high-performance classification. Comprehensive experiments demonstrate that our method performs consistently well, validating its effectiveness and robustness. The source code and benchmark dataset will be released at https://github.com/Event-AHU/HARDVS/tree/HARDVSv2
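The DCT-IDCT heat conduction step can be made concrete with a small sketch. In the cosine basis, the 2D heat equation decouples and each frequency decays exponentially at a rate set by the thermal conductivity k, so the operation is DCT → per-frequency decay → IDCT. The NumPy version below is a minimal single-channel sketch with a fixed scalar k; the function names and this simplification are assumptions, not the paper's code (which predicts k adaptively via FVEs).

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II matrix: dct_matrix(n) @ dct_matrix(n).T == I."""
    k = np.arange(n)[:, None]
    m = np.arange(n)[None, :]
    M = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * m + 1) * k / (2 * n))
    M[0] /= np.sqrt(2.0)
    return M

def heat_conduction_op(x, k=1.0, t=1.0):
    """Diffuse a 2D feature map x by solving the heat equation in the
    cosine (DCT) basis: each frequency decays as exp(-k * lambda * t)."""
    H, W = x.shape
    Mh, Mw = dct_matrix(H), dct_matrix(W)
    freq = Mh @ x @ Mw.T                    # forward 2D DCT
    wy = (np.pi * np.arange(H) / H) ** 2    # Laplacian eigenvalues
    wx = (np.pi * np.arange(W) / W) ** 2    # (Neumann boundaries)
    freq *= np.exp(-k * t * (wy[:, None] + wx[None, :]))
    return Mh.T @ freq @ Mw                 # inverse 2D DCT
```

With k = 0 the operation is the identity; as k·t grows, high-frequency content is suppressed and the map approaches its mean. This global, frequency-domain mixing is the behavior heat-conduction-based vision models exploit as an attention-like operation.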

Problem

Research questions and friction points this paper is trying to address.

Overcoming RGB camera limitations in human activity recognition
Creating a multi-modal RGB-Event benchmark dataset
Developing a heat conduction model for activity recognition

Innovation

Methods, ideas, or system contributions that make the work stand out.

Combines RGB and event cameras for HAR
Introduces multi-modal heat conduction framework
Develops large-scale benchmark dataset HARDVS 2.0
Authors

Shiao Wang
Anhui University
Deep learning

Xiao Wang
School of Computer Science and Technology, Anhui University, Hefei, 230601, China

Bo Jiang
School of Computer Science and Technology, Anhui University, Hefei, 230601, China

Lin Zhu
Peng Cheng Laboratory, Beijing, China

Guoqi Li
Professor, Institute of Automation, Chinese Academy of Sciences; previously Tsinghua University
Brain-inspired computing, spiking neural networks, brain-inspired large models, NeuroAI

Yaowei Wang
The Hong Kong Polytechnic University

Yonghong Tian
National Key Laboratory for Multimedia Information Processing, School of Computer Science, Peking University, China

Jin Tang
Anhui University
Computer vision, intelligent video analysis