M2I2HA: Multi-modal Object Detection Based on Intra- and Inter-Modal Hypergraph Attention

๐Ÿ“… 2026-01-21
๐Ÿ“ˆ Citations: 0
โœจ Influential: 0
๐Ÿ“„ PDF
๐Ÿค– AI Summary
This work addresses the challenges of insufficient intra-modal feature extraction and difficult cross-modal alignment in multimodal object detection. It introduces hypergraph theory to this task for the first time, proposing a hypergraph attentionโ€“based multimodal perception network. The method effectively models high-order many-to-many relationships through an intra-modal hypergraph enhancement mechanism and an inter-modal hypergraph fusion strategy, thereby bridging semantic and spatial discrepancies across modalities. An adaptive multi-level fusion module is further integrated to optimize feature distribution and flow. By overcoming the limitations of conventional CNNs, Transformers, and state space models in capturing high-order dependencies and preserving spatial structure, the proposed approach achieves state-of-the-art performance, significantly outperforming existing methods on multiple public benchmarks.

๐Ÿ“ Abstract
Recent advances in multi-modal detection have significantly improved detection accuracy in challenging environments (e.g., low light, overexposure). By integrating RGB with modalities such as thermal and depth, multi-modal fusion increases data redundancy and system robustness. However, significant challenges remain in effectively extracting task-relevant information both within and across modalities, as well as in achieving precise cross-modal alignment. While CNNs excel at feature extraction, they are limited by constrained receptive fields, strong inductive biases, and difficulty in capturing long-range dependencies. Transformer-based models offer global context but suffer from quadratic computational complexity and are confined to pairwise correlation modeling. Mamba and other State Space Models (SSMs), on the other hand, are hindered by their sequential scanning mechanism, which flattens 2D spatial structures into 1D sequences, disrupting topological relationships and limiting the modeling of complex higher-order dependencies. To address these issues, we propose a multi-modal perception network based on hypergraph theory, called M2I2HA. Our architecture includes an Intra-Hypergraph Enhancement module to capture global many-to-many high-order relationships within each modality, and an Inter-Hypergraph Fusion module to align, enhance, and fuse cross-modal features by bridging configuration and spatial gaps between data sources. We further introduce an M2-FullPAD module to enable adaptive multi-level fusion of multi-modal enhanced features within the network, while enhancing data distribution and flow across the architecture. Extensive object detection experiments on multiple public datasets demonstrate that M2I2HA achieves state-of-the-art performance over existing baselines in multi-modal object detection tasks.
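To make the two-stage "many-to-many" message passing behind hypergraph attention concrete, here is a minimal NumPy sketch. This is not the authors' implementation: the incidence matrix `H`, the dot-product attention scores, and the projection matrices `Wq`/`Wk` are illustrative assumptions. It only shows the generic pattern the abstract refers to, namely that features are first aggregated from vertices into hyperedges (each hyperedge connecting an arbitrary group of vertices), then scattered back from hyperedges to vertices, so one update step mixes information across whole groups rather than pairs.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def hypergraph_attention(X, H, Wq, Wk):
    """One round of two-stage hypergraph message passing with attention (toy sketch).

    X : (n, d) vertex features (e.g., pixel/patch tokens from one modality)
    H : (n, m) binary incidence matrix; H[v, e] = 1 iff vertex v belongs to hyperedge e
    Wq, Wk : (d, d) illustrative query/key projections
    """
    n, m = H.shape
    Q, K = X @ Wq, X @ Wk
    d = X.shape[1]

    # Stage 1: vertex -> hyperedge. Each hyperedge attends over its member
    # vertices, producing one feature per hyperedge (a group-level summary).
    E = np.zeros((m, d))
    for e in range(m):
        members = np.where(H[:, e] > 0)[0]
        scores = softmax(K[members] @ K[members].mean(axis=0))
        E[e] = scores @ X[members]

    # Stage 2: hyperedge -> vertex. Each vertex attends over the hyperedges
    # it belongs to, so group-level context flows back to individual vertices.
    X_out = np.zeros_like(X)
    for v in range(n):
        edges = np.where(H[v] > 0)[0]
        scores = softmax(E[edges] @ Q[v])
        X_out[v] = scores @ E[edges]
    return X_out
```

Because a hyperedge can connect any number of vertices, a single update captures high-order relationships that pairwise (Transformer-style) attention would need multiple hops to model; in the paper's setting, intra-modal and inter-modal variants of this idea operate within and across the RGB/thermal feature maps, respectively.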
Problem

Research questions and friction points this paper is trying to address.

multi-modal object detection
cross-modal alignment
intra-modal feature extraction
inter-modal feature fusion
high-order dependencies
Innovation

Methods, ideas, or system contributions that make the work stand out.

hypergraph attention
multi-modal fusion
high-order dependency
cross-modal alignment
object detection
๐Ÿ”Ž Similar Papers
No similar papers found.
Xiaofan Yang
Yubin Liu
Wei Pan
Guoqing Chu
Junming Zhang (Zhejiang University)
Jie Zhao
Zhuoqi Man
Xuanming Cao