🤖 AI Summary
To address insufficient cross-modal feature relationship modeling, high-frequency detail loss, and poor downstream task adaptability in infrared–visible image fusion, this paper proposes a three-branch encoder–decoder architecture that jointly models modality-specific complementarity and modality-shared correlation. We introduce a Cross Attention and Invertible (CAI) module to preserve textural details losslessly; a Graph Reasoning (GR) module to explicitly capture high-level cross-modal semantic correlations; and a multi-guided fusion mechanism integrating semantic awareness and task-driven optimization. The method hierarchically extracts features via hybrid deep convolutional and lightweight Transformer blocks, incorporating Base Feature Extraction (BFE), GR, and multi-level aggregation modules. Our approach achieves state-of-the-art performance on both infrared–visible and medical image fusion benchmarks. Moreover, it significantly improves downstream tasks: object detection mAP@0.5 increases by 8.27%, and semantic segmentation mIoU rises by 5.85%. The source code is publicly available.
📝 Abstract
Multi-modality image fusion aims at fusing modality-specific (complementary) and modality-shared (correlated) information from multiple source images. To tackle the neglect of inter-feature relationships, the loss of high-frequency information, and the limited attention to downstream tasks, this paper focuses on modeling correlation-driven decomposed features and reasoning over high-level graph representations by efficiently extracting complementary information and aggregating multi-guided features. We propose a three-branch encoder-decoder architecture along with corresponding fusion layers as the fusion strategy. First, shallow features of each modality are extracted by a depthwise convolution layer combined with a Transformer block. In the three parallel branches of the encoder, the Cross Attention and Invertible Block (CAI) extracts local features and preserves high-frequency texture details; the Base Feature Extraction Module (BFE) captures long-range dependencies and enhances modality-shared information; and the Graph Reasoning Module (GR) reasons over high-level cross-modality relations while extracting low-level detail features as modality-specific complementary information for CAI. Experiments demonstrate competitive results compared with state-of-the-art methods on visible/infrared image fusion and medical image fusion tasks. Moreover, the proposed algorithm surpasses state-of-the-art methods on downstream tasks, scoring on average 8.27% higher mAP@0.5 in object detection and 5.85% higher mIoU in semantic segmentation. The code is available at https://github.com/Abraham-Einstein/SMFNet/.
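The data flow described above (shallow extraction, then three parallel branches, then fusion) can be sketched structurally. This is a minimal illustration with placeholder NumPy operations standing in for the paper's actual CAI, BFE, and GR layers; the function names and internals are assumptions for exposition, not the released implementation.

```python
import numpy as np

def shallow_features(img):
    # Stand-in for the depthwise-conv + Transformer shallow extractor.
    return img - img.mean()

def cai_branch(feat):
    # CAI placeholder: modality-specific high-frequency detail,
    # approximated here as the residual from a blurred copy.
    blur = (np.roll(feat, 1, axis=-1) + np.roll(feat, -1, axis=-1) + feat) / 3
    return feat - blur

def bfe_branch(f_a, f_b):
    # BFE placeholder: modality-shared, long-range base information.
    return (f_a + f_b) / 2

def gr_branch(feat):
    # GR placeholder: high-level global context broadcast back onto the map.
    return np.full_like(feat, feat.mean())

def fuse(ir, vis):
    f_ir, f_vis = shallow_features(ir), shallow_features(vis)
    detail = cai_branch(f_ir) + cai_branch(f_vis)    # complementary branch
    base = bfe_branch(f_ir, f_vis)                   # shared branch
    semantic = gr_branch(f_ir) + gr_branch(f_vis)    # graph-reasoning branch
    return base + detail + semantic                  # decoder stand-in: sum

ir = np.random.rand(8, 8)
vis = np.random.rand(8, 8)
fused = fuse(ir, vis)
print(fused.shape)  # (8, 8)
```

The point of the sketch is the branch layout: detail, base, and semantic streams are computed in parallel from the shallow features and only merged at the fusion stage, mirroring the three-branch encoder in the abstract.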