🤖 AI Summary
Infrared and visible-light image fusion suffers from insufficient cross-modal global spatial interaction, incomplete perception of salient objects, and modality bias. To address these issues, this paper proposes S4Fusion, a cross-modal fusion framework based on a selective state space model (SSSM). The method introduces: (1) a Cross-Modal Spatial Awareness (CMSA) module that enables global, cooperative modeling of infrared and visible-light features; and (2) an uncertainty-driven saliency enhancement mechanism that adaptively weights features to preserve salient objects, integrating a pre-trained uncertainty estimation network with multi-scale feature interaction. Across multiple benchmarks, the approach produces strong results in both quantitative fusion metrics and qualitative visual fidelity. Moreover, downstream tasks, including object detection and recognition, show consistent accuracy improvements, validating the effectiveness and generalizability of the fused representations.
📝 Abstract
As one of the tasks in image fusion, infrared and visible image fusion aims to integrate complementary information captured by sensors of different modalities into a single image. The Selective State Space Model (SSSM), known for its ability to capture long-range dependencies, has demonstrated its potential in the field of computer vision. However, in image fusion, current methods underestimate the potential of SSSM for capturing the global spatial information of both modalities. This limitation prevents the global spatial information of both modalities from being considered simultaneously during interaction, leading to incomplete perception of salient targets. Consequently, the fusion results tend to be biased toward one modality rather than adaptively preserving salient targets. To address this issue, we propose the Saliency-aware Selective State Space Fusion Model (S4Fusion). In S4Fusion, the designed Cross-Modal Spatial Awareness Module (CMSA) simultaneously attends to global spatial information from both modalities while facilitating their interaction, thereby comprehensively capturing complementary information. Additionally, S4Fusion leverages a pre-trained network to perceive uncertainty in the fused images. By minimizing this uncertainty, S4Fusion adaptively highlights salient targets from both images. Extensive experiments demonstrate that our approach produces high-quality images and enhances performance in downstream tasks.
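To make the uncertainty-driven weighting idea concrete, here is a minimal, hypothetical sketch in NumPy. It is not the paper's implementation: it assumes some estimator (here just given as arrays `u_ir` and `u_vis`) produces per-pixel uncertainty maps for each modality, and fuses the aligned images with softmax weights so that lower-uncertainty regions contribute more. The function name `uncertainty_weighted_fusion` is illustrative only.

```python
import numpy as np

def softmax(x, axis=0):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def uncertainty_weighted_fusion(ir, vis, u_ir, u_vis):
    """Fuse two aligned images with per-pixel weights derived from
    (hypothetical) uncertainty maps: lower uncertainty -> larger weight."""
    # negate uncertainties so confident pixels get the larger softmax weight
    w = softmax(np.stack([-u_ir, -u_vis]), axis=0)  # weights sum to 1 per pixel
    return w[0] * ir + w[1] * vis

# toy 4x4 example: the infrared estimate is more confident everywhere,
# so the fused result leans toward the infrared image
rng = np.random.default_rng(0)
ir, vis = rng.random((4, 4)), rng.random((4, 4))
u_ir = np.full((4, 4), 0.2)   # low uncertainty (confident)
u_vis = np.full((4, 4), 1.0)  # high uncertainty
fused = uncertainty_weighted_fusion(ir, vis, u_ir, u_vis)
```

In the paper the weighting is learned by minimizing the uncertainty perceived by a pre-trained network rather than computed in closed form; this sketch only illustrates the adaptive, per-pixel nature of the fusion.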