🤖 AI Summary
Infrared and visible-light image fusion suffers from insufficient cross-modal global spatial interaction, incomplete perception of salient objects, and modality bias. To address these issues, this paper proposes S4Fusion, a cross-modal fusion framework based on a selective state space model (SSSM). The method introduces: (1) a Cross-Modal Spatial Awareness (CMSA) module that enables global, cooperative modeling of infrared and visible-light features; and (2) an uncertainty-driven saliency enhancement mechanism that adaptively weights features to preserve salient objects, integrating a pre-trained uncertainty estimation network with multi-scale feature interaction. Across multiple benchmarks, the approach produces strong results in both quantitative fusion metrics and qualitative visual fidelity. Moreover, downstream tasks, including object detection and recognition, show consistent accuracy improvements, validating the effectiveness and generalizability of the fused representations.
📝 Abstract
As one of the tasks in image fusion, infrared and visible image fusion aims to integrate complementary information captured by sensors of different modalities into a single image. The Selective State Space Model (SSSM), known for its ability to capture long-range dependencies, has demonstrated its potential in the field of computer vision. However, in image fusion, current methods underestimate the potential of SSSM for capturing the global spatial information of both modalities. This limitation prevents the global spatial information of both modalities from being considered simultaneously during interaction, leading to incomplete perception of salient targets. Consequently, the fusion results tend to be biased toward one modality rather than adaptively preserving salient targets. To address this issue, we propose the Saliency-aware Selective State Space Fusion Model (S4Fusion). In S4Fusion, the designed Cross-Modal Spatial Awareness Module (CMSA) simultaneously attends to global spatial information from both modalities while facilitating their interaction, thereby comprehensively capturing complementary information. Additionally, S4Fusion leverages a pre-trained network to perceive uncertainty in the fused images. By minimizing this uncertainty, S4Fusion adaptively highlights salient targets from both images. Extensive experiments demonstrate that our approach produces high-quality images and enhances performance in downstream tasks.
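To make the uncertainty-driven weighting idea concrete, here is a minimal, hypothetical sketch in NumPy. It is not the paper's implementation: it assumes some estimator (here just given as arrays `u_ir` and `u_vis`) produces per-pixel uncertainty maps for each modality, and fuses the aligned images with softmax weights so that lower-uncertainty regions contribute more. The function name `uncertainty_weighted_fusion` is illustrative only.

```python
import numpy as np

def softmax(x, axis=0):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def uncertainty_weighted_fusion(ir, vis, u_ir, u_vis):
    """Fuse two aligned images with per-pixel weights derived from
    (hypothetical) uncertainty maps: lower uncertainty -> larger weight."""
    # negate uncertainties so confident pixels get the larger softmax weight
    w = softmax(np.stack([-u_ir, -u_vis]), axis=0)  # weights sum to 1 per pixel
    return w[0] * ir + w[1] * vis

# toy 4x4 example: the infrared estimate is more confident everywhere,
# so the fused result leans toward the infrared image
rng = np.random.default_rng(0)
ir, vis = rng.random((4, 4)), rng.random((4, 4))
u_ir = np.full((4, 4), 0.2)   # low uncertainty (confident)
u_vis = np.full((4, 4), 1.0)  # high uncertainty
fused = uncertainty_weighted_fusion(ir, vis, u_ir, u_vis)
```

In the paper the weighting is learned by minimizing the uncertainty perceived by a pre-trained network rather than computed in closed form; this sketch only illustrates the adaptive, per-pixel nature of the fusion.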