Cross-Modality Feature Fusion Based on Structured State Space Duality for Multimodal Image Registration Network

📅 2026-06-02

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the challenge of extracting shared structural information in multimodal image registration by proposing RegNetMamba-2, which introduces Structured State Space Duality (SSD) into a coarse-to-fine registration framework. SSD is used to model both local and global structural features, and a cross-modality feature interaction (CMI) module together with a progressive multi-scale feature fusion (MSF) module fuses features across modalities and scales. In experiments on VIS-SAR, VIS-IR, and VIS-NIR benchmarks, the authors report that RegNetMamba-2 compares favorably with existing deep-learning-based approaches in both registration performance and efficiency.
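To make the "duality" in SSD concrete, the sketch below shows the general idea as it is usually stated for scalar-decay state space models: the same output can be computed either by a linear-time recurrence or by a quadratic, attention-like weighted sum. This is a minimal toy illustration in plain Python, not the RegNetMamba-2 implementation; the function names `ssd_recurrent` and `ssd_quadratic`, the single input channel, and the sample values are assumptions made for the example.

```python
def ssd_recurrent(a, B, C, x):
    """Linear-time form: h_t = a_t * h_{t-1} + B_t * x_t, y_t = C_t . h_t."""
    state_dim = len(B[0])
    h = [0.0] * state_dim
    ys = []
    for a_t, B_t, C_t, x_t in zip(a, B, C, x):
        # Decay the previous state by the scalar a_t, then inject the input.
        h = [a_t * h_i + b_i * x_t for h_i, b_i in zip(h, B_t)]
        ys.append(sum(c_i * h_i for c_i, h_i in zip(C_t, h)))
    return ys


def ssd_quadratic(a, B, C, x):
    """Attention-like form: y_t = sum_{s<=t} (C_t . B_s) * (a_{s+1}...a_t) * x_s."""
    ys = []
    for t in range(len(x)):
        y_t = 0.0
        decay = 1.0  # product a_{s+1}..a_t, built up as s walks backwards
        for s in range(t, -1, -1):
            score = sum(c * b for c, b in zip(C[t], B[s]))
            y_t += score * decay * x[s]
            decay *= a[s]
        ys.append(y_t)
    return ys


# Hypothetical toy sequence: 3 steps, state dimension 2.
a = [0.9, 0.5, 0.8]
B = [[1.0, 0.0], [0.5, 1.0], [0.0, 2.0]]
C = [[1.0, 1.0], [0.0, 1.0], [1.0, -1.0]]
x = [1.0, 2.0, 3.0]

y_rec = ssd_recurrent(a, B, C, x)
y_quad = ssd_quadratic(a, B, C, x)
```

Both functions return the same sequence up to floating-point error. The recurrent form scales linearly with sequence length, while the quadratic form exposes the pairwise structure that lets SSD be compared with attention.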

📝 Abstract

In multi-modal image registration, the primary challenge lies in extracting structural information shared across modalities. Compared with Transformers, Structured State Space Duality (SSD) offers stronger global structural feature extraction with higher efficiency during training and inference. Inspired by these advantages, we propose a novel algorithm for multi-modal image registration, named RegNetMamba-2. Our algorithm incorporates SSD into a coarse-to-fine matching process to extract local and global structural features effectively. First, SSD is applied at three different scales for multi-modal feature extraction. To strengthen local representation, we use the feature scaling function of SSD to place more emphasis on foreground edges and structural information. Second, for shared feature extraction from the input images and multi-modal feature fusion across all scales, we propose a cross-modality feature fusion model based on SSD, consisting of a Cross-Modality feature Interaction (CMI) module and a Multi-Scale feature Fusion (MSF) module. The CMI module performs cross-modality feature extraction at each scale by applying SSD in a cross form. The MSF module employs progressive upward fusion at the feature level to obtain fine features that combine the multi-modal features from all scales. Following the coarse-to-fine scheme, the 1/8-scale features from CMI and the 1/2-scale features from MSF are used to calculate matching probability scores, and the matching at each of the two levels is then established through pixel-wise correspondences. Extensive experiments demonstrate that, compared with state-of-the-art deep-learning-based algorithms, RegNetMamba-2 performs well in both performance and efficiency for multi-modal image registration on the following datasets: VIS-SAR (OSDataset), VIS-IR (LGHD/RoadScene), and VIS-NIR (RGB-NIR Scene).
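The abstract does not say how the matching probability scores are computed. As a hedged illustration, the sketch below uses a dual-softmax over a feature similarity matrix followed by a mutual-nearest-neighbour check, a scheme commonly used in coarse-to-fine matchers; it is an assumption, not the authors' method. The function names, the `temperature` and `threshold` values, and the toy descriptors are all hypothetical.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def matching_probabilities(feat_a, feat_b, temperature=0.1):
    """Dual-softmax score: P[i][j] = softmax_row(S)[i][j] * softmax_col(S)[i][j]."""
    # Similarity matrix from dot products, sharpened by the temperature.
    sim = [[sum(p * q for p, q in zip(fa, fb)) / temperature for fb in feat_b]
           for fa in feat_a]
    row = [softmax(r) for r in sim]
    col_t = [softmax(list(c)) for c in zip(*sim)]  # indexed as col_t[j][i]
    return [[row[i][j] * col_t[j][i] for j in range(len(feat_b))]
            for i in range(len(feat_a))]


def mutual_matches(prob, threshold=0.2):
    """Keep (i, j) pairs that are each other's best match and exceed the threshold."""
    matches = []
    for i, r in enumerate(prob):
        j = max(range(len(r)), key=r.__getitem__)
        best_i = max(range(len(prob)), key=lambda k: prob[k][j])
        if best_i == i and r[j] > threshold:
            matches.append((i, j))
    return matches


# Hypothetical 2-D descriptors for two images (e.g. flattened coarse-scale cells).
feat_a = [[1.0, 0.0], [0.0, 1.0]]
feat_b = [[0.0, 1.0], [1.0, 0.0], [0.7, 0.7]]

prob = matching_probabilities(feat_a, feat_b)
matches = mutual_matches(prob)  # [(0, 1), (1, 0)]
```

In a coarse-to-fine pipeline, a scoring step of this kind would typically run first on the coarse (here 1/8-scale) features to select candidate correspondences, and then again on finer (here 1/2-scale) features around each candidate to refine them.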

Problem

Research questions and friction points this paper is trying to address.

multimodal image registration

shared structural information

cross-modality feature fusion

feature extraction

image alignment

Innovation

Methods, ideas, or system contributions that make the work stand out.

Structured State Space Duality

Cross-Modality Feature Fusion

Multimodal Image Registration