🤖 AI Summary
Current medical image fusion methods face two key bottlenecks: CNNs have limited capacity for modeling global contextual dependencies, while Transformers suffer from high computational complexity and remain underexplored in both 3D extension and clinical validation. To address these challenges, we propose the first CNN-Mamba hybrid architecture tailored for multimodal neuroimaging. Our method integrates state space models (SSMs) to capture long-range dependencies with linear computational complexity, and we further design a tri-plane scanning strategy to efficiently encode 3D volumetric structural correlations. The architecture jointly preserves fine-grained local details and holistic semantic context, enabling end-to-end real-time fusion. Quantitative evaluation on three public benchmarks demonstrates consistent superiority across all metrics. Moreover, fused images significantly improve downstream 2D/3D brain tumor classification performance, substantiating both the clinical validity and the deployment feasibility of our approach.
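The linear-complexity claim rests on the standard state-space recurrence, which processes a sequence in a single pass rather than via pairwise attention. The sketch below is a toy scalar version for illustration only; Mamba's selective scan additionally makes the parameters input-dependent, which is omitted here:

```python
import numpy as np

def ssm_scan(x, a, b, c):
    """Discrete linear state-space recurrence:
        h_t = a * h_{t-1} + b * x_t
        y_t = c * h_t
    One sweep over the sequence -> O(L) time and memory for the scan,
    versus the O(L^2) pairwise interactions of self-attention.
    Scalar state used here for clarity; real SSM layers use vector states
    and (in Mamba) input-dependent a, b, c."""
    h = 0.0
    y = np.empty_like(x, dtype=float)
    for t, xt in enumerate(x):
        h = a * h + b * xt
        y[t] = c * h
    return y

# Example: decaying memory of past inputs.
y = ssm_scan(np.array([1.0, 2.0, 3.0]), a=0.5, b=1.0, c=1.0)
# y == [1.0, 2.5, 4.25]
```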
📝 Abstract
Multimodal medical image fusion integrates complementary information from different imaging modalities to enhance diagnostic accuracy and treatment planning. While deep learning methods have advanced performance, existing approaches face critical limitations: Convolutional Neural Networks (CNNs) excel at local feature extraction but struggle to model global context effectively, while Transformers achieve superior long-range modeling at the cost of quadratic computational complexity, limiting clinical deployment. Recent State Space Models (SSMs) offer a promising alternative, enabling efficient long-range dependency modeling in linear time through selective scan mechanisms. Despite these advances, the extension to 3D volumetric data and the clinical validation of fused images remain underexplored. In this work, we propose ClinicalFMamba, a novel end-to-end CNN-Mamba hybrid architecture that synergistically combines local and global feature modeling for 2D and 3D images. We further design a tri-plane scanning strategy for effectively learning volumetric dependencies in 3D images. Comprehensive evaluations on three datasets demonstrate superior fusion performance across multiple quantitative metrics while achieving real-time fusion. We further validate the clinical utility of our approach on downstream 2D/3D brain tumor classification tasks, achieving superior performance over baseline methods. Our method establishes a new paradigm for efficient multimodal medical image fusion suitable for real-time clinical deployment.
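One plausible reading of the tri-plane scanning strategy is to serialize a 3D volume along its three orthogonal orientations (axial, coronal, sagittal), giving each scan direction its own token sequence. The sketch below is a minimal NumPy illustration under that assumption; the function name and the exact scan order are hypothetical, not the paper's implementation:

```python
import numpy as np

def tri_plane_scan(volume):
    """Serialize a 3D volume of shape (D, H, W) into three 2D sequence
    views, one per orthogonal plane orientation:
      axial    -> D sequences of H*W voxels (one per axial slice)
      coronal  -> H sequences of D*W voxels
      sagittal -> W sequences of D*H voxels
    Each view covers every voxel, so a sequence model run over all three
    sees each spatial axis as the 'slice' axis once."""
    d, h, w = volume.shape
    axial = volume.reshape(d, h * w)                        # row-major within each axial slice
    coronal = volume.transpose(1, 0, 2).reshape(h, d * w)   # H becomes the leading axis
    sagittal = volume.transpose(2, 0, 1).reshape(w, d * h)  # W becomes the leading axis
    return axial, coronal, sagittal

# Example: a tiny 2x3x4 volume.
vol = np.arange(24, dtype=float).reshape(2, 3, 4)
ax, co, sa = tri_plane_scan(vol)
# shapes: (2, 12), (3, 8), (4, 6) -- all 24 voxels appear in each view
```

In a hybrid architecture along these lines, each of the three sequence views would be fed through an SSM block and the outputs merged, so long-range dependencies are captured along every volumetric axis at linear cost.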