StereoMamba: Real-time and Robust Intraoperative Stereo Disparity Estimation via Long-range Spatial Dependencies

📅 2025-04-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the challenge of real-time, robust stereo disparity estimation in Robot-Assisted Minimally Invasive Surgery (RAMIS), this paper proposes StereoMamba, a Mamba-based state-space architecture. The method introduces a novel Feature Extraction Mamba (FE-Mamba) module that captures long-range spatial dependencies within and across stereo images, a Multidimensional Feature Fusion (MFF) module that integrates FE-Mamba's multi-scale features, and a disparity-guided reconstruction scheme that evaluates consistency between synthesized and real right images via SSIM and PSNR. On the ex-vivo SCARED benchmark, StereoMamba achieves an End-Point Error (EPE) of 2.64 px, a depth MAE of 2.55 mm, and real-time inference at 21.28 FPS on 1280×1024 image pairs. Without fine-tuning, it also attains the best average SSIM (0.8970) and PSNR (16.0761) among compared methods on the in-vivo RIS2017 and StereoMIS datasets, demonstrating strong zero-shot generalization.
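The metrics quoted above (EPE, depth MAE, BadN) have simple definitions, sketched below in NumPy for reference. This is illustrative only; `focal_px` and `baseline_mm` are hypothetical calibration parameters, not values from the paper.

```python
import numpy as np

def epe(disp_pred, disp_gt):
    """End-Point Error: mean absolute disparity error in pixels."""
    return float(np.abs(disp_pred - disp_gt).mean())

def bad_ratio(disp_pred, disp_gt, thresh=3.0):
    """BadN metric: percentage of pixels whose disparity error exceeds `thresh` px."""
    return float(100.0 * (np.abs(disp_pred - disp_gt) > thresh).mean())

def disparity_to_depth(disp, focal_px, baseline_mm):
    """Pinhole stereo geometry: depth Z = f * B / d (hypothetical calibration values)."""
    return focal_px * baseline_mm / np.maximum(disp, 1e-6)

def depth_mae(disp_pred, disp_gt, focal_px, baseline_mm):
    """Mean absolute depth error in millimetres."""
    z_pred = disparity_to_depth(disp_pred, focal_px, baseline_mm)
    z_gt = disparity_to_depth(disp_gt, focal_px, baseline_mm)
    return float(np.abs(z_pred - z_gt).mean())
```

So Bad2 = 41.49% means roughly two in five pixels have a disparity error above 2 px, while the EPE of 2.64 px is the average error over all pixels.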

📝 Abstract
Stereo disparity estimation is crucial for obtaining depth information in robot-assisted minimally invasive surgery (RAMIS). While current deep learning methods have made significant advancements, challenges remain in achieving an optimal balance between accuracy, robustness, and inference speed. To address these challenges, we propose the StereoMamba architecture, which is specifically designed for stereo disparity estimation in RAMIS. Our approach is based on a novel Feature Extraction Mamba (FE-Mamba) module, which enhances long-range spatial dependencies both within and across stereo images. To effectively integrate multi-scale features from FE-Mamba, we then introduce a novel Multidimensional Feature Fusion (MFF) module. Experiments against the state-of-the-art on the ex-vivo SCARED benchmark demonstrate that StereoMamba achieves superior performance on EPE of 2.64 px and depth MAE of 2.55 mm, the second-best performance on Bad2 of 41.49% and Bad3 of 26.99%, while maintaining an inference speed of 21.28 FPS for a pair of high-resolution images (1280×1024), striking the optimal balance between accuracy, robustness, and efficiency. Furthermore, by comparing synthesized right images, generated from warping left images using the generated disparity maps, with the actual right image, StereoMamba achieves the best average SSIM (0.8970) and PSNR (16.0761), exhibiting strong zero-shot generalization on the in-vivo RIS2017 and StereoMIS datasets.
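The disparity-guided consistency check described in the abstract (warping the left image into the right view and scoring agreement with SSIM/PSNR) can be sketched as follows. This is a minimal nearest-neighbour forward warp with a simplified single-window SSIM, assumed for illustration rather than taken from the paper's implementation.

```python
import numpy as np

def warp_left_to_right(left, disp):
    """Forward-warp a grayscale left image into the right view:
    under rectified stereo, a left pixel at column x maps to column x - d."""
    H, W = left.shape
    right = np.zeros_like(left)
    xs = np.arange(W)
    for y in range(H):
        tx = np.round(xs - disp[y]).astype(int)
        valid = (tx >= 0) & (tx < W)          # drop pixels warped outside the frame
        right[y, tx[valid]] = left[y, xs[valid]]
    return right

def psnr(a, b, peak=1.0):
    """Peak signal-to-noise ratio in dB."""
    mse = np.mean((a - b) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(peak ** 2 / mse)

def global_ssim(a, b, peak=1.0):
    """Simplified SSIM over the whole image (no sliding window)."""
    c1, c2 = (0.01 * peak) ** 2, (0.03 * peak) ** 2
    mu_a, mu_b = a.mean(), b.mean()
    cov = ((a - mu_a) * (b - mu_b)).mean()
    return float(((2 * mu_a * mu_b + c1) * (2 * cov + c2))
                 / ((mu_a ** 2 + mu_b ** 2 + c1) * (a.var() + b.var() + c2)))
```

A well-estimated disparity map should make `warp_left_to_right(left, disp)` closely match the real right image, driving SSIM toward 1 and PSNR high, which is the basis of the zero-shot evaluation on RIS2017 and StereoMIS.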
Problem

Research questions and friction points this paper is trying to address.

Achieving real-time stereo disparity estimation in RAMIS
Balancing accuracy, robustness, and inference speed
Enhancing long-range spatial dependencies in stereo images
Innovation

Methods, ideas, or system contributions that make the work stand out.

StereoMamba architecture for RAMIS disparity
FE-Mamba module enhances spatial dependencies
Multidimensional Feature Fusion integrates multi-scale features
Xu Wang
Department of Medical Physics and Biomedical Engineering, University College London, London, UK
Jialang Xu
University College London
Surgical robotics vision, Deep learning, Computer vision, Wireless communication
Shuai Zhang
Department of Computer Science, University College London, London, UK
Baoru Huang
University of Liverpool; Imperial College London
Robotics, Computer vision, Surgical vision, Image-Guided Intervention
Danail Stoyanov
Professor of Robot Vision, University College London
Surgical Vision, Surgical AI, Surgical Robotics, Computer Assisted Interventions, Surgical Data Science
Evangelos B. Mazomenos
Department of Medical Physics and Biomedical Engineering, University College London, London, UK