EndoMamba: An Efficient Foundation Model for Endoscopic Videos

📅 2025-02-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses two key bottlenecks in endoscopic video understanding: the high computational cost of general-purpose video foundation models and the scarcity of domain-specific pretraining data, which jointly limit performance. To this end, we propose EndoMamba, a lightweight and efficient architecture that integrates bidirectional Mamba for spatial modeling and unidirectional Mamba for temporal modeling. We further introduce a hierarchical self-supervised pretraining paradigm that jointly optimizes masked video reconstruction and cross-domain feature alignment to mitigate the scarcity of domain-specific pretraining data. Evaluated on four downstream tasks (classification, segmentation, surgical phase recognition, and localization), EndoMamba consistently outperforms both general video foundation models and state-of-the-art domain-specific methods while maintaining real-time inference speed (≥30 FPS). To our knowledge, EndoMamba is the first endoscopy-adapted foundation model achieving both high accuracy and high efficiency.
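The backbone's split between bidirectional spatial scanning and causal temporal scanning can be illustrated with a toy linear state-space recurrence. The sketch below is purely illustrative, assuming a scalar-parameter recurrence h_t = a*h_{t-1} + b*x_t (the actual Mamba blocks use learned, input-dependent parameters); it shows why a past-to-present temporal scan supports streaming inference, since only the running state must be kept:

```python
import numpy as np

def ssm_scan(x, a=0.9, b=0.1):
    """Causal (past-to-present) linear state-space scan: h_t = a*h_{t-1} + b*x_t."""
    h = np.zeros_like(x[0])
    out = []
    for x_t in x:
        h = a * h + b * x_t
        out.append(h)
    return np.stack(out)

def bidirectional_scan(x, a=0.9, b=0.1):
    """Bidirectional scan: sum of a forward pass and a reversed backward pass."""
    fwd = ssm_scan(x, a, b)
    bwd = ssm_scan(x[::-1], a, b)[::-1]
    return fwd + bwd

# Toy "video": 4 frames, each with 3 spatial tokens of dimension 2.
video = np.random.default_rng(0).normal(size=(4, 3, 2))

# Spatial modeling: bidirectional scan over tokens within each frame.
spatial = np.stack([bidirectional_scan(frame) for frame in video])

# Temporal modeling: causal scan across frames, so an online stream
# only needs the running state, never the full past-frame buffer.
temporal = ssm_scan(spatial)
print(temporal.shape)  # (4, 3, 2)
```

Because the temporal scan is causal, the outputs for the first k frames are identical whether or not later frames have arrived, which is what enables real-time per-frame inference on a live endoscopic stream.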

📝 Abstract
Endoscopic video-based tasks, such as visual navigation and surgical phase recognition, play a crucial role in minimally invasive surgeries by providing real-time assistance. While recent video foundation models have shown promise, their applications are hindered by (1) computational inefficiencies and (2) suboptimal performance caused by limited data for pre-training in endoscopy. To address these issues, we present EndoMamba, a foundation model designed for real-time inference while learning generalized spatiotemporal representations. First, to mitigate computational inefficiencies, we propose the EndoMamba backbone, optimized for real-time inference. Inspired by recent advancements in state space models, EndoMamba integrates Bidirectional Mamba blocks for spatial modeling within individual frames and vanilla Mamba blocks for past-to-present reasoning across the temporal domain. This design enables both strong spatiotemporal modeling and efficient inference on online video streams. Second, we propose a self-supervised hierarchical pre-training paradigm to enhance EndoMamba's representation learning using endoscopic videos while incorporating general video domain knowledge. Specifically, our approach combines masked reconstruction with auxiliary supervision, leveraging low-level reconstruction to capture spatiotemporal structures and high-level alignment to transfer broader knowledge from a pretrained general-video foundation model. Extensive experiments on four downstream tasks (classification, segmentation, surgical phase recognition, and localization) demonstrate that EndoMamba outperforms existing foundation models and task-specific methods while maintaining real-time inference speed. The source code will be released upon acceptance.
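The hierarchical pre-training objective described above combines a low-level masked-reconstruction term with a high-level feature-alignment term. The sketch below is a minimal illustration of that structure, not the paper's implementation: the MSE-on-masked-patches loss, cosine-based alignment loss, and the weight `lam` are common generic choices and are assumptions here:

```python
import numpy as np

def masked_reconstruction_loss(pred, target, mask):
    """Low-level objective: MSE computed only on masked patches (mask=1 where hidden)."""
    diff = (pred - target) ** 2
    return (diff * mask[..., None]).sum() / (mask.sum() * pred.shape[-1] + 1e-8)

def alignment_loss(student_feat, teacher_feat):
    """High-level objective: 1 - cosine similarity to features from a frozen
    general-video teacher model (per token, averaged)."""
    s = student_feat / (np.linalg.norm(student_feat, axis=-1, keepdims=True) + 1e-8)
    t = teacher_feat / (np.linalg.norm(teacher_feat, axis=-1, keepdims=True) + 1e-8)
    return 1.0 - (s * t).sum(-1).mean()

rng = np.random.default_rng(0)
pred, target = rng.normal(size=(8, 16)), rng.normal(size=(8, 16))
mask = (rng.random(8) < 0.75).astype(float)      # e.g. 75% of patches masked
student, teacher = rng.normal(size=(8, 32)), rng.normal(size=(8, 32))

lam = 1.0  # hypothetical weighting; the paper does not specify this value here
total = masked_reconstruction_loss(pred, target, mask) + lam * alignment_loss(student, teacher)
print(float(total) >= 0.0)
```

The design intent is that reconstruction forces the encoder to model endoscopy-specific spatiotemporal structure, while the alignment term distills broader knowledge from a model pre-trained on general video.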
Problem

Research questions and friction points this paper is trying to address.

Address computational inefficiencies in endoscopic video models
Improve spatiotemporal representation learning in endoscopy
Enhance real-time inference for endoscopic video tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Bidirectional Mamba blocks
self-supervised hierarchical pre-training
real-time inference optimization
Qingyao Tian
Ph.D. candidate, Institute of Automation, Chinese Academy of Sciences
AI for healthcare, medical imaging, foundation models
Huai Liao
Dept. of Radiology, The First Affiliated Hospital, Sun Yat-sen University, Guangzhou, China
Xinyan Huang
Dept. of Radiology, The First Affiliated Hospital, Sun Yat-sen University, Guangzhou, China
Bingyu Yang
State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences, Beijing, China; School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China
Dongdong Lei
Centre for Artificial Intelligence and Robotics, Chinese Academy of Sciences, HK, China
Sebastien Ourselin
Professor of Healthcare Engineering, King's College London
medical imaging, medical image computing, medical image analysis, biomedical image analysis
Hongbin Liu
State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences, Beijing, China; Centre for Artificial Intelligence and Robotics, Chinese Academy of Sciences, HK, China; School of Engineering and Imaging Sciences, King’s College London, UK