EndoMatcher: Generalizable Endoscopic Image Matcher via Multi-Domain Pre-training for Robot-Assisted Surgery

📅 2025-08-07
🤖 AI Summary
To address the challenges of weak texture, large viewpoint variations, and scarce annotations in dense feature matching for endoscopic images in robot-assisted surgery, this paper proposes the first large-scale pre-training framework tailored for multi-domain endoscopic image matching. The authors introduce Endo-Mix6, a cross-organ, cross-imaging-condition dataset comprising approximately 1.2 million image pairs. The method employs a dual-branch Vision Transformer architecture with progressive multi-objective training and dual interaction blocks to strengthen correspondence modeling. High-quality pseudo-labels are generated via Structure-from-Motion (SfM) and synthetic geometric transformations. The framework achieves zero-shot cross-domain generalization: it increases inlier counts by 140.7% on the Hamlyn dataset and 201.4% on the Bladder dataset, and improves Matching Direction Prediction Accuracy (MDPA) on the Gastro-Matching dataset by 9.40%, significantly outperforming state-of-the-art approaches.

📝 Abstract
Generalizable dense feature matching in endoscopic images is crucial for robot-assisted tasks, including 3D reconstruction, navigation, and surgical scene understanding. Yet, it remains a challenge due to difficult visual conditions (e.g., weak textures, large viewpoint variations) and a scarcity of annotated data. To address these challenges, we propose EndoMatcher, a generalizable endoscopic image matcher via large-scale, multi-domain data pre-training. To address difficult visual conditions, EndoMatcher employs a two-branch Vision Transformer to extract multi-scale features, enhanced by dual interaction blocks for robust correspondence learning. To overcome data scarcity and improve domain diversity, we construct Endo-Mix6, the first multi-domain dataset for endoscopic matching. Endo-Mix6 consists of approximately 1.2M real and synthetic image pairs across six domains, with correspondence labels generated using Structure-from-Motion and simulated transformations. The diversity and scale of Endo-Mix6 introduce new challenges in training stability due to significant variations in dataset sizes, distribution shifts, and error imbalance. To address them, a progressive multi-objective training strategy is employed to promote balanced learning and improve representation quality across domains. This enables EndoMatcher to generalize across unseen organs and imaging conditions in a zero-shot fashion. Extensive zero-shot matching experiments demonstrate that EndoMatcher increases the number of inlier matches by 140.69% and 201.43% on the Hamlyn and Bladder datasets over state-of-the-art methods, respectively, and improves the Matching Direction Prediction Accuracy (MDPA) by 9.40% on the Gastro-Matching dataset, achieving dense and accurate matching under challenging endoscopic conditions. The code is publicly available at https://github.com/Beryl2000/EndoMatcher.
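The core architectural idea in the abstract — two feature branches refined by "dual interaction blocks" before correspondences are established — can be sketched in miniature. The code below is a hypothetical, minimal NumPy illustration (single-scale features, plain cross-attention, mutual nearest-neighbour matching); the actual EndoMatcher uses a multi-scale Vision Transformer and a learned matcher, none of which is reproduced here.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def dual_interaction(feat_a, feat_b, dim):
    """One toy 'dual interaction' step: each branch attends to the other
    (cross-attention), so patch features of image A are refined using image B
    and vice versa. feat_a, feat_b: (N, dim) arrays of patch features."""
    attn_ab = softmax(feat_a @ feat_b.T / np.sqrt(dim))  # A queries B
    attn_ba = softmax(feat_b @ feat_a.T / np.sqrt(dim))  # B queries A
    return feat_a + attn_ab @ feat_b, feat_b + attn_ba @ feat_a

def match(feat_a, feat_b):
    """Mutual nearest-neighbour matching on the refined features."""
    sim = feat_a @ feat_b.T
    nn_ab = sim.argmax(axis=1)  # best match in B for each patch of A
    nn_ba = sim.argmax(axis=0)  # best match in A for each patch of B
    return [(i, j) for i, j in enumerate(nn_ab) if nn_ba[j] == i]

rng = np.random.default_rng(0)
fa = rng.standard_normal((64, 32))
fb = rng.standard_normal((64, 32))
fa, fb = dual_interaction(fa, fb, 32)
print(len(match(fa, fb)))  # count of mutual-NN correspondences
```

The point of the cross-attention step is that correspondence cues flow in both directions before matching, which is what makes the interaction "dual" rather than a one-way feature lookup.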
Problem

Research questions and friction points this paper is trying to address.

Generalizable dense feature matching in endoscopic images for robot-assisted surgery tasks
Overcoming difficult visual conditions and scarcity of annotated data in endoscopic matching
Addressing dataset diversity challenges to improve zero-shot matching performance across domains
Innovation

Methods, ideas, or system contributions that make the work stand out.

Two-branch Vision Transformer for multi-scale features
Endo-Mix6 multi-domain dataset with 1.2M image pairs
Progressive multi-objective training for balanced learning
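The "balanced learning" goal of the third point — keeping the largest domains in Endo-Mix6 from dominating training — can be illustrated with a generic size-balanced loss weighting. This is a hedged sketch only: the paper's actual progressive schedule and objective weights are not reproduced here, and `temperature` is an assumed illustrative parameter.

```python
import numpy as np

def domain_weights(sizes, temperature=0.5):
    """Size-balanced domain weights: smaller domains are up-weighted so that
    large datasets do not dominate the combined multi-domain objective.
    Illustrative only; not EndoMatcher's actual schedule."""
    damped = np.asarray(sizes, dtype=float) ** temperature
    w = 1.0 / damped
    return w / w.sum()

def mixed_loss(domain_losses, sizes):
    """Weighted sum of per-domain losses using size-balanced weights."""
    return float(np.dot(domain_weights(sizes), domain_losses))

# Six hypothetical domain sizes (illustrative, not the Endo-Mix6 splits)
sizes = [800_000, 200_000, 100_000, 50_000, 30_000, 20_000]
losses = [0.9, 1.1, 1.3, 1.0, 1.4, 1.2]
print(mixed_loss(losses, sizes))
```

With `temperature=1` this reduces to inverse-size weighting; values below 1 soften the correction, which is one common way to trade off between per-sample and per-domain balance.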
Bingyu Yang
State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China and School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing 100049, China
Qingyao Tian
Ph.D. candidate, Institute of Automation, Chinese Academy of Sciences
AI for healthcare · medical imaging · foundation models
Yimeng Geng
State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China and School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing 100049, China
Huai Liao
Department of Pulmonary and Critical Care Medicine, The First Affiliated Hospital of Sun Yat-sen University, Guangzhou, Guangdong 510275, China
Xinyan Huang
Department of Pulmonary and Critical Care Medicine, The First Affiliated Hospital of Sun Yat-sen University, Guangzhou, Guangdong 510275, China
Jiebo Luo
Hong Kong Institute of Science & Innovation, Hong Kong SAR
Hongbin Liu
Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China and Centre of AI and Robotics, Hong Kong Institute of Science & Innovation, Hong Kong SAR