🤖 AI Summary
To address the scarcity of high-quality ground-truth depth annotations and the severe label noise in real-world endoscopic imagery—both of which cause depth estimation models to generalize poorly—this work proposes the first zero-shot cross-dataset depth estimation method designed specifically for endoscopic video. Methodologically, we (1) construct an endoscopy-specific foundation model for depth estimation; (2) design a teacher-confidence-guided robust self-training framework to mitigate annotation noise; and (3) introduce a weighted scale-and-shift-invariant loss that adaptively suppresses erroneous pixel predictions. Experiments demonstrate that our method achieves a 33% reduction in absolute relative error over the medical-domain state of the art and a 34% improvement over existing general-purpose foundation models on zero-shot relative depth estimation. Moreover, it provides a strong initialization for downstream fine-tuning, consistently outperforming prior approaches across diverse endoscopic domains.
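The teacher-confidence-guided self-training above can be illustrated with a toy sketch. Everything here is a hypothetical stand-in, not the paper's implementation: the "student" is a single scale parameter, the teacher is a synthetic callable returning pseudo-depth plus a per-pixel confidence map, and the learning rate is arbitrary.

```python
import numpy as np

def self_training_step(student_scale, images, teacher, lr=0.1):
    """One toy confidence-weighted self-training step.

    The 'student' is a single scale parameter predicting depth = s * image;
    the teacher returns (pseudo_depth, confidence) per pixel.
    """
    pseudo, conf = teacher(images)
    pred = student_scale * images
    # Normalize confidences so they act as per-pixel learning weights:
    # pixels the teacher is unsure about contribute less to the update.
    w = conf / (conf.sum() + 1e-6)
    loss = (w * (pred - pseudo) ** 2).sum()
    # Gradient of the weighted squared error w.r.t. the scale parameter.
    grad = (w * 2.0 * (pred - pseudo) * images).sum()
    return student_scale - lr * grad, loss
```

Iterating this step lets the student recover the teacher's mapping on high-confidence pixels while noisy, low-confidence pixels exert little pull on the parameters — the core idea behind confidence-guided self-training.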
📝 Abstract
Single-image depth estimation is essential for endoscopy tasks such as localization, reconstruction, and augmented reality. Most existing methods in surgical scenes focus on in-domain depth estimation, which limits their real-world applicability. This constraint stems from the scarcity and poor labeling quality of medical data available for training. In this work, we present EndoOmni, the first foundation model for zero-shot cross-domain depth estimation in endoscopy. To harness the potential of diverse training data, we refine the advanced self-learning paradigm that employs a teacher model to generate pseudo-labels, which guide a student model trained on large-scale labeled and unlabeled data. To address the training disturbance caused by inherent noise in depth labels, we propose a robust training framework that leverages both depth labels and estimated confidence from the teacher model to jointly guide student training. Moreover, we propose a weighted scale-and-shift-invariant loss that adaptively adjusts learning weights based on label confidence, biasing learning towards cleaner label pixels while reducing the influence of highly noisy ones. Experiments on zero-shot relative depth estimation show that EndoOmni improves on state-of-the-art medical-imaging methods by 33% and on existing foundation models by 34% in absolute relative error on specific datasets. Furthermore, our model provides strong initialization for fine-tuning metric depth estimation, maintaining superior performance in both in-domain and out-of-domain scenarios. The source code is publicly available at https://github.com/TianCuteQY/EndoOmni.
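A weighted scale-and-shift-invariant loss of the kind described might look as follows. This is a minimal sketch under our own assumptions — the confidence-weighted closed-form alignment and the absolute-error form are illustrative choices, not necessarily the paper's exact formulation:

```python
import numpy as np

def weighted_ssi_loss(pred, target, conf, eps=1e-6):
    """Sketch of a confidence-weighted scale-and-shift-invariant loss.

    pred, target: 1-D arrays of per-pixel depth (flattened image).
    conf: per-pixel confidence weights in [0, 1], e.g. from a teacher model.
    """
    w = conf
    sw = w.sum() + eps
    # Weighted least-squares scale s and shift t aligning pred to target,
    # so the loss is invariant to the prediction's global scale and shift.
    mp = (w * pred).sum() / sw
    mt = (w * target).sum() / sw
    var = (w * (pred - mp) ** 2).sum() / sw
    cov = (w * (pred - mp) * (target - mt)).sum() / sw
    s = cov / (var + eps)
    t = mt - s * mp
    aligned = s * pred + t
    # Confidence-weighted absolute error: cleaner (high-confidence) pixels
    # dominate the loss, while noisy pixels are down-weighted.
    return (w * np.abs(aligned - target)).sum() / sw
```

With uniform confidence this reduces to an ordinary scale-and-shift-invariant loss; lowering the confidence of a corrupted pixel shrinks both its error contribution and its leverage on the alignment, which is the adaptive noise suppression the abstract describes.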