🤖 AI Summary
This paper tackles unsupervised multi-frame depth estimation from monocular endoscopic videos, where low-texture, homogeneous regions and inter-frame brightness fluctuations hinder temporal consistency modeling. The authors propose a self-supervised framework comprising: (1) a learnable PatchMatch module that increases matching discriminability in weakly textured areas; (2) cross-teaching and self-teaching consistency constraints that provide effective regularization against brightness fluctuations; and (3) dynamic multi-frame fusion at test time, which improves depth accuracy as more frames become available. The method achieves state-of-the-art results on four endoscopic datasets (SCARED, EndoSLAM, Hamlyn, and SERV-CT), outperforming prior approaches across all benchmarks. The code and pretrained models will be publicly released.
📝 Abstract
This work delves into unsupervised monocular depth estimation in endoscopy, which leverages adjacent frames to establish a supervisory signal during the training phase. For many clinical applications, e.g., surgical navigation, temporally correlated frames are also available at test time. Due to the lack of direct depth cues, making full use of the temporal correlation among multiple video frames in both phases is crucial for accurate depth estimation. However, several characteristics of endoscopic scenes, such as low and homogeneous textures and inter-frame brightness fluctuations, limit the performance gain from this temporal correlation. To fully exploit it, we propose a novel unsupervised multi-frame monocular depth estimation model. The proposed model integrates a learnable PatchMatch module to adaptively increase the discriminative ability in regions with low and homogeneous textures, and enforces cross-teaching and self-teaching consistencies that provide effective regularization against brightness fluctuations. Furthermore, as a byproduct of the self-teaching paradigm, the proposed model improves its depth predictions when more frames are input at test time. We conduct detailed experiments on multiple datasets, including SCARED, EndoSLAM, Hamlyn, and SERV-CT. The experimental results indicate that our model outperforms state-of-the-art competitors. The source code and trained models will be made publicly available upon acceptance.
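The abstract does not spell out the supervisory signal, but in unsupervised monocular depth estimation it is conventionally the view-synthesis photometric loss: predicted depth and a relative camera pose warp an adjacent frame into the target view, and the reconstruction error supervises the network. Below is a minimal NumPy sketch of that objective; the function name, grayscale frames, and nearest-neighbour sampling are illustrative assumptions, not the paper's actual implementation (which additionally employs the PatchMatch module and teaching consistencies described above).

```python
import numpy as np

def photometric_loss(target, source, depth, K, T):
    """Warp `source` into the target view using predicted per-pixel depth
    and a 4x4 relative pose T, then return the mean L1 photometric error.
    Nearest-neighbour sampling keeps the sketch short; real models use
    differentiable bilinear sampling plus SSIM and smoothness terms."""
    h, w = depth.shape
    ys, xs = np.mgrid[0:h, 0:w]
    # Homogeneous pixel coordinates, shape (3, h*w).
    pix = np.stack([xs, ys, np.ones_like(xs)]).reshape(3, -1).astype(float)
    # Back-project to 3D points in the target camera, then move to the
    # source camera and re-project with the intrinsics K.
    cam = (np.linalg.inv(K) @ pix) * depth.reshape(1, -1)
    src = K @ (T @ np.vstack([cam, np.ones((1, cam.shape[1]))]))[:3]
    u = np.round(src[0] / src[2]).astype(int)
    v = np.round(src[1] / src[2]).astype(int)
    valid = (u >= 0) & (u < w) & (v >= 0) & (v < h)
    warped = np.zeros(h * w)
    warped[valid] = source[v[valid], u[valid]]
    return np.abs(target.reshape(-1) - warped)[valid].mean()
```

With an identity pose and identical frames the warp maps every pixel to itself and the loss is zero, which is a quick sanity check for any implementation of this objective.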