DEFOM-Stereo: Depth Foundation Model Based Stereo Matching

📅 2025-01-16
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address inaccurate depth estimation in stereo matching caused by occlusions and texture-less regions, this paper proposes the first Depth Foundation Model (DEFOM)-driven iterative stereo matching framework. Methodologically: (1) we design a joint CNN-DEFOM feature encoder that integrates local geometric details with global semantic context; (2) we initialize disparity using a monocular relative depth foundation model and introduce a scale-adaptive iterative refinement mechanism; (3) we establish a monocular-binocular joint modeling paradigm with strong zero-shot transfer. Trained on Scene Flow, our method matches state-of-the-art accuracy there while demonstrating much stronger zero-shot generalization than existing approaches, and it achieves state-of-the-art performance on the KITTI 2012, KITTI 2015, Middlebury, and ETH3D benchmarks. Under the joint evaluation of the Robust Vision Challenge, it simultaneously outperforms previous models on the individual benchmarks. These results validate the effectiveness of leveraging depth foundation models for robust, generalizable stereo matching.

📝 Abstract
Stereo matching is a key technique for metric depth estimation in computer vision and robotics. Real-world challenges such as occlusion and texture-less regions hinder accurate disparity estimation from binocular matching cues. Recently, monocular relative depth estimation has shown remarkable generalization using vision foundation models. Thus, to facilitate robust stereo matching with monocular depth cues, we incorporate a robust monocular relative depth model into the recurrent stereo-matching framework, building a new framework for depth foundation model-based stereo matching, DEFOM-Stereo. In the feature extraction stage, we construct the combined context and matching feature encoder by integrating features from conventional CNNs and DEFOM. In the update stage, we use the depth predicted by DEFOM to initialize the recurrent disparity and introduce a scale update module to refine the disparity at the correct scale. DEFOM-Stereo is verified to have performance on the Scene Flow dataset comparable with state-of-the-art (SOTA) methods, while showing much stronger zero-shot generalization. Moreover, DEFOM-Stereo achieves SOTA performance on the KITTI 2012, KITTI 2015, Middlebury, and ETH3D benchmarks, ranking 1st on many metrics. In the joint evaluation under the Robust Vision Challenge, our model simultaneously outperforms previous models on the individual benchmarks. Both results demonstrate the outstanding capabilities of the proposed model.
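The two update-stage ideas in the abstract, initializing disparity from monocular relative depth and then correcting its unknown scale, can be illustrated with a minimal numpy sketch. This is not the paper's learned scale update module; it is a hypothetical simplification in which relative inverse depth is treated as disparity up to a global scale, and the scale is chosen by a crude photometric check against the right image.

```python
import numpy as np

def init_disparity_from_relative_depth(rel_depth, scale=1.0):
    # Monocular relative inverse depth is proportional to disparity up to an
    # unknown per-image scale (hypothetical simplification of the paper's idea).
    return scale * rel_depth

def scale_update(disp, left, right, candidates=(0.5, 1.0, 2.0)):
    # Pick the global scale whose warped right image best matches the left
    # image photometrically (a crude stand-in for a learned scale update).
    H, W = left.shape
    xs = np.arange(W)
    best_s, best_err = 1.0, np.inf
    for s in candidates:
        d = np.clip(disp * s, 0, W - 1)
        # Nearest-neighbor warp: sample the right image at x - d(x) per row.
        warped = np.stack([right[y, np.clip((xs - d[y]).astype(int), 0, W - 1)]
                           for y in range(H)])
        err = np.abs(warped - left).mean()
        if err < best_err:
            best_s, best_err = s, err
    return disp * best_s

# Synthetic example: the true disparity is 4 px, but the relative-depth
# initialization is off by a factor of two; the scale search recovers it.
H, W = 4, 16
right = np.tile(np.arange(W, dtype=float), (H, 1))
left = np.tile(np.maximum(np.arange(W) - 4, 0).astype(float), (H, 1))
disp0 = init_disparity_from_relative_depth(np.full((H, W), 2.0))
disp = scale_update(disp0, left, right)
```

In the actual method, the scale refinement is learned and interleaved with the recurrent disparity updates rather than being a one-shot grid search.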
Problem

Research questions and friction points this paper is trying to address.

Stereoscopic Image Matching
Object Occlusion
Texture-less Surfaces
Innovation

Methods, ideas, or system contributions that make the work stand out.

DEFOM-Stereo
Depth-aware Stereo Matching
Occlusion Handling
Hualie Jiang
Insta360/Antigravity
Computer Vision · 3D Vision · Omnidirectional Vision
Zhiqiang Lou
Insta360 Research
Laiyan Ding
The Chinese University of Hong Kong, Shenzhen
Rui Xu
Insta360 Research
Minglang Tan
Insta360 Research
Wenjie Jiang
Insta360 Research
Rui Huang
The Chinese University of Hong Kong, Shenzhen