🤖 AI Summary
This work investigates how 3D foundation models (3DFMs) handle extreme, non-overlapping viewpoints, a setting they were never explicitly trained for. We find that their internal representations nonetheless capture scene geometry under these conditions. To strengthen this emergent property, we propose a lightweight alignment method that fine-tunes only a small subset of bias parameters in the backbone, while the depth, point cloud, and pose prediction heads remain entirely frozen, which keeps the adaptation computationally cheap. For systematic evaluation, we introduce MegaUnScene, a new benchmark of real-world Internet scenes unseen by existing 3DFMs, with dedicated test splits for relative pose estimation and dense 3D reconstruction. Experiments show substantial improvements in relative pose estimation under extreme viewpoints, without degrading per-image depth or point cloud quality.
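To make the "bias-only" idea concrete, here is a minimal sketch of how one might decide which parameters to train. It is not the authors' implementation: the parameter names, the `backbone.` prefix, and the `attn` keyword used to pick the "small subset" are all hypothetical, since the summary does not say which bias terms are selected.

```python
from typing import Dict, Iterable


def select_trainable_biases(
    param_names: Iterable[str],
    backbone_prefix: str = "backbone.",          # assumed naming convention
    layer_keywords: Iterable[str] = ("attn",),   # assumed subset criterion
) -> Dict[str, bool]:
    """Map each parameter name to True (trainable) or False (frozen).

    A parameter is trainable only if it is a bias term inside the backbone
    and its name contains one of the chosen keywords. Everything else,
    including every decoder head, stays frozen.
    """
    keywords = tuple(layer_keywords)
    plan = {}
    for name in param_names:
        is_backbone = name.startswith(backbone_prefix)
        is_bias = name.endswith(".bias")
        in_subset = any(k in name for k in keywords)
        plan[name] = is_backbone and is_bias and in_subset
    return plan


# Hypothetical parameter names, for illustration only.
names = [
    "backbone.blocks.0.attn.qkv.weight",
    "backbone.blocks.0.attn.qkv.bias",
    "backbone.blocks.0.mlp.fc1.bias",
    "depth_head.proj.bias",
    "pose_head.fc.bias",
]
plan = select_trainable_biases(names)
trainable = [n for n, flag in plan.items() if flag]
print(trainable)
```

In a PyTorch model, the same plan could plausibly be applied by iterating over `model.named_parameters()` and setting each parameter's `requires_grad` flag accordingly, so that the optimizer only ever updates the selected backbone biases.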
📝 Abstract
3D foundation models (3DFMs) have recently transformed 3D vision, enabling joint prediction of depths, poses, and point maps directly from images. Yet their ability to reason under extreme, non-overlapping views remains largely unexplored. In this work, we study their internal representations and find that 3DFMs exhibit an emergent understanding of extreme-view geometry, despite never being trained for such conditions. To further enhance these capabilities, we introduce a lightweight alignment scheme that refines their internal 3D representation by tuning only a small subset of backbone bias terms, leaving all decoder heads frozen. This targeted adaptation substantially improves relative pose estimation under extreme viewpoints without degrading per-image depth or point quality. Additionally, we contribute MegaUnScene, a new benchmark of Internet scenes unseen by existing 3DFMs, with dedicated test splits for both relative pose estimation and dense 3D reconstruction. All code and data will be released.
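The abstract does not state which metric is used for relative pose estimation. As an illustration only, the sketch below computes a commonly used rotation error: the geodesic angle between a predicted and a ground-truth relative rotation, obtained from the trace of their product. The function names and the test rotations are assumptions made for this example.

```python
import math
from typing import List

Matrix = List[List[float]]


def rotation_angle_deg(r_pred: Matrix, r_gt: Matrix) -> float:
    """Geodesic angle in degrees between two 3x3 rotation matrices."""
    # trace(r_pred^T @ r_gt) equals the sum of elementwise products.
    trace = sum(r_pred[i][j] * r_gt[i][j] for i in range(3) for j in range(3))
    cos_angle = (trace - 1.0) / 2.0
    # Clamp to guard against floating-point drift outside [-1, 1].
    cos_angle = max(-1.0, min(1.0, cos_angle))
    return math.degrees(math.acos(cos_angle))


def rot_z(deg: float) -> Matrix:
    """Rotation about the z-axis by `deg` degrees (helper for the example)."""
    c, s = math.cos(math.radians(deg)), math.sin(math.radians(deg))
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


# Two cameras whose predicted and true relative rotations differ by 90 degrees.
error = rotation_angle_deg(rot_z(30.0), rot_z(120.0))
print(f"rotation error: {error:.2f} degrees")
```

An error near 0 degrees means the predicted relative rotation matches the ground truth; larger angles indicate a worse estimate. Translation error would need a separate measure, which is not covered here.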