🤖 AI Summary
Feed-forward 3D reconstruction from in-the-wild images is highly sensitive to noisy inputs that share little or no viewpoint overlap with the rest of the collection, because such models lack the geometric verification and outlier rejection mechanisms built into traditional Structure-from-Motion (SfM) pipelines.
Method: We observe that visual geometry Transformers (e.g., VGGT) develop view-discriminative representations in their deeper layers, even without additional supervision or fine-tuning, and that these layers suppress responses to irrelevant views. By analyzing layer-wise activation patterns under both synthetic and real-world noise, we identify this implicit discriminative representation and exploit it for training-free view selection.
Contribution/Results: Experiments on controlled and in-the-wild datasets demonstrate substantial improvements in the robustness and reconstruction accuracy of feed-forward pipelines. This work is the first to uncover the intrinsic noise-robustness mechanism of geometric Transformers, establishing a new paradigm for unsupervised, lightweight 3D reconstruction from in-the-wild imagery.
📝 Abstract
Reliable 3D reconstruction from in-the-wild image collections is often hindered by "noisy" images: irrelevant inputs with little or no view overlap with the others. While traditional Structure-from-Motion pipelines handle such cases through geometric verification and outlier rejection, feed-forward 3D reconstruction models lack these explicit mechanisms, leading to degraded performance under in-the-wild conditions. In this paper, we discover that an existing feed-forward reconstruction model such as VGGT, despite lacking explicit outlier-rejection mechanisms or noise-aware training, can inherently distinguish distractor images. Through an in-depth analysis under varying proportions of synthetic distractors, we identify a specific layer that naturally exhibits outlier-suppressing behavior. Further probing reveals that this layer encodes discriminative internal representations that enable effective noise filtering, which we leverage directly to perform outlier-view rejection in feed-forward 3D reconstruction without any additional fine-tuning or supervision. Extensive experiments on both controlled and in-the-wild datasets demonstrate that this implicit filtering mechanism is consistent and generalizes well across diverse scenarios.
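To make the idea of training-free outlier-view rejection concrete, the following is a minimal sketch and not the authors' implementation. It assumes one pooled feature vector per view has already been extracted from some intermediate layer of a feed-forward model. The names `cosine` and `select_views`, the choice of a reference view, the cosine-similarity score, and the `threshold` value are all hypothetical choices made for illustration; the abstract does not specify how the layer's representations are scored.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_views(
    features: Sequence[Sequence[float]],
    ref_index: int = 0,
    threshold: float = 0.5,
) -> List[int]:
    """Return indices of views whose feature is similar enough to the reference view.

    Assumption: `features[i]` is a pooled per-view feature taken from an
    intermediate layer; how it is extracted is outside this sketch.
    """
    ref = features[ref_index]
    scores = [cosine(f, ref) for f in features]
    # Keep a view only if its similarity to the reference clears the threshold.
    return [i for i, s in enumerate(scores) if s >= threshold]


# Toy input: three views of one scene and two unrelated distractors.
toy_features = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.8, 0.0, 0.2],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
kept = select_views(toy_features, ref_index=0, threshold=0.5)
print(kept)  # [0, 1, 2]
```

In this toy setup the kept views would then be passed back through the reconstruction model, while the two distractors are dropped before reconstruction. A fixed threshold is the simplest possible decision rule; in practice it would need to be chosen per layer and per scoring function.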