Emergent Outlier View Rejection in Visual Geometry Grounded Transformers

📅 2025-12-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
Feed-forward 3D reconstruction from in-the-wild images is highly sensitive to noisy inputs that have no viewpoint overlap with the rest of the collection, because such models lack the geometric verification and outlier rejection mechanisms inherent in traditional Structure-from-Motion (SfM). Method: We observe that visual geometry Transformers (e.g., VGGT) intrinsically develop view-discriminative representations in their deeper layers, even without additional supervision or fine-tuning, and spontaneously suppress responses to irrelevant views. By analyzing layer-wise activation patterns under both synthetic and real-world noise, we identify this implicit discriminative representation and exploit it for training-free view selection. Contribution/Results: Experiments on controlled and in-the-wild datasets demonstrate substantial improvements in the robustness and reconstruction accuracy of feed-forward pipelines. This work is the first to uncover the intrinsic noise-robustness mechanism of geometric Transformers, establishing a new paradigm for unsupervised, lightweight 3D reconstruction from in-the-wild imagery.

📝 Abstract
Reliable 3D reconstruction from in-the-wild image collections is often hindered by "noisy" images: irrelevant inputs with little or no view overlap with the others. While traditional Structure-from-Motion pipelines handle such cases through geometric verification and outlier rejection, feed-forward 3D reconstruction models lack these explicit mechanisms, leading to degraded performance under in-the-wild conditions. In this paper, we discover that existing feed-forward reconstruction models such as VGGT, despite lacking explicit outlier-rejection mechanisms or noise-aware training, can inherently distinguish distractor images. Through an in-depth analysis under varying proportions of synthetic distractors, we identify a specific layer that naturally exhibits outlier-suppressing behavior. Further probing reveals that this layer encodes discriminative internal representations that enable effective noise filtering, which we leverage directly to perform outlier-view rejection in feed-forward 3D reconstruction without any additional fine-tuning or supervision. Extensive experiments on both controlled and in-the-wild datasets demonstrate that this implicit filtering mechanism is consistent and generalizes well across diverse scenarios.
Problem

Research questions and friction points this paper is trying to address.

Reject outlier images in feed-forward 3D reconstruction models
Enable robust 3D reconstruction from noisy in-the-wild image collections
Leverage inherent model layers for outlier suppression without retraining
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverages VGGT's internal outlier-suppressing layer
Uses discriminative representations for noise filtering
Performs outlier rejection without fine-tuning or supervision
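
The idea of filtering views with features from an intermediate layer can be illustrated with a small sketch. This is not the paper's procedure: the abstract does not state which layer is used or how views are scored. The sketch assumes each view has already been reduced to one pooled feature vector taken from some intermediate layer, and it swaps in a simple stand-in score, the mean cosine similarity of a view to all other views. The names `view_scores` and `select_views` and the fixed `threshold` are hypothetical.

```python
import numpy as np


def view_scores(features: np.ndarray) -> np.ndarray:
    """Score each view by its mean cosine similarity to all other views.

    `features` has shape (num_views, dim): one pooled feature vector per
    view, assumed to come from an intermediate layer (hypothetical setup).
    """
    if features.ndim != 2 or features.shape[0] < 2:
        raise ValueError("expected an array of shape (num_views >= 2, dim)")
    # L2-normalize each row; the clip guards against all-zero vectors.
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    unit = features / np.clip(norms, 1e-12, None)
    sim = unit @ unit.T            # pairwise cosine similarities
    np.fill_diagonal(sim, 0.0)     # ignore each view's similarity to itself
    return sim.sum(axis=1) / (features.shape[0] - 1)


def select_views(features: np.ndarray, threshold: float = 0.5) -> np.ndarray:
    """Return a boolean mask that is True for views kept as inliers.

    The fixed threshold is an assumption made for illustration only.
    """
    return view_scores(features) >= threshold


if __name__ == "__main__":
    # Toy data: four views that share a dominant direction, plus two
    # unrelated views that point along their own axes.
    feats = np.zeros((6, 8))
    for i in range(4):
        feats[i, 0] = 1.0
        feats[i, i + 1] = 0.1
    feats[4, 5] = 1.0
    feats[5, 6] = 1.0
    print(select_views(feats, threshold=0.5))
```

A fixed threshold is the simplest choice but depends on the proportion of distractors, since every distractor lowers the mean score of the inlier views; a relative rule, such as comparing each score to the median, would be one alternative.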