SwiftVGGT: A Scalable Visual Geometry Grounded Transformer for Large-Scale Scenes

📅 2025-11-23
🤖 AI Summary
To address the trade-off between accuracy and efficiency in large-scale 3D reconstruction, this paper proposes SwiftVGGT, a training-free method for end-to-end dense reconstruction that integrates visual-geometric priors with self-attention mechanisms. It introduces a point-based chunk alignment strategy and a single-step Sim(3) SVD pose optimization, replacing conventional iterative refinement, to achieve efficient loop closure and globally consistent alignment without an external localization model. Experiments demonstrate state-of-the-art reconstruction quality across multiple large-scale benchmarks, with inference time reduced to only 33% of that of recent VGGT-based approaches, significantly enhancing the scalability and practicality of high-accuracy 3D reconstruction in real-world large-scale scenes.

📝 Abstract
3D reconstruction in large-scale scenes is a fundamental task in 3D perception, but the inherent trade-off between accuracy and computational efficiency remains a significant challenge. Existing methods either prioritize speed and produce low-quality results, or achieve high-quality reconstruction at the cost of slow inference times. In this paper, we propose SwiftVGGT, a training-free method that significantly reduces inference time while preserving high-quality dense 3D reconstruction. To maintain global consistency in large-scale scenes, SwiftVGGT performs loop closure without relying on an external Visual Place Recognition (VPR) model. This removes redundant computation and enables accurate reconstruction over kilometer-scale environments. Furthermore, we propose a simple yet effective point sampling method to align neighboring chunks using a single Sim(3)-based Singular Value Decomposition (SVD) step. This eliminates the need for the Iteratively Reweighted Least Squares (IRLS) optimization commonly used in prior work, leading to substantial speed-ups. We evaluate SwiftVGGT on multiple datasets and show that it achieves state-of-the-art reconstruction quality while requiring only 33% of the inference time of recent VGGT-based large-scale reconstruction approaches.
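The single Sim(3) SVD step the abstract refers to can be computed in closed form from corresponding points of two neighboring chunks; the classic formulation is Umeyama's algorithm. Below is a minimal numpy sketch of that closed-form solve. The function name, interface, and point-sampling details are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def sim3_from_correspondences(src, dst):
    """Closed-form Sim(3) estimate (scale s, rotation R, translation t)
    mapping src onto dst via a single SVD (Umeyama, 1991).
    src, dst: (N, 3) arrays of corresponding 3D points."""
    mu_src = src.mean(axis=0)
    mu_dst = dst.mean(axis=0)
    src_c = src - mu_src
    dst_c = dst - mu_dst

    # Cross-covariance of the centered point sets.
    cov = dst_c.T @ src_c / len(src)
    U, D, Vt = np.linalg.svd(cov)

    # Reflection correction keeps R a proper rotation (det = +1).
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1.0

    R = U @ S @ Vt
    var_src = (src_c ** 2).sum() / len(src)
    s = np.trace(np.diag(D) @ S) / var_src
    t = mu_dst - s * R @ mu_src
    return s, R, t
```

Unlike IRLS, which re-solves a weighted least-squares problem over several iterations to downweight outliers, this runs exactly once per chunk pair, which is where the claimed speed-up over iterative refinement comes from.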
Problem

Research questions and friction points this paper is trying to address.

Balancing accuracy and computational efficiency in large-scale 3D reconstruction
Achieving loop closure without external Visual Place Recognition models
Replacing iterative optimization with efficient Sim(3)-based SVD alignment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Training-free method reduces inference time
Loop closure without external VPR model
Sim(3)-based SVD replaces IRLS optimization
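The paper does not detail its loop-closure criterion in this summary; a generic way to detect loop candidates without an external VPR model is to compare pooled global descriptors of each chunk (e.g. features already produced by the reconstruction backbone) by cosine similarity. The sketch below is such a generic baseline, not the paper's actual mechanism; all names and thresholds are assumptions.

```python
import numpy as np

def find_loop_candidates(chunk_descs, cur_idx, min_gap=10, thresh=0.85):
    """VPR-free loop-candidate retrieval (illustrative sketch):
    compare the current chunk's global descriptor against earlier,
    non-adjacent chunks by cosine similarity.
    chunk_descs: (num_chunks, d) array of per-chunk descriptors."""
    cur = chunk_descs[cur_idx]
    cur = cur / np.linalg.norm(cur)
    candidates = []
    for i in range(cur_idx - min_gap):  # skip temporally adjacent chunks
        d = chunk_descs[i] / np.linalg.norm(chunk_descs[i])
        sim = float(cur @ d)
        if sim > thresh:
            candidates.append((i, sim))
    # Best match first; each hit would then be verified geometrically.
    return sorted(candidates, key=lambda x: -x[1])
```

Reusing the backbone's own features for retrieval is what removes the redundant computation of running a separate VPR network over every frame.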