🤖 AI Summary
This work addresses the degradation in pose estimation that arises when geometric foundation models (GFMs) are used directly for SLAM tracking, where geometric inaccuracies in the model predictions corrupt the estimated poses. To mitigate this issue, the authors propose a decoupled framework that leverages classical feature-based SLAM for robust, low-latency pose tracking, while delegating dense mapping exclusively to the GFM. Scale consistency in reconstruction is preserved through pose anchoring and lightweight frame- and submap-scale optimization. This approach represents the first effort to decouple SLAM from GFMs, effectively preventing the propagation of GFM-induced errors into the pose estimation pipeline. Experimental results demonstrate reconstruction errors of about 2 cm per 10 m chunk on a building-scale indoor dataset and 10 cm per 30 m chunk on large-scale outdoor datasets, with superior trajectory accuracy and a 10%–20% improvement in reconstruction precision over existing methods.
📝 Abstract
Recent works have explored unifying SLAM with geometric foundation models (GFMs). However, directly using GFM predictions for tracking is highly sensitive to model capability and uncertainty, as geometric inaccuracies in the predictions can adversely affect pose estimation. To address this limitation, we present a decoupled framework that integrates classical feature-based SLAM with GFMs and achieves higher-quality, more consistent dense reconstruction. In brief, we use classical visual SLAM for robust low-latency tracking and use GFMs exclusively for mapping. By anchoring mapping to poses produced by the SLAM module and optimizing across depth scales, the proposed design avoids propagating inaccuracies from GFM predictions into pose estimation while imposing geometric constraints on the reconstruction. The system builds submaps from multiple posed keyframes and enforces scale consistency via lightweight frame and submap scale optimization. It also performs projection-based point cloud fusion within each submap, and updates submaps online to reflect trajectory updates from the feature-based SLAM. To evaluate the tracking and reconstruction performance of our method, we introduce a loop-rich, building-scale indoor dataset with accurate sensor trajectories and LiDAR ground truth. Experiments show that our approach achieves superior trajectory accuracy while improving reconstruction precision by 10%–20% over existing methods, with about 2 cm reconstruction error per 10 m chunk on our building-scale dataset. On large-scale outdoor datasets, it attains 10 cm error per 30 m chunk (with respect to LiDAR ground-truth models).
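The abstract does not spell out how the frame scale optimization is formulated. As a rough illustration of the idea of anchoring GFM depth to the SLAM module, the sketch below estimates a single per-frame scale factor from sparse SLAM landmark depths and the GFM depths predicted at the same pixels. The median-of-ratios estimator, the function names, and the sample values are all assumptions made for illustration and are not taken from the paper.

```python
from statistics import median


def estimate_frame_scale(slam_depths, gfm_depths, min_depth=1e-6):
    """Estimate s such that s * gfm_depth approximates slam_depth.

    Hypothetical helper: uses the median of per-point depth ratios,
    which tolerates a minority of outlier correspondences.
    """
    ratios = [
        d_slam / d_gfm
        for d_slam, d_gfm in zip(slam_depths, gfm_depths)
        if d_slam > min_depth and d_gfm > min_depth  # skip invalid depths
    ]
    if not ratios:
        raise ValueError("no valid depth correspondences")
    return median(ratios)


def rescale_depths(gfm_depths, scale):
    """Bring GFM depths into the metric frame defined by the SLAM poses."""
    return [scale * d for d in gfm_depths]


# Illustrative values only: depths of SLAM landmarks and the GFM depths
# sampled at the same pixel locations. The last pair is a deliberate outlier.
slam = [2.0, 4.1, 6.0, 8.0]
gfm = [1.0, 2.0, 3.0, 40.0]

scale = estimate_frame_scale(slam, gfm)
aligned = rescale_depths(gfm[:3], scale)
```

A robust estimator is used here because a least-squares fit would be pulled toward the outlier pair. A submap-level scale could be obtained in the same spirit by pooling correspondences across the keyframes of a submap, though the paper may formulate this differently.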