🤖 AI Summary
This work addresses three challenges in long-horizon online visual mapping: train-test mismatch caused by a fixed global coordinate frame, attention bias toward early anchors, and accumulated trajectory drift. It proposes Anchor3R, a current-frame-centric, streaming feed-forward reconstruction framework. Rather than regressing poses in a persistent global frame, the method uses a transient anchor mechanism that reformulates 3D reconstruction as relative pose estimation and pointmap prediction within a local temporal window, expressed in the current frame's coordinates. Global consistency is recovered through loop-closure reinsertion and motion averaging. On indoor, outdoor, driving, and RGB-D benchmarks, the approach improves long-horizon pose accuracy and dense reconstruction quality over existing streaming baselines while supporting bounded-memory online inference.
📝 Abstract
Long-horizon online visual mapping is a core capability for robot perception, requiring continuous camera-motion and scene-geometry estimation from visual streams under bounded memory and computation. Recent feed-forward 3D reconstruction models provide strong geometric priors, but their streaming variants often predict poses in a fixed coordinate system tied to the first frame or a persistent scene memory. This fixed-gauge design leads to train-test mismatch, attention bias toward early anchors, and accumulated drift on sequences much longer than those seen during training. We propose Anchor3R, a streaming 3D reconstruction framework that treats feed-forward reconstruction as current-centric local measurement prediction rather than persistent global-gauge regression. At each time step, Anchor3R predicts window-relative poses and a local pointmap in the current-frame coordinate system, turning streaming reconstruction into relative-pose measurement generation. These measurements support online pose updates, while loop-closure reinsertion and motion averaging align the trajectory and transform local pointmaps into a coherent global reconstruction. Experiments on indoor, outdoor, driving, and RGB-D benchmarks show that Anchor3R improves long-horizon pose accuracy and dense reconstruction quality over existing streaming baselines, while supporting bounded-memory online inference.
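To make the current-centric idea concrete, the following is a minimal sketch of how a relative-pose measurement expressed in the current frame could update a world pose and place a local pointmap. It is not Anchor3R's implementation: the function names, the 4x4 homogeneous-matrix convention, and the assumption that a measurement `T_cur_prev` maps previous-frame coordinates into the current frame are all illustrative choices. Only open-loop composition is shown; loop-closure reinsertion and motion averaging are omitted.

```python
import numpy as np


def invert_se3(T):
    """Invert a 4x4 rigid transform using R^T and -R^T t."""
    R, t = T[:3, :3], T[:3, 3]
    T_inv = np.eye(4)
    T_inv[:3, :3] = R.T
    T_inv[:3, 3] = -R.T @ t
    return T_inv


def update_pose(T_world_prev, T_cur_prev):
    """Compose a current-centric measurement into a world pose.

    Assumed convention (hypothetical): T_cur_prev maps points from the
    previous camera frame into the current camera frame, so the current
    camera's pose in the world is T_world_prev @ inv(T_cur_prev).
    """
    return T_world_prev @ invert_se3(T_cur_prev)


def pointmap_to_world(T_world_cur, pointmap_cur):
    """Transform an (H, W, 3) pointmap from the current frame to the world."""
    R, t = T_world_cur[:3, :3], T_world_cur[:3, 3]
    return pointmap_cur @ R.T + t


# Usage: a camera that moves one unit along x per step, with no rotation.
# Seen from the current frame, the previous camera sits at x = -1.
T_cur_prev = np.eye(4)
T_cur_prev[0, 3] = -1.0

T_world_cur = np.eye(4)  # the first frame is only a bookkeeping origin here
for _ in range(3):
    T_world_cur = update_pose(T_world_cur, T_cur_prev)

local_points = np.zeros((2, 2, 3))
local_points[..., 2] = 2.0  # every pixel sees a point 2 units ahead
world_points = pointmap_to_world(T_world_cur, local_points)
```

Chaining measurements this way accumulates error over long sequences, which is why the abstract pairs the local predictions with loop-closure reinsertion and motion averaging to keep the trajectory globally consistent.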