S-MUSt3R: Sliding Multi-view 3D Reconstruction

📅 2026-02-04
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work addresses the challenge of scaling 3D reconstruction with foundation models on long monocular RGB video sequences, where memory constraints typically hinder performance. The authors propose an efficient, training-free extension strategy that enables the high-performance 3D foundation model MUSt3R to operate on large-scale monocular scenes for the first time. By integrating sequential sliding window segmentation, multi-view geometric alignment, and lightweight loop closure optimization, the method achieves accurate, consistent, and scalable reconstructions in an end-to-end metric space. Evaluated on the TUM, 7-Scenes, and robotic navigation datasets, the approach matches the trajectory accuracy and reconstruction quality of traditional, more complex systems while supporting real-time inference on long video sequences.
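The training-free strategy described above (sliding-window segmentation of a long sequence, followed by geometric alignment of consecutive segments in a shared metric frame) can be sketched in a few lines. This is a minimal illustration under assumptions, not the authors' implementation: the function names, the window and overlap sizes, and the choice of a Umeyama-style similarity fit on the points shared by neighboring windows are all hypothetical.

```python
import numpy as np

def sliding_windows(n_frames, window=32, overlap=8):
    # Split a long frame sequence into overlapping segments
    # (window/overlap sizes are illustrative, not from the paper).
    step = window - overlap
    starts = range(0, max(n_frames - overlap, 1), step)
    return [list(range(s, min(s + window, n_frames))) for s in starts]

def umeyama(src, dst):
    # Least-squares similarity transform (scale s, rotation R,
    # translation t) such that dst ~ s * src @ R.T + t.
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    xs, xd = src - mu_s, dst - mu_d
    cov = xd.T @ xs / len(src)
    U, S, Vt = np.linalg.svd(cov)
    d = np.sign(np.linalg.det(U) * np.linalg.det(Vt))
    D = np.diag([1.0, 1.0, d])  # guard against reflections
    R = U @ D @ Vt
    var_src = (xs ** 2).sum() / len(src)
    s = np.trace(np.diag(S) @ D) / var_src
    t = mu_d - s * R @ mu_s
    return s, R, t
```

In a full system along these lines, each window would be reconstructed independently by the foundation model, and the similarity fit on the 3D points of the frames shared by two consecutive windows would chain the segments into one metric-space reconstruction; a lightweight loop closure would add one more such constraint between revisited segments.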

πŸ“ Abstract
The recent paradigm shift in 3D vision has led to the rise of foundation models with remarkable capabilities in 3D perception from uncalibrated images. However, extending these models to large-scale RGB-stream 3D reconstruction remains challenging due to memory limitations. This work proposes S-MUSt3R, a simple and efficient pipeline that extends the limits of foundation models for monocular 3D reconstruction. Our approach addresses the scalability bottleneck of foundation models through a simple strategy of sequence segmentation followed by segment alignment and lightweight loop closure optimization. Without model retraining, we benefit from the remarkable 3D reconstruction capabilities of the MUSt3R model and achieve trajectory and reconstruction performance comparable to traditional methods with more complex architectures. We evaluate S-MUSt3R on TUM, 7-Scenes, and proprietary robot navigation datasets and show that S-MUSt3R runs successfully on long RGB sequences and produces accurate and consistent 3D reconstructions. Our results highlight the potential of leveraging the MUSt3R model for scalable monocular 3D scene reconstruction in real-world settings, with the important advantage of making predictions directly in metric space.
Problem

Research questions and friction points this paper is trying to address.

3D reconstruction
foundation models
scalability
monocular vision
RGB sequences
Innovation

Methods, ideas, or system contributions that make the work stand out.

foundation model
monocular 3D reconstruction
sequence segmentation
loop closure
metric space