🤖 AI Summary
This work addresses geometric inconsistency and metric drift in monocular SLAM caused by viewpoint dependency and sensor noise by introducing a novel approach that integrates a feedforward geometric foundation model with a global 3D Gaussian splatting representation. The method stabilizes scene geometry through an expectation-maximization framework and achieves robust monocular pose estimation by incorporating iterative closest point (ICP) alignment. Furthermore, it parameterizes multimodal features directly on the Gaussian splatting map, enabling downstream tasks such as in-place open-set segmentation. Experimental evaluations on the 7-Scenes, TUM RGB-D, and Replica datasets demonstrate that the proposed approach consistently outperforms recent baselines in both geometric consistency and localization accuracy.
📝 Abstract
Feed-forward geometric foundation models can infer dense point clouds and camera motion directly from RGB streams, providing priors for monocular SLAM. However, their predictions are often view-dependent and noisy: geometry can vary across viewpoints and under image transformations, and local metric properties may drift between frames. We present MonoEM-GS, a monocular mapping pipeline that integrates such geometric predictions into a global Gaussian Splatting representation while explicitly addressing these inconsistencies. MonoEM-GS couples Gaussian Splatting with an Expectation–Maximization formulation to stabilize geometry, and employs ICP-based alignment for monocular pose estimation. Beyond geometry, MonoEM-GS parameterizes Gaussians with multi-modal features, enabling in-place open-set segmentation and other downstream queries directly on the reconstructed map. We evaluate MonoEM-GS on 7-Scenes, TUM RGB-D, and Replica, and compare against recent baselines.
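The abstract does not spell out the ICP-based alignment step, but at the core of point-to-point ICP is a closed-form rigid alignment (the Kabsch/Umeyama SVD solution). Below is a minimal sketch of that single alignment step, assuming known point correspondences; the paper's actual pipeline (correspondence search, outlier handling, coupling with the Gaussian map) is not reproduced here.

```python
import numpy as np

def rigid_align(src, dst):
    """One point-to-point alignment step: find R, t minimizing ||R @ src + t - dst||.

    src, dst: (N, 3) arrays of corresponding 3D points.
    """
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    H = (src - mu_s).T @ (dst - mu_d)           # 3x3 cross-covariance of centered points
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))      # guard against a reflection solution
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = mu_d - R @ mu_s
    return R, t

# Synthetic check: recover a known rotation and translation.
rng = np.random.default_rng(0)
src = rng.normal(size=(100, 3))
theta = 0.3
R_true = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                   [np.sin(theta),  np.cos(theta), 0.0],
                   [0.0,            0.0,           1.0]])
t_true = np.array([0.5, -0.2, 1.0])
dst = src @ R_true.T + t_true

R, t = rigid_align(src, dst)
print(np.allclose(R, R_true), np.allclose(t, t_true))  # → True True
```

In a full ICP loop this step alternates with nearest-neighbor correspondence search until the pose converges; hypothetical names (`rigid_align`) are illustrative only.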