🤖 AI Summary
Existing monocular, stereo, and RGB-D SLAM methods struggle to simultaneously achieve high-fidelity novel-view synthesis and photorealistic 3D reconstruction, particularly in monocular settings, where geometric ambiguity severely limits rendering quality. To address this, we propose Scaffold-SLAM, a high-fidelity SLAM framework that supports all three camera types. Our key contributions are: (1) an Appearance-from-Motion embedding that enables 3D Gaussians to model image appearance variations across camera poses; and (2) a frequency regularization pyramid that hierarchically guides the distribution of 3D Gaussian Splatting (3DGS) primitives, allowing the model to capture finer scene details. Through joint optimization of these components, our method achieves state-of-the-art photorealistic mapping quality across sensor types, e.g., 16.76% higher PSNR than prior methods on the TUM RGB-D dataset for monocular cameras.
📝 Abstract
3D Gaussian Splatting (3DGS) has recently revolutionized novel view synthesis in Simultaneous Localization and Mapping (SLAM). However, existing SLAM methods built on 3DGS fail to provide high-quality novel view rendering for monocular, stereo, and RGB-D cameras simultaneously. Notably, some methods perform well with RGB-D cameras but suffer significant degradation in rendering quality with monocular cameras. In this paper, we present Scaffold-SLAM, which delivers simultaneous localization and high-quality photorealistic mapping across monocular, stereo, and RGB-D cameras. We introduce two key innovations to achieve this state-of-the-art visual quality. First, we propose the Appearance-from-Motion embedding, enabling 3D Gaussians to better model image appearance variations across different camera poses. Second, we introduce a frequency regularization pyramid to guide the distribution of Gaussians, allowing the model to effectively capture finer details in the scene. Extensive experiments on monocular, stereo, and RGB-D datasets demonstrate that Scaffold-SLAM significantly outperforms state-of-the-art methods in photorealistic mapping quality, e.g., PSNR is 16.76% higher on the TUM RGB-D dataset for monocular cameras.
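To make the first innovation concrete, below is a minimal PyTorch sketch of what an Appearance-from-Motion embedding could look like: the camera pose is flattened and encoded into a latent appearance code that conditions a per-anchor color MLP, in the spirit of Scaffold-GS-style decoders. All module names, dimensions, and the pose encoding here are illustrative assumptions, not the paper's released implementation.

```python
import torch
import torch.nn as nn

class AppearanceFromMotionEmbedding(nn.Module):
    """Hypothetical sketch: condition per-Gaussian color prediction on camera pose.

    Assumes (not from the paper's code) that each camera pose is reduced to a
    flat vector and mapped to a latent appearance code, which then modulates
    the color MLP decoding per-anchor features into RGB.
    """

    def __init__(self, pose_dim: int = 12, embed_dim: int = 32, feat_dim: int = 32):
        super().__init__()
        # Pose -> latent appearance code (dimensions are illustrative).
        self.pose_encoder = nn.Sequential(
            nn.Linear(pose_dim, 64), nn.ReLU(inplace=True),
            nn.Linear(64, embed_dim),
        )
        # Anchor feature + appearance code -> RGB color.
        self.color_mlp = nn.Sequential(
            nn.Linear(feat_dim + embed_dim, 64), nn.ReLU(inplace=True),
            nn.Linear(64, 3), nn.Sigmoid(),
        )

    def forward(self, anchor_feat: torch.Tensor, cam_pose: torch.Tensor) -> torch.Tensor:
        # anchor_feat: (N, feat_dim) per-anchor features.
        # cam_pose: (3, 4) world-to-camera extrinsics matrix.
        code = self.pose_encoder(cam_pose.reshape(1, -1))           # (1, embed_dim)
        code = code.expand(anchor_feat.shape[0], -1)                # broadcast to all anchors
        return self.color_mlp(torch.cat([anchor_feat, code], -1))  # (N, 3) RGB


# Usage: colors for 1024 anchors under one camera pose.
feats = torch.randn(1024, 32)
pose = torch.eye(4)[:3]  # identity extrinsics as a placeholder
colors = AppearanceFromMotionEmbedding()(feats, pose)
```

The intent of such a design is that lighting and exposure changes correlated with camera motion are absorbed by the appearance code rather than baked into the Gaussians themselves, which helps keep renderings consistent across views.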
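Similarly, a frequency regularization pyramid can be sketched as a multi-scale loss that compares rendered and ground-truth images in the Fourier amplitude domain, with downsampled levels constraining progressively lower frequency bands. The function below is a hedged approximation under that assumption; the paper's exact formulation may differ.

```python
import torch
import torch.nn.functional as F

def frequency_pyramid_loss(rendered: torch.Tensor,
                           target: torch.Tensor,
                           levels: int = 3) -> torch.Tensor:
    """Hypothetical sketch of a frequency-regularization pyramid.

    Assumes (not from the paper) that rendered/target are (B, C, H, W) images
    and that each pyramid level penalizes the amplitude-spectrum difference of
    a downsampled pair, so fine levels constrain high-frequency detail and
    coarse levels constrain low-frequency structure.
    """
    loss = rendered.new_zeros(())
    for _ in range(levels):
        # Amplitude spectra of the current pyramid level.
        amp_r = torch.fft.rfft2(rendered, norm="ortho").abs()
        amp_t = torch.fft.rfft2(target, norm="ortho").abs()
        loss = loss + F.l1_loss(amp_r, amp_t)
        # Halve resolution for the next (lower-frequency) level.
        rendered = F.avg_pool2d(rendered, kernel_size=2)
        target = F.avg_pool2d(target, kernel_size=2)
    return loss / levels


# Usage: regularize a rendered frame against the ground-truth image.
pred = torch.rand(1, 3, 128, 160, requires_grad=True)
gt = torch.rand(1, 3, 128, 160)
frequency_pyramid_loss(pred, gt).backward()
```

Gradients from such a loss would flow back through the differentiable 3DGS rasterizer, encouraging Gaussians to densify where high-frequency detail is under-reconstructed.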