Scaffold-SLAM: Structured 3D Gaussians for Simultaneous Localization and Photorealistic Mapping

📅 2025-01-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing monocular, stereo, and RGB-D SLAM methods struggle to achieve high-fidelity novel-view synthesis and photorealistic 3D reconstruction simultaneously, particularly in monocular settings, where geometric ambiguity severely limits rendering quality. To address this, the paper proposes Scaffold-SLAM, a SLAM framework that delivers localization and high-quality photorealistic mapping across all three camera types. Its key contributions are: (1) an Appearance-from-Motion embedding, which lets 3D Gaussians model image appearance variations across different camera poses; and (2) a frequency regularization pyramid, which guides the distribution of Gaussians so that 3D Gaussian Splatting (3DGS) captures finer scene details. Together, these components achieve state-of-the-art photorealistic mapping quality, e.g., PSNR 16.76% higher than prior methods on the TUM RGB-D dataset with monocular cameras.
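The Appearance-from-Motion embedding can be pictured as a small learned mapping from camera pose to a per-view appearance code that modulates each Gaussian's predicted color. The sketch below is a minimal illustration of that idea, not the paper's implementation: all dimensions, function names, and the specific modulation scheme are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions (not taken from the paper).
POSE_DIM = 7    # camera pose: translation (3) + quaternion (4)
EMBED_DIM = 16  # size of the per-view appearance embedding
HIDDEN = 32

# Tiny two-layer MLP mapping a camera pose to an appearance embedding.
W1 = rng.standard_normal((POSE_DIM, HIDDEN)) * 0.1
W2 = rng.standard_normal((HIDDEN, EMBED_DIM)) * 0.1

def appearance_from_motion(pose):
    """Map a camera-pose vector to a view-dependent appearance embedding."""
    h = np.maximum(pose @ W1, 0.0)  # ReLU
    return h @ W2

def modulate_color(base_rgb, embedding, W_color):
    """Apply a bounded, view-dependent offset to a Gaussian's base colour."""
    offset = np.tanh(embedding @ W_color)       # offset in [-1, 1]
    return np.clip(base_rgb + 0.1 * offset, 0.0, 1.0)

pose = rng.standard_normal(POSE_DIM)
emb = appearance_from_motion(pose)
W_color = rng.standard_normal((EMBED_DIM, 3)) * 0.1
rgb = modulate_color(np.array([0.5, 0.5, 0.5]), emb, W_color)
```

In the paper this mapping is trained jointly with the rest of the pipeline, so Gaussians can explain exposure and appearance changes between views instead of absorbing them as geometry error.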

📝 Abstract
3D Gaussian Splatting (3DGS) has recently revolutionized novel view synthesis in Simultaneous Localization and Mapping (SLAM). However, existing SLAM methods utilizing 3DGS have failed to provide high-quality novel view rendering for monocular, stereo, and RGB-D cameras simultaneously. Notably, some methods perform well for RGB-D cameras but suffer significant degradation in rendering quality for monocular cameras. In this paper, we present Scaffold-SLAM, which delivers simultaneous localization and high-quality photorealistic mapping across monocular, stereo, and RGB-D cameras. We introduce two key innovations to achieve this state-of-the-art visual quality. First, we propose the Appearance-from-Motion embedding, enabling 3D Gaussians to better model image appearance variations across different camera poses. Second, we introduce a frequency regularization pyramid to guide the distribution of Gaussians, allowing the model to effectively capture finer details in the scene. Extensive experiments on monocular, stereo, and RGB-D datasets demonstrate that Scaffold-SLAM significantly outperforms state-of-the-art methods in photorealistic mapping quality, e.g., PSNR is 16.76% higher on the TUM RGB-D dataset for monocular cameras.
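A frequency regularization pyramid of the kind the abstract describes can be sketched as a multi-scale loss that compares rendered and target images in the frequency domain at several resolutions. The code below is a generic illustration under assumed choices (2x2 average-pool pyramid, FFT-magnitude L1 loss, hand-picked level weights); the paper's exact formulation may differ.

```python
import numpy as np

def downsample(img):
    """Halve resolution by 2x2 average pooling (one pyramid level)."""
    h, w = img.shape[0] // 2 * 2, img.shape[1] // 2 * 2
    img = img[:h, :w]
    return 0.25 * (img[0::2, 0::2] + img[1::2, 0::2] +
                   img[0::2, 1::2] + img[1::2, 1::2])

def frequency_loss(rendered, target):
    """L1 distance between the FFT magnitude spectra of two images."""
    fr = np.abs(np.fft.fft2(rendered))
    ft = np.abs(np.fft.fft2(target))
    return np.mean(np.abs(fr - ft))

def frequency_pyramid_loss(rendered, target, levels=3,
                           weights=(1.0, 0.5, 0.25)):
    """Sum frequency-domain losses over a coarse-to-fine image pyramid.

    Penalizing spectral mismatch at multiple scales encourages the
    Gaussians to reproduce both low-frequency structure and
    high-frequency detail.
    """
    total = 0.0
    for lvl in range(levels):
        total += weights[lvl] * frequency_loss(rendered, target)
        rendered, target = downsample(rendered), downsample(target)
    return total

rng = np.random.default_rng(0)
img = rng.random((64, 64))
loss = frequency_pyramid_loss(img, img)  # identical images give zero loss
```

Because high-frequency error shows up directly in the spectrum, such a loss pushes the optimizer to densify Gaussians where fine detail is missing, which matches the abstract's claim that the pyramid guides the distribution of Gaussians.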
Problem

Research questions and friction points this paper is trying to address.

SLAM
Monocular Camera
Visual Mapping
Innovation

Methods, ideas, or system contributions that make the work stand out.

Scaffold-SLAM
Appearance-from-Motion Embedding
Frequency Regularization Pyramid