🤖 AI Summary
Existing SLAM methods struggle to jointly model scene geometry, semantics, and instances in real time, limiting 3D understanding in robotics and AR applications. This paper introduces the first end-to-end SLAM system for panoptic 3D scene reconstruction from RGB-D video, unifying geometric mapping, 3D semantic segmentation, and 3D instance segmentation. Key contributions include: (1) the first integration of panoptic segmentation into a SLAM framework; (2) an online Spatial-Temporal Lifting (STL) module that refines 2D pseudo-labels across views and constructs a consistent, robust 3D Gaussian panoptic representation; and (3) joint rendering of depth, color, semantic, and instance maps via enhanced 3D Gaussian rasterization, incorporating large vision model priors and multi-view geometric consistency optimization. Evaluated on open-world RGB-D sequences, the method achieves the first real-time panoptic 3D reconstruction, outperforming state-of-the-art semantic SLAM approaches in both mapping accuracy and tracking robustness.
📝 Abstract
Understanding geometric, semantic, and instance information in 3D scenes from sequential video data is essential for applications in robotics and augmented reality. However, existing Simultaneous Localization and Mapping (SLAM) methods generally focus on either geometric or semantic reconstruction. In this paper, we introduce PanoSLAM, the first SLAM system to integrate geometric reconstruction, 3D semantic segmentation, and 3D instance segmentation within a unified framework. Our approach builds upon 3D Gaussian Splatting, modified with several critical components to enable efficient rendering of depth, color, semantic, and instance information from arbitrary viewpoints. To achieve panoptic 3D scene reconstruction from sequential RGB-D videos, we propose an online Spatial-Temporal Lifting (STL) module that transfers 2D panoptic predictions from vision models into 3D Gaussian representations. The STL module addresses label noise and inconsistencies in 2D predictions by refining the pseudo-labels across multi-view inputs, creating a coherent 3D representation that enhances segmentation accuracy. Our experiments show that PanoSLAM outperforms recent semantic SLAM methods in both mapping and tracking accuracy. For the first time, it achieves panoptic 3D reconstruction of open-world environments directly from RGB-D video. (Code: https://github.com/runnanchen/PanoSLAM)
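To make the lifting idea concrete, below is a minimal sketch of one way 2D pseudo-labels could be aggregated onto 3D points across views: each Gaussian center is projected into every camera, reads the 2D label under its pixel, and keeps the majority vote. All names (`lift_labels`, the pinhole `(K, R, t)` convention) are illustrative assumptions, not the paper's actual STL implementation, which operates online on a 3D Gaussian representation.

```python
import numpy as np

def lift_labels(gaussian_xyz, cams, label_maps, num_classes):
    """Assign each 3D point the majority 2D label across views.

    gaussian_xyz: (N, 3) world-space Gaussian centers.
    cams: list of (K, R, t) pinhole cameras (world -> camera).
    label_maps: list of (H, W) integer 2D panoptic label maps.
    NOTE: hypothetical helper for illustration, not the paper's STL module.
    """
    n = gaussian_xyz.shape[0]
    votes = np.zeros((n, num_classes), dtype=np.int64)
    for (K, R, t), labels in zip(cams, label_maps):
        pts = R @ gaussian_xyz.T + t[:, None]          # world -> camera, (3, N)
        in_front = pts[2] > 1e-6                       # keep points before the camera
        uv = (K @ pts)[:2] / np.clip(pts[2], 1e-6, None)
        u, v = np.round(uv).astype(int)                # pixel coordinates, (N,) each
        h, w = labels.shape
        ok = in_front & (u >= 0) & (u < w) & (v >= 0) & (v < h)
        # each visible Gaussian casts one vote for the label under its pixel
        votes[np.flatnonzero(ok), labels[v[ok], u[ok]]] += 1
    return votes.argmax(axis=1)                        # consensus label per Gaussian
```

A per-class vote like this is one simple way to suppress the frame-to-frame label noise the abstract mentions: a spurious prediction in a single view is outvoted by consistent predictions in the others.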