🤖 AI Summary
Semantic SLAM exhibits limited robustness in real-world scenarios with severe occlusion, geometric ambiguity, or incomplete data, and struggles to integrate the high-level spatial and semantic priors that humans apply naturally. This paper proposes a human-robot collaborative semantic SLAM framework enabled by extended reality (XR), facilitating real-time co-authoring of, and shared access to, dynamic 3D scene graphs between human operators and robots. Key contributions include: (1) a graph-structured semantic fusion mechanism supporting heterogeneous perception inputs (RGB-D, IMU, and manual annotations) alongside high-level semantic injection; and (2) tightly coupled scene graph modeling with incremental graph optimization to enable context-aware enhancement and scalable collaboration. Evaluated on a real-world construction site dataset, the method achieves an 18.7% improvement in room detection accuracy, a 22.3% gain in geometric map precision, and significantly enhanced semantic completeness over fully automated baselines.
📝 Abstract
Semantic SLAM (Simultaneous Localization and Mapping) systems enrich robot maps with structural and semantic information, enabling robots to operate more effectively in complex environments. However, these systems struggle in real-world scenarios with occlusions, incomplete data, or ambiguous geometries, as they cannot fully leverage the higher-level spatial and semantic knowledge humans naturally apply. We introduce HICS-SLAM, a Human-in-the-Loop semantic SLAM framework that uses a shared extended reality environment for real-time collaboration. The system allows human operators to visualize and directly interact with the robot's 3D scene graph, and to inject high-level semantic concepts (e.g., rooms or structural entities) into the mapping process. We propose a graph-based semantic fusion methodology that integrates these human interventions with robot perception, enabling scalable collaboration for enhanced situational awareness. Experimental evaluations on real-world construction site datasets show improvements in room detection accuracy, map precision, and semantic completeness compared to automated baselines, demonstrating both the effectiveness of the approach and its potential for future extensions.
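To make the idea of graph-based semantic fusion concrete, the sketch below shows one minimal way a human-annotated room node could be injected into a robot-built scene graph and linked to the objects it spatially contains. This is an illustrative toy model under assumed types and field names (`Node`, `SceneGraph`, `inject_room`), not the paper's actual data structures or fusion algorithm.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    # Hypothetical node type: ids, kinds, and fields are illustrative only.
    node_id: str
    kind: str                  # "object" or "room"
    position: tuple            # (x, y) centroid in the map frame
    source: str = "robot"      # provenance: robot perception or human annotation

@dataclass
class SceneGraph:
    nodes: dict = field(default_factory=dict)
    edges: set = field(default_factory=set)   # (parent_id, child_id) containment

    def add_node(self, node: Node) -> None:
        self.nodes[node.node_id] = node

    def inject_room(self, room: Node, bounds) -> None:
        """Fuse a human-annotated room: add the node, then attach every
        object whose centroid falls inside the annotated 2D bounds."""
        (xmin, ymin), (xmax, ymax) = bounds
        self.add_node(room)
        for n in self.nodes.values():
            if n.kind == "object":
                x, y = n.position
                if xmin <= x <= xmax and ymin <= y <= ymax:
                    self.edges.add((room.node_id, n.node_id))

# Robot perception populates object nodes; a human annotates a room in XR.
g = SceneGraph()
g.add_node(Node("chair_0", "object", (1.0, 2.0)))
g.add_node(Node("door_0", "object", (5.0, 5.0)))
g.inject_room(Node("room_A", "room", (1.5, 1.5), source="human"),
              bounds=((0.0, 0.0), (3.0, 3.0)))
```

In a full system the containment test would use 3D geometry and the fused graph would feed an incremental optimizer; here the point is only that human input enters as ordinary graph structure alongside robot-perceived nodes.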