🤖 AI Summary
This work addresses a limitation of most existing 3D scene graph methods: they treat perception as a post-processing step over static datasets and decouple scene understanding from observation planning, which hinders long-term, incremental environment modeling for robots. To overcome this, the paper introduces SCOUT, an online semantic exploration framework that formulates semantic scene completeness as an active optimization objective. The approach uses an uncertainty-guided traversal strategy that dynamically balances semantic verification, geometric coverage, and motion cost. By fusing posed RGB-D observations with a prior 2D occupancy map, it incrementally constructs an uncertainty-aware 3D scene graph whose nodes encode posteriors over open-vocabulary object labels and whose edges encode structural relations; this graph in turn drives closed-loop path planning. The system autonomously revisits semantically ambiguous regions and explores unknown space, enabling continuous patrolling, updating, and reasoning with minimal human intervention.
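To make "nodes that maintain posterior beliefs over labels" concrete, here is a minimal sketch of one uncertainty-aware node. It assumes a fixed candidate label vocabulary per node, treats each detector score as a likelihood p(observation | label), and fuses observations with a recursive Bayesian update under a conditional-independence assumption; Shannon entropy of the posterior serves as the ambiguity measure. The class `ObjectNode`, its methods, and the `floor` parameter are hypothetical illustrations, not SCOUT's actual implementation.

```python
import math
from dataclasses import dataclass, field


@dataclass
class ObjectNode:
    """Hypothetical scene-graph node with a categorical belief over candidate labels."""
    node_id: int
    labels: list[str]
    posterior: dict[str, float] = field(init=False)
    edges: list[tuple[str, int]] = field(default_factory=list)  # (relation, target node id)

    def __post_init__(self) -> None:
        # Start from a uniform prior over the (assumed fixed) candidate vocabulary.
        self.posterior = {lbl: 1.0 / len(self.labels) for lbl in self.labels}

    def update(self, likelihoods: dict[str, float], floor: float = 1e-3) -> None:
        """Fuse one observation with Bayes' rule, assuming independent observations.

        `likelihoods` maps labels to p(observation | label). Labels the detector did
        not score get a small `floor` so no label is ruled out permanently.
        """
        unnorm = {
            lbl: p * max(likelihoods.get(lbl, floor), floor)
            for lbl, p in self.posterior.items()
        }
        total = sum(unnorm.values())
        self.posterior = {lbl: v / total for lbl, v in unnorm.items()}

    def entropy(self) -> float:
        """Shannon entropy (nats) of the label posterior; higher means more ambiguous."""
        return -sum(p * math.log(p) for p in self.posterior.values() if p > 0)

    def map_label(self) -> str:
        """Most probable label under the current posterior."""
        return max(self.posterior, key=self.posterior.get)


# Usage: two consistent detections sharpen the belief toward "mug".
node = ObjectNode(node_id=0, labels=["mug", "bowl", "vase"])
initial_entropy = node.entropy()
node.update({"mug": 0.7, "bowl": 0.2, "vase": 0.1})
node.update({"mug": 0.6, "bowl": 0.3, "vase": 0.1})
node.edges.append(("on", 1))  # illustrative structural edge: the mug is on node 1
print(node.map_label(), round(node.entropy(), 3))
```

Keeping the belief as a full distribution rather than a single label is what lets a planner tell a confidently labeled object apart from one that would benefit from another look.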
📝 Abstract
Robots that operate over extended periods should not merely visit space; they should progressively understand it. Yet most 3D scene graph pipelines treat perception as a post-processing stage over a fixed dataset, decoupling scene representation from the decisions that determine what is observed in the first place. We present SCOUT, an online semantic exploration framework that closes this loop by coupling active traversal with probabilistic scene graph construction. Given a prior 2D occupancy map and posed RGB-D observations, SCOUT incrementally builds an uncertainty-aware 3D scene graph whose nodes maintain fused geometry and posterior beliefs over open-vocabulary object labels, while edges encode structural relations such as "on", "inside", "belong", and "next to". These beliefs are fed back to an uncertainty-guided traversal planner, which selects viewpoints by balancing expected semantic certainty gain, geometric coverage gain, and travel cost. In this way, the robot revisits ambiguous objects when additional evidence matters and expands into unseen free space when the scene remains incomplete. The resulting system treats semantic scene completeness as an operational objective rather than a passive by-product of semantic mapping, moving toward autonomous agents that can patrol, update, and reason about evolving indoor environments with minimal human intervention.