🤖 AI Summary
Traditional approaches to long-horizon, language-guided mobile manipulation in dynamic real-world environments fail due to their static-scene assumption, rendering them unable to adapt to continuous changes induced by human–robot interaction or robot motion. To address this, we propose a dynamic open-vocabulary 3D scene graph that enables incremental, online updates of local graph structures—overcoming the limitations of static scene modeling. Our method integrates RGB-D perception, vision-language model-based object detection, 3D instance segmentation, and graph neural networks to construct a hierarchical, semantics-geometry joint scene graph reasoning framework. Evaluated in realistic dynamic environments, it achieves significant improvements: +28.6% in long-horizon task success rate and 63% reduction in scene-update latency compared to static graphs and periodic reconstruction baselines. The core contribution is the first dynamic 3D scene graph representation and reasoning paradigm supporting open-vocabulary grounding, online structural evolution, and long-sequence instruction execution.
📝 Abstract
Enabling mobile robots to perform long-term tasks in dynamic real-world environments is a formidable challenge, especially when the environment changes frequently due to human-robot interactions or the robot's own actions. Traditional methods typically assume static scenes, which limits their applicability in the continuously changing real world. To overcome these limitations, we present DovSG, a novel mobile manipulation framework that leverages dynamic open-vocabulary 3D scene graphs and a language-guided task planning module for long-term task execution. DovSG takes RGB-D sequences as input and utilizes vision-language models (VLMs) for object detection to obtain high-level object semantic features. Based on the segmented objects, a structured 3D scene graph is generated for low-level spatial relationships. Furthermore, an efficient mechanism for locally updating the scene graph, allows the robot to adjust parts of the graph dynamically during interactions without the need for full scene reconstruction. This mechanism is particularly valuable in dynamic environments, enabling the robot to continually adapt to scene changes and effectively support the execution of long-term tasks. We validated our system in real-world environments with varying degrees of manual modifications, demonstrating its effectiveness and superior performance in long-term tasks. Our project page is available at: https://bjhyzj.github.io/dovsg-web.