🤖 AI Summary
This work addresses the trade-off between efficiency and fine-grained semantic representation in open-set robotic tasks. It proposes BiMoSG, a bimodal 3D scene graph generation framework that builds a coarse scene graph efficiently in a default “fast” mode and switches to a “slow” mode in task-relevant regions to produce an open-vocabulary, fine-grained 3D scene graph of the objects there. Representational granularity can thus be adjusted on demand at runtime as the robot explores. The authors report that BiMoSG generates scene graphs significantly faster than open-source state-of-the-art approaches, which allows scene graph generation to be integrated with task execution for real-time deployment.
📝 Abstract
Open-set task execution can benefit significantly from seamlessly switching between coarse and fine scene representations, depending on the context and on the information that evolves as the robot explores the environment. For example, it is often sufficient to start with a coarse scene representation and to employ a finer, more granular representation only when the robot encounters regions that are likely to contain the task-relevant objects. Hence, in this work, we propose BiMoSG, a bimodal 3D scene graph generation approach for open-set tasks. BiMoSG employs a "fast" mode by default to efficiently generate a coarse 3D scene graph and can switch to a "slow" mode to generate a finer, open-vocabulary 3D scene graph of task-relevant objects. We demonstrate that our proposed 3D scene graph generation approach is significantly faster than open-source state-of-the-art approaches. This allows us to integrate the scene graph generation process with task execution for real-time deployment.
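
The fast/slow switching described above can be illustrated with a minimal sketch. This is not the authors' implementation: the names `Node`, `SceneGraph`, `build_scene_graph`, the `relevance` scoring function, and the `threshold` parameter are all hypothetical, and the precomputed coarse and fine labels stand in for whatever perception models the two modes would actually use.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class Node:
    node_id: int
    label: str  # coarse label in fast mode, open-vocabulary description in slow mode
    mode: str   # "fast" or "slow"


@dataclass
class SceneGraph:
    nodes: Dict[int, Node] = field(default_factory=dict)

    def upsert(self, node: Node) -> None:
        # Insert a node, or overwrite the existing node with the same id.
        self.nodes[node.node_id] = node


def build_scene_graph(
    regions: List[Tuple[int, str, str]],
    relevance: Callable[[int], float],
    threshold: float = 0.5,  # hypothetical cut-off, not taken from the paper
) -> SceneGraph:
    """Build a graph that is coarse everywhere and fine only where relevant.

    regions: (region_id, coarse_label, fine_label) triples; the labels are
             placeholders for the outputs of a cheap and an expensive model.
    relevance: hypothetical task-relevance score in [0, 1] for a region id.
    """
    graph = SceneGraph()
    for region_id, coarse_label, fine_label in regions:
        # Fast mode is the default: every region gets a coarse node.
        graph.upsert(Node(region_id, coarse_label, "fast"))
        # Slow mode runs only for regions judged task relevant.
        if relevance(region_id) >= threshold:
            graph.upsert(Node(region_id, fine_label, "slow"))
    return graph


regions = [
    (0, "furniture", "wooden desk with a red mug on it"),
    (1, "wall", "white plaster wall"),
]
scores = {0: 0.9, 1: 0.1}
graph = build_scene_graph(regions, lambda rid: scores[rid])
```

In this toy run, region 0 exceeds the threshold and is refined in slow mode, while region 1 keeps its coarse fast-mode node, so the expensive step is paid only where the task is likely to need it.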