Open-Vocabulary Indoor Object Grounding with 3D Hierarchical Scene Graph

📅 2025-07-16

📈 Citations: 0

✨ Influential: 0

career value

152K/year

🤖 AI Summary

This work addresses the challenges of precise open-vocabulary indoor object grounding and complex spatial relation querying. Methodologically, it introduces a multimodal reasoning framework built upon a 3D hierarchical scene graph: (1) an RGB-D–driven hierarchical scene graph explicitly models intra-layer and inter-layer spatial relations; (2) semantic-geometric joint representations are achieved by fusing open-vocabulary foundation models (e.g., CLIP) with geometry-aware features; and (3) a large language model (LLM) performs multi-step symbolic spatial reasoning to enhance understanding of long-tail queries and context-dependent relations. Experiments on the Habitat-Matterport 3D dataset demonstrate significant improvements in semantic consistency and geometric localization accuracy—achieving an average 12.7% mAP gain over baselines on complex spatial querying tasks—while maintaining robustness and interpretability.

Technology Category

Application Category

📝 Abstract

We propose OVIGo-3DHSG method - Open-Vocabulary Indoor Grounding of objects using 3D Hierarchical Scene Graph. OVIGo-3DHSG represents an extensive indoor environment over a Hierarchical Scene Graph derived from sequences of RGB-D frames utilizing a set of open-vocabulary foundation models and sensor data processing. The hierarchical representation explicitly models spatial relations across floors, rooms, locations, and objects. To effectively address complex queries involving spatial reference to other objects, we integrate the hierarchical scene graph with a Large Language Model for multistep reasoning. This integration leverages inter-layer (e.g., room-to-object) and intra-layer (e.g., object-to-object) connections, enhancing spatial contextual understanding. We investigate the semantic and geometry accuracy of hierarchical representation on Habitat Matterport 3D Semantic multi-floor scenes. Our approach demonstrates efficient scene comprehension and robust object grounding compared to existing methods. Overall OVIGo-3DHSG demonstrates strong potential for applications requiring spatial reasoning and understanding of indoor environments. Related materials can be found at https://github.com/linukc/OVIGo-3DHSG.

Problem

Research questions and friction points this paper is trying to address.

Open-vocabulary object grounding in 3D indoor scenes

Modeling spatial relations via hierarchical scene graphs

Multistep reasoning for complex spatial queries

Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses 3D Hierarchical Scene Graph

Integrates Large Language Model

Leverages open-vocabulary foundation models

🔎 Similar Papers

Beyond Bare Queries: Open-Vocabulary Object Grounding with 3D Scene Graph