3DGraphLLM: Combining Semantic Graphs and Large Language Models for 3D Scene Understanding

📅 2024-12-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the limited relational reasoning capability of embodied agents in 3D scenes, which hinders natural language interaction, this paper proposes a learnable 3D semantic graph representation framework that explicitly models objects together with their spatial, functional, and semantic relations. The structured graph is injected into large language models (LLMs) as input to enable joint vision-language reasoning. It marks the first effort to deeply integrate structured 3D scene graphs with LLMs, going beyond conventional coordinate- or point-cloud-based relational modeling. The methodology comprises four components: 3D scene graph construction, graph neural network-based representation learning, LLM instruction fine-tuning, and multimodal alignment prompt engineering. Evaluated on six benchmarks, including ScanRefer and ScanQA, the approach achieves average improvements of 5.2%–9.8% across referring localization, visual question answering, and description generation tasks, significantly outperforming state-of-the-art methods.
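The core idea described above, serializing a 3D scene graph into a sequence the LLM can consume, can be illustrated with a minimal sketch. Everything here is hypothetical: the `SceneObject` class, the `spatial_relation` rule (a crude left/right heuristic standing in for the learned relation embeddings), and the `serialize_scene_graph` triplet format are illustrative inventions, not the paper's actual implementation.

```python
from dataclasses import dataclass

@dataclass
class SceneObject:
    """Hypothetical container for a detected 3D object."""
    obj_id: int
    label: str
    centroid: tuple  # (x, y, z) in scene coordinates

def spatial_relation(a: SceneObject, b: SceneObject) -> str:
    # Stand-in for a learned relation predictor: a trivial
    # geometric rule based on the x-axis offset only.
    dx = b.centroid[0] - a.centroid[0]
    return "left of" if dx > 0 else "right of"

def serialize_scene_graph(objects, k=2):
    """Flatten each object and its k nearest neighbours into
    (object, relation, neighbour) triplets, rendered as text
    lines an LLM prompt could include."""
    lines = []
    for a in objects:
        # k nearest neighbours by squared Euclidean distance
        others = sorted(
            (b for b in objects if b.obj_id != a.obj_id),
            key=lambda b: sum((x - y) ** 2
                              for x, y in zip(a.centroid, b.centroid)),
        )[:k]
        for b in others:
            lines.append(
                f"<{a.label}-{a.obj_id}> {spatial_relation(a, b)} "
                f"<{b.label}-{b.obj_id}>"
            )
    return "\n".join(lines)
```

In the actual method, each triplet would be a sequence of learned embeddings rather than plain text, but the flattening step, turning a graph of objects and pairwise relations into a linear LLM input, has the same shape.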

📝 Abstract
A 3D scene graph represents a compact scene model, storing information about the objects and the semantic relationships between them, making its use promising for robotic tasks. When interacting with a user, an embodied intelligent agent should be capable of responding to various queries about the scene formulated in natural language. Large Language Models (LLMs) are beneficial solutions for user-robot interaction due to their natural language understanding and reasoning abilities. Recent methods for creating learnable representations of 3D scenes have demonstrated the potential to improve the quality of LLMs' responses by adapting to the 3D world. However, the existing methods do not explicitly utilize information about the semantic relationships between objects, limiting themselves to information about their coordinates. In this work, we propose 3DGraphLLM, a method for constructing a learnable representation of a 3D scene graph. The learnable representation is used as input for LLMs to perform 3D vision-language tasks. In our experiments on the popular ScanRefer, RIORefer, Multi3DRefer, ScanQA, SQA3D, and Scan2Cap datasets, we demonstrate the advantage of this approach over baseline methods that do not use information about the semantic relationships between objects. The code is publicly available at https://github.com/CognitiveAISystems/3DGraphLLM.
Problem

Research questions and friction points this paper is trying to address.

3D Spatial Understanding
Object Relationships
Natural Language Processing
Innovation

Methods, ideas, or system contributions that make the work stand out.

3DGraphLLM
Spatial Relationships
Natural Language Interaction
Tatiana Zemskova
IAI MIPT LLC
mathematical modeling, neuroscience
Dmitry A. Yudin
Artificial Intelligence Research Institute, Moscow Institute of Physics and Technology