GraphPad: Inference-Time 3D Scene Graph Updates for Embodied Question Answering

📅 2025-06-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing static 3D scene representations struggle to adapt to changing task requirements, often omitting critical objects, spatial relations, and fine-grained details. To address this, the paper proposes GraphPad, an inference-time, language-driven framework for online 3D scene graph editing that lets embodied agents insert, delete, or annotate objects, spatial relations, and task-specific notes through API calls, keeping scene understanding aligned with task goals. The method comprises three components: (1) a mutable scene graph representing the environment; (2) a frame-by-frame navigation log for temporal grounding; and (3) a scratchpad that buffers task-specific notes, all driven by a vision-language model that parses instructions into graph operations. On the OpenEQA benchmark, the approach reaches 55.3% accuracy, surpassing an image-only baseline that uses the same vision-language model by 3.0 points while consuming five times fewer input frames. This demonstrates the efficacy of lightweight, training-free, task-adaptive structured dynamic memory.

📝 Abstract
Structured scene representations are a core component of embodied agents, helping to consolidate raw sensory streams into readable, modular, and searchable formats. Due to their high computational overhead, many approaches build such representations in advance of the task. However, when the task specifications change, such static approaches become inadequate as they may miss key objects, spatial relations, and details. We introduce GraphPad, a modifiable structured memory that an agent can tailor to the needs of the task through API calls. It comprises a mutable scene graph representing the environment, a navigation log indexing frame-by-frame content, and a scratchpad for task-specific notes. Together, GraphPad serves as a dynamic workspace that remains complete, current, and aligned with the agent's immediate understanding of the scene and its task. On the OpenEQA benchmark, GraphPad attains 55.3%, a +3.0% increase over an image-only baseline using the same vision-language model, while operating with five times fewer input frames. These results show that allowing online, language-driven refinement of 3-D memory yields more informative representations without extra training or data collection.
Problem

Research questions and friction points this paper is trying to address.

Dynamic scene graph updates for real-time task adaptation
Reducing computational overhead in embodied question answering
Enhancing 3D memory with language-driven refinement
Innovation

Methods, ideas, or system contributions that make the work stand out.

Modifiable structured memory via API calls
Mutable scene graph for dynamic environment representation
Language-driven refinement of 3D memory
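The abstract describes GraphPad as a structured memory the agent edits through API calls: a mutable scene graph, a frame-indexed navigation log, and a scratchpad for task notes. A minimal sketch of such a memory is shown below; all class and method names (`GraphPadMemory`, `add_object`, `log_frame`, etc.) are illustrative assumptions, not the authors' actual API.

```python
from dataclasses import dataclass, field

@dataclass
class GraphPadMemory:
    """Sketch of a GraphPad-style editable memory (names are hypothetical)."""
    objects: dict = field(default_factory=dict)   # object id -> attribute dict
    relations: set = field(default_factory=set)   # (subject, predicate, object) triples
    nav_log: list = field(default_factory=list)   # (frame index, content summary)
    scratchpad: list = field(default_factory=list)  # free-form task-specific notes

    def add_object(self, obj_id, **attrs):
        # Insert a new object or annotate an existing one with extra attributes.
        self.objects[obj_id] = {**self.objects.get(obj_id, {}), **attrs}

    def delete_object(self, obj_id):
        # Remove a stale object and prune any relations that mention it.
        self.objects.pop(obj_id, None)
        self.relations = {r for r in self.relations if obj_id not in (r[0], r[2])}

    def add_relation(self, subj, pred, obj):
        # Record a spatial relation between two known objects.
        if subj in self.objects and obj in self.objects:
            self.relations.add((subj, pred, obj))

    def log_frame(self, frame_idx, summary):
        # Index what was visible in a given frame for later lookup.
        self.nav_log.append((frame_idx, summary))

    def note(self, text):
        # Jot a task-specific observation into the scratchpad.
        self.scratchpad.append(text)


# Usage: a vision-language model would emit calls like these as it refines
# its understanding of the scene during the task.
mem = GraphPadMemory()
mem.add_object("mug_1", color="red", room="kitchen")
mem.add_object("table_2", room="kitchen")
mem.add_relation("mug_1", "on", "table_2")
mem.log_frame(12, "kitchen view: red mug on table")
mem.note("question asks for the mug's color -> red")
mem.delete_object("table_2")  # dangling relations are pruned automatically
```

The key design point the paper emphasizes is that all of these edits happen at inference time, with no retraining: the memory stays "complete, current, and aligned" because the agent can revise it whenever new observations or task instructions arrive.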