🤖 AI Summary
This work addresses the performance degradation of imitation learning in large-scale, partially observable environments, where agents often lack sufficient long-term spatiotemporal context. To mitigate this limitation, the authors propose a structured memory mechanism based on dynamic scene graphs. This approach introduces dynamic scene graphs into imitation learning for the first time, explicitly modeling spatiotemporal context by continuously recording object-centric relational structures and their temporal evolution. By doing so, it enhances the agent’s capacity to reason over historical and spatial information. Experimental results show that the proposed approach substantially improves policy performance in both simulated mobile manipulation and real-world tabletop manipulation, with particularly strong gains in settings that demand long-horizon reasoning and robust generalization under partial observability.
📝 Abstract
Imitation learning enables robots to learn how to execute tasks via observation. However, real-world environments like homes and offices are often only partially observable because of their large spatial scale. In addition, many tasks involve executing a series of subtasks, which requires autonomous robots to reason over extended time horizons. To address these challenges, we propose using scene graphs as an explicit and structured memory mechanism in imitation learning. By maintaining a dynamic scene graph that captures object-centric relationships and their evolution over time, our method allows the agent to retain relevant historical context during task execution and to reason efficiently over incrementally accrued scene information. Our experiments on simulated mobile manipulation and real-world tabletop manipulation demonstrate that our approach substantially improves policy performance, particularly in settings that demand long-term reasoning and robust generalization under partial observability.
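To make the idea of a scene graph used as memory more concrete, the following is a minimal illustrative sketch and not the authors' implementation. The class name `SceneGraphMemory`, its methods, the triple format, and the rule for retracting stale relations are all assumptions introduced here for illustration; the abstract does not specify how the graph is built or how it is fed to the policy.

```python
from dataclasses import dataclass, field


@dataclass
class SceneGraphMemory:
    """Hypothetical object-centric memory; all names here are illustrative."""

    # object id -> (attributes, timestep at which it was last observed)
    nodes: dict = field(default_factory=dict)
    # (subject, relation, object) -> timestep at which it was last observed
    edges: dict = field(default_factory=dict)
    # chronological log of relation changes: (timestep, "add" | "remove", triple)
    history: list = field(default_factory=list)

    def update(self, t, objects, relations):
        """Merge one partial observation taken at timestep t.

        objects:   dict mapping object id -> attribute dict (visible objects only)
        relations: list of (subject, relation, object) triples seen at t
        """
        for obj_id, attrs in objects.items():
            self.nodes[obj_id] = (attrs, t)

        observed = set(objects)
        current = set(relations)

        # Assumed retraction rule: drop a stored relation only when both of its
        # endpoints are visible now and the relation is no longer reported.
        # Relations involving out-of-view objects are kept as memory.
        for triple in list(self.edges):
            subj, _, obj = triple
            if subj in observed and obj in observed and triple not in current:
                del self.edges[triple]
                self.history.append((t, "remove", triple))

        for triple in relations:
            if triple not in self.edges:
                self.history.append((t, "add", triple))
            self.edges[triple] = t

    def relations_of(self, obj_id):
        """Return all stored relations that mention obj_id, sorted."""
        return sorted(tr for tr in self.edges if obj_id in (tr[0], tr[2]))


# Usage: the cup leaves the field of view at t=1 but its relation is retained.
memory = SceneGraphMemory()
memory.update(0, {"cup": {"color": "red"}, "table": {}}, [("cup", "on", "table")])
memory.update(1, {"drawer": {}}, [])
print(memory.relations_of("cup"))
```

In this sketch the graph keeps both the latest relational state (`edges`) and a log of how it changed (`history`), which is one simple way to expose historical context to a policy under partial observability. How such a graph would be encoded as policy input is left open here.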