AI Summary
To address task interruption in end-to-end robotic control caused by transient occlusion or out-of-view object displacement, this paper proposes MindMap, a novel framework that pioneers the deep integration of semantic-level 3D environment reconstruction with diffusion-based policy networks to realize long-term spatial memory in 3D action policies. Methodologically, MindMap models spatial memory via depth-aware feature maps, unifying real-time semantic 3D reconstruction and diffusion-driven 3D action generation within a single end-to-end trainable architecture that jointly optimizes perception, memory, and decision-making. Experiments across diverse simulated manipulation tasks requiring spatial memory demonstrate that MindMap significantly outperforms memory-less baselines and state-of-the-art methods. To foster reproducibility and community advancement, we publicly release the 3D reconstruction system, training code, and evaluation benchmark.
Abstract
End-to-end learning of robot control policies, structured as neural networks, has emerged as a promising approach to robotic manipulation. In many common tasks, relevant objects move in and out of the robot's field of view. In these settings, spatial memory - the ability to remember the spatial composition of the scene - is an important competency. However, building such mechanisms into robot learning systems remains an open research problem. We introduce mindmap (Spatial Memory in Deep Feature Maps for 3D Action Policies), a 3D diffusion policy that generates robot trajectories based on a semantic 3D reconstruction of the environment. We show in simulation experiments that our approach is effective at solving tasks where state-of-the-art approaches without memory mechanisms struggle. We release our reconstruction system, training code, and evaluation tasks to spur research in this direction.
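The pipeline the abstract describes - accumulate a semantic 3D feature map across camera frames so the scene persists outside the current field of view, then condition a diffusion-style denoising loop on that map to generate a trajectory - can be sketched in miniature. This is a hypothetical illustration of the idea, not the released mindmap API; all class names, feature dimensions, and the stand-in noise predictor are assumptions.

```python
# Minimal sketch of a mindmap-style pipeline: a persistent semantic feature
# map as spatial memory, plus a toy reverse-diffusion loop conditioned on it.
# Names and shapes are illustrative assumptions, not the authors' code.
import numpy as np

class FeatureMap3D:
    """Persistent map of 3D points with per-point semantic features."""
    def __init__(self, feat_dim=16):
        self.points = np.empty((0, 3))          # world-frame XYZ
        self.features = np.empty((0, feat_dim)) # per-point deep features

    def integrate(self, points, features):
        # Append the current frame's back-projected points. A real system
        # would fuse and deduplicate, but appending already shows the memory
        # mechanism: points remain after the object leaves the field of view.
        self.points = np.vstack([self.points, points])
        self.features = np.vstack([self.features, features])

def denoise_trajectory(feature_map, horizon=8, steps=10, seed=0):
    """Toy reverse-diffusion loop conditioned on the map's mean feature."""
    rng = np.random.default_rng(seed)
    traj = rng.standard_normal((horizon, 3))    # start from pure noise
    context = feature_map.features.mean(axis=0) # global conditioning vector
    target = feature_map.points.mean(axis=0)    # stand-in "goal" from the map
    for _ in range(steps):
        # Stand-in for a learned noise predictor: nudge the trajectory
        # toward the map centroid, modulated by the conditioning features.
        traj += 0.2 * (target - traj) * np.tanh(np.linalg.norm(context))
    return traj

map3d = FeatureMap3D()
for _ in range(3):  # three camera frames observed over time
    map3d.integrate(np.random.rand(50, 3), np.random.rand(50, 16))
trajectory = denoise_trajectory(map3d)
print(trajectory.shape)  # (8, 3): horizon of 8 end-effector waypoints
```

The point of the sketch is the separation of concerns: `FeatureMap3D` is the memory (it outlives any single observation), while the policy only ever reads the map, so occluded or out-of-view objects still influence the generated trajectory.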