🤖 AI Summary
This work addresses long-horizon mobile manipulation, which requires continual reasoning about localization, environment changes, and task progress that is hard to infer from image observations alone. The authors propose SERF, a spatiotemporal environment and robot feature map that represents the environment and the articulated robot body as neural points in a shared latent space. The map is updated online from egocentric observations and proprioceptive state: environment points through object-level rigid tracking and robot points through forward kinematics. Map tokens extracted from multiple reference frames and spatial scales give a vision-language-action (VLA) policy both local and global context. On the BEHAVIOR-1K benchmark, the SERF policy outperforms image-only baselines, reaches subgoals faster by following more direct trajectories, is more robust to scene-configuration shifts, and recovers from object-drop failures.
📝 Abstract
Long-horizon robot mobile manipulation requires continual reasoning about localization, environment changes, and task progress, all of which are challenging to infer from image observations alone. In this paper, we show that conditioning a mobile manipulation policy on a spatiotemporal feature map improves reasoning over long horizons. The map represents the environment and the articulated robot body as neural points in a shared latent space and is updated online from egocentric observations and proprioceptive state. We update the environment neural points using object-level rigid tracking and the robot neural points using forward kinematics. We use our spatiotemporal environment and robot feature (SERF) map as a state input to a vision-language-action (VLA) model by extracting map tokens from multiple reference frames and spatial scales, providing the policy with both local and global context. We demonstrate SERF on BEHAVIOR-1K, a benchmark for long-horizon mobile manipulation in household environments. Experiments show that the SERF VLA policy outperforms image-only baselines, reaches subgoals faster by following more direct trajectories, improves robustness to scene-configuration shifts, and recovers from object-drop failures.
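The two update rules named in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names (`apply_rigid`, `update_environment_points`, `update_robot_points`, `planar_fk`), the per-point object and link labels, and the planar-chain kinematics are all hypothetical and chosen only for illustration. The sketch also assumes that the latent features attached to each point stay fixed while only the point positions move.

```python
import numpy as np


def apply_rigid(points, T):
    """Apply a 4x4 homogeneous transform T to an (N, 3) array of points."""
    return points @ T[:3, :3].T + T[:3, 3]


def update_environment_points(positions, object_ids, object_motions):
    """Move each environment point with the rigid motion of its object.

    positions:      (N, 3) point positions at the previous step.
    object_ids:     (N,) integer label of the object each point belongs to.
    object_motions: dict object_id -> 4x4 transform from previous to current
                    pose (assumed to come from some object-level tracker).
    Points whose object has no reported motion are left where they were.
    """
    updated = positions.copy()
    for obj_id, T in object_motions.items():
        mask = object_ids == obj_id
        updated[mask] = apply_rigid(positions[mask], T)
    return updated


def planar_fk(joint_angles, link_lengths):
    """Forward kinematics of a planar serial chain (illustrative only).

    Returns a dict link_index -> 4x4 world pose of that link's frame, whose
    origin sits at the link's joint and whose x-axis points along the link.
    """
    T = np.eye(4)
    poses = {}
    for i, (theta, length) in enumerate(zip(joint_angles, link_lengths)):
        c, s = np.cos(theta), np.sin(theta)
        rot = np.eye(4)
        rot[:2, :2] = [[c, -s], [s, c]]
        T = T @ rot               # rotate about this joint's z-axis
        poses[i] = T.copy()
        trans = np.eye(4)
        trans[0, 3] = length
        T = T @ trans             # advance to the next joint
    return poses


def update_robot_points(local_positions, link_ids, link_poses):
    """Place robot points in the world frame from forward-kinematics poses.

    local_positions: (M, 3) point positions expressed in their link frames.
    link_ids:        (M,) integer label of the link each point is attached to.
    link_poses:      dict link_id -> 4x4 world pose of the link frame.
    """
    world = np.zeros_like(local_positions)
    for link_id, T in link_poses.items():
        mask = link_ids == link_id
        world[mask] = apply_rigid(local_positions[mask], T)
    return world
```

In this sketch, environment points are updated incrementally (previous pose to current pose), whereas robot points are recomputed from the current joint state at every step, so errors in the robot part do not accumulate over time. How SERF itself handles features, point insertion, or tracking failures is not specified in the abstract and is not modeled here.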