🤖 AI Summary
To address the poor generalization of vision-language navigation (VLN) agents to unseen environments, this paper proposes the first hybrid memory system integrating real-world perception and imagination generation, inspired by human episodic simulation and memory mechanisms. Methodologically: (1) we design an updatable and generative episodic memory architecture, introducing episodic simulation into VLN for the first time; (2) we build a Transformer-based memory encoder and a cross-modal imagination decoder to enable high-fidelity RGB scene imagination; (3) we introduce self-supervised pretraining tasks—imagined reconstruction and temporal consistency—to strengthen memory grounding and dynamics modeling. Evaluated on mainstream VLN benchmarks, our approach achieves state-of-the-art SPL scores, with significant improvements in navigation success rate and path efficiency in unseen environments. These results empirically validate that imagination-augmented memory substantially enhances embodied navigation generalization.
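To make the hybrid-memory idea concrete, here is a minimal toy sketch of a memory that grows both from real observations (navigation actions) and from imagined future embeddings. All names are illustrative assumptions: the paper's actual components are a Transformer-based memory encoder and a cross-modal imagination decoder producing RGB images, whereas the "decoder" below is just a fixed linear map over embeddings.

```python
import numpy as np

class HybridEpisodicMemory:
    """Toy sketch of a reality-imagination hybrid memory.
    Hypothetical interface; not the paper's actual implementation."""

    def __init__(self, dim, capacity=64, seed=0):
        self.capacity = capacity
        self.slots = []  # list of (embedding, is_imagined) pairs
        rng = np.random.default_rng(seed)
        # Stand-in for the cross-modal imagination decoder: a fixed linear map.
        self.W = rng.standard_normal((dim, dim)) / np.sqrt(dim)

    def observe(self, obs_embedding):
        """Write a real observation into memory (taken after a navigation action)."""
        self._write(np.asarray(obs_embedding, dtype=float), is_imagined=False)

    def imagine(self):
        """Expand memory with an imagined future-scene embedding,
        predicted from the most recent slot."""
        assert self.slots, "need at least one real observation first"
        last, _ = self.slots[-1]
        predicted = np.tanh(self.W @ last)  # toy dynamics model
        self._write(predicted, is_imagined=True)
        return predicted

    def _write(self, emb, is_imagined):
        self.slots.append((emb, is_imagined))
        if len(self.slots) > self.capacity:  # evict the oldest slot
            self.slots.pop(0)

    def counts(self):
        """Return (number of real slots, number of imagined slots)."""
        real = sum(1 for _, imagined in self.slots if not imagined)
        return real, len(self.slots) - real

mem = HybridEpisodicMemory(dim=8)
mem.observe(np.ones(8))   # real perception
mem.imagine()             # imagined expansion of memory
mem.observe(np.zeros(8))  # another real perception
print(mem.counts())       # (2, 1): two real slots, one imagined
```

The point of the sketch is only the bookkeeping: memory is updated by both perception and imagination, and both kinds of slots are available to a downstream policy.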
📝 Abstract
Humans navigate unfamiliar environments using episodic simulation and episodic memory, which facilitate a deeper understanding of the complex relationships between environments and objects. Developing an imaginative memory system inspired by these human mechanisms can enhance the navigation performance of embodied agents in unseen environments. However, existing Vision-and-Language Navigation (VLN) agents lack a memory mechanism of this kind. To address this, we propose a novel architecture that equips agents with a reality-imagination hybrid memory system. This system enables agents to maintain and expand their memory through both imaginative mechanisms and navigation actions. Additionally, we design tailored pre-training tasks to develop the agent's imaginative capabilities. Our agent can imagine high-fidelity RGB images of future scenes, achieving state-of-the-art results in Success weighted by Path Length (SPL).