SpaceVLN: A Zero-Shot Vision-and-Language Navigation Agent with Online Spatial Cognitive Memory and Reasoning

📅 2026-06-07

🤖 AI Summary

This work addresses a limitation of existing vision-and-language navigation methods: they often rely on local visual cues and linear history-based reasoning, and so struggle to model global spatial relationships among regions, paths, and landmarks in unseen environments. The authors propose SpaceVLN, a zero-shot navigation agent built on spatial cognitive memory and task-guided spatial chain-of-thought (Spatial-CoT) reasoning. Within a stagewise closed-loop framework, the agent progressively abstracts explored regions into Spatial Waypoints and maintains landmark evidence tied to the current subtask, giving it a hierarchical view of both navigation progress and environmental structure. The approach handles vision-and-language navigation and object-goal navigation in a single zero-shot setting, without task-specific policy training. It reports state-of-the-art zero-shot performance on R2R-CE, RxR-CE, GN-Bench, and HM3D-OVON, and is further validated by real-robot deployment.

📝 Abstract

Vision-and-Language Navigation in continuous environments requires agents to understand the spatial structure of previously unseen environments in order to follow language instructions. Although foundation models have opened a promising path toward zero-shot navigation without task-specific policy training, many navigators still rely on local visual cues and linear history-based reasoning, overlooking the spatial nature of navigation across explored regions, traversed paths, landmarks, and their spatial relations. In this paper, we propose SpaceVLN, a navigation agent built around Spatial Cognitive Memory and Task-Guided Spatial Reasoning. Specifically, SpaceVLN introduces an efficient stagewise closed-loop framework where planning and execution are organized around verifiable space-landmark stages. During navigation, the agent progressively abstracts explored regions into Spatial Waypoints and dynamically maintains subtask-grounded landmark evidence, forming a hierarchical Spatial Cognitive Memory for progress localization and spatial-relation understanding. Built on this memory, Spatial-CoT integrates task-progress reasoning with spatial perception, analysis, and prediction, enabling Task-Guided Spatial Reasoning for embodied navigation. The unified stage interface enables SpaceVLN to address both Vision-and-Language Navigation and Object-Goal Navigation under a unified zero-shot setting, without task-specific policy training. Across R2R-CE, RxR-CE, GN-Bench, and HM3D-OVON, SpaceVLN achieves state-of-the-art zero-shot performance, and real-robot deployment further validates its applicability. These results highlight Spatial Cognitive Memory and Task-Guided Spatial Reasoning as a practical foundation for stronger embodied navigation agents.
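
The abstract describes a two-level memory (Spatial Waypoints plus landmark evidence) and stages that can be verified, but it does not specify data structures or update rules. The Python sketch below is only one way such a memory might be organized, not the authors' implementation. Every name in it (`SpatialWaypoint`, `SpatialMemory`, `observe`, `stage_verified`, `merge_radius`, the confidence threshold) is hypothetical, and the distance-based merging rule is an assumption made purely for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from math import hypot
from typing import Optional


@dataclass
class SpatialWaypoint:
    """Hypothetical abstraction of one explored region."""
    wid: int
    position: tuple[float, float]
    neighbors: set[int] = field(default_factory=set)
    # Landmark name -> highest detection confidence seen at this waypoint.
    landmarks: dict[str, float] = field(default_factory=dict)


class SpatialMemory:
    """Illustrative waypoint graph with per-waypoint landmark evidence."""

    def __init__(self, merge_radius: float = 1.5) -> None:
        # Assumption: observations within merge_radius share a waypoint.
        self.merge_radius = merge_radius
        self.waypoints: list[SpatialWaypoint] = []
        self.current: Optional[int] = None

    def _nearest(self, position: tuple[float, float]) -> Optional[SpatialWaypoint]:
        """Return the closest waypoint within merge_radius, if any."""
        best, best_dist = None, self.merge_radius
        for wp in self.waypoints:
            dist = hypot(wp.position[0] - position[0], wp.position[1] - position[1])
            if dist <= best_dist:
                best, best_dist = wp, dist
        return best

    def observe(self, position: tuple[float, float],
                detections: dict[str, float]) -> SpatialWaypoint:
        """Fold one observation into the memory and return its waypoint."""
        wp = self._nearest(position)
        if wp is None:
            wp = SpatialWaypoint(len(self.waypoints), position)
            self.waypoints.append(wp)
        # Record the traversed path as an undirected edge between waypoints.
        if self.current is not None and self.current != wp.wid:
            wp.neighbors.add(self.current)
            self.waypoints[self.current].neighbors.add(wp.wid)
        # Keep the strongest evidence seen for each landmark.
        for name, conf in detections.items():
            wp.landmarks[name] = max(conf, wp.landmarks.get(name, 0.0))
        self.current = wp.wid
        return wp

    def stage_verified(self, landmark: str, threshold: float = 0.5) -> bool:
        """Check whether the current waypoint holds enough evidence for a landmark."""
        if self.current is None:
            return False
        return self.waypoints[self.current].landmarks.get(landmark, 0.0) >= threshold


memory = SpatialMemory(merge_radius=1.5)
memory.observe((0.0, 0.0), {"sofa": 0.8})
memory.observe((0.5, 0.0), {"sofa": 0.6})   # close enough to merge into waypoint 0
memory.observe((3.0, 0.0), {"door": 0.6})   # far enough to create waypoint 1
print(len(memory.waypoints), memory.stage_verified("door"))
```

In this toy version a stage such as "reach the door" counts as verified only when the agent's current waypoint holds sufficiently confident evidence for that landmark. How SpaceVLN actually verifies stages, and how Spatial-CoT reasons over the memory, is not detailed in the abstract.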

Problem

Research questions and friction points this paper is trying to address.

Vision-and-Language Navigation

Spatial Reasoning

Zero-Shot Navigation

Embodied AI

Cognitive Memory

Innovation

Methods, ideas, or system contributions that make the work stand out.

Spatial Cognitive Memory

Task-Guided Spatial Reasoning

Zero-Shot Navigation