🤖 AI Summary
This paper addresses a limitation of existing vision-language navigation (VLN) methods: they rely on task-specific training and generalize poorly to continuous, real-world environments. To this end, the authors propose a zero-shot VLN framework that performs end-to-end multimodal reasoning without fine-tuning. The method introduces a dynamic action space modeling mechanism, combines cross-robot collaborative graph memory with retrieval-augmented, self-correcting decision-making, and stores object locations as an executable topological relation graph maintained through distributed updates, supporting flexible goal selection and dynamic viewpoint planning. Key innovations include: (i) the first training-free deployment paradigm for VLN; (ii) dynamic action space adaptation; and (iii) collaborative graph memory. The framework achieves zero-shot state-of-the-art performance on the GOAT and ObjectNav benchmarks and demonstrates strong generalization and robustness in real-world physical experiments.
📝 Abstract
We present DyNaVLM, an end-to-end vision-language navigation framework built on vision-language models (VLMs). In contrast to prior methods constrained to fixed angular or distance intervals, our system lets agents freely select navigation targets via visual-language reasoning. At its core lies a self-refining graph memory that 1) stores object locations as executable topological relations, 2) enables cross-robot memory sharing through distributed graph updates, and 3) enhances the VLM's decision-making via retrieval augmentation. Operating without task-specific training or fine-tuning, DyNaVLM achieves strong performance on the GOAT and ObjectNav benchmarks, and real-world tests further validate its robustness and generalization. The system's three innovations (dynamic action space formulation, collaborative graph memory, and training-free deployment) establish a new paradigm for scalable embodied robots, bridging the gap between discrete VLN tasks and continuous real-world navigation.
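To make the graph-memory idea concrete, here is a minimal sketch of how such a structure could work. This is an illustrative assumption, not the paper's implementation: the class name `GraphMemory` and all method names are hypothetical. It captures the three stated roles: storing objects as topological relations, merging updates across robots, and retrieving relations as text for prompt augmentation.

```python
from dataclasses import dataclass, field

@dataclass
class GraphMemory:
    """Hypothetical sketch of a self-refining, shareable graph memory."""
    nodes: dict = field(default_factory=dict)  # label -> (x, y, timestamp)
    edges: set = field(default_factory=set)    # (label_a, relation, label_b)

    def observe(self, label, pos, t):
        # Self-refinement: keep only the newest position estimate per object.
        if label not in self.nodes or self.nodes[label][2] < t:
            self.nodes[label] = (pos[0], pos[1], t)

    def relate(self, a, relation, b):
        # Store an executable topological relation between two objects.
        self.edges.add((a, relation, b))

    def merge(self, other):
        # Distributed graph update: take the union of relations and
        # the freshest node estimate from either robot.
        for label, (x, y, t) in other.nodes.items():
            self.observe(label, (x, y), t)
        self.edges |= other.edges

    def retrieve(self, query_label):
        # Retrieval augmentation: serialize relations mentioning the
        # query object as text snippets for the VLM prompt.
        return [f"{a} {r} {b}" for (a, r, b) in sorted(self.edges)
                if query_label in (a, b)]
```

For example, two robots could each build a local graph, call `merge` to share memories, and then `retrieve("chair")` to ground a navigation query in the combined map.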