🤖 AI Summary
This paper addresses a limitation of existing vision-language navigation (VLN) methods: they rely on task-specific training and generalize poorly to continuous, real-world environments. To this end, the authors propose a zero-shot VLN framework that performs end-to-end multimodal reasoning without fine-tuning. The method introduces a dynamic action space modeling mechanism, combines cross-robot collaborative graph memory with retrieval-augmented, self-correcting decision-making, and stores object locations as an executable topological relation graph maintained through distributed updates, supporting flexible goal selection and dynamic viewpoint planning. Key innovations include: (i) the first training-free deployment paradigm for VLN; (ii) dynamic action space adaptation; and (iii) collaborative graph memory. The framework achieves zero-shot state-of-the-art performance on the GOAT and ObjectNav benchmarks and demonstrates strong generalization and robustness in real-world physical experiments.
📝 Abstract
We present DyNaVLM, an end-to-end vision-language navigation framework built on vision-language models (VLMs). In contrast to prior methods constrained to fixed angular or distance intervals, our system lets agents freely select navigation targets via visual-language reasoning. At its core lies a self-refining graph memory that 1) stores object locations as executable topological relations, 2) enables cross-robot memory sharing through distributed graph updates, and 3) enhances the VLM's decision-making via retrieval augmentation. Operating without task-specific training or fine-tuning, DyNaVLM achieves strong performance on the GOAT and ObjectNav benchmarks, and real-world tests further validate its robustness and generalization. The system's three innovations (dynamic action space formulation, collaborative graph memory, and training-free deployment) establish a new paradigm for scalable embodied robots, bridging the gap between discrete VLN tasks and continuous real-world navigation.
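To make the graph-memory idea concrete, here is a minimal sketch of how such a structure could work. This is an illustrative assumption, not the paper's implementation: the class name `GraphMemory` and all method names are hypothetical. It captures the three stated roles: storing objects as topological relations, merging updates across robots, and retrieving relations as text for prompt augmentation.

```python
from dataclasses import dataclass, field

@dataclass
class GraphMemory:
    """Hypothetical sketch of a self-refining, shareable graph memory."""
    nodes: dict = field(default_factory=dict)  # label -> (x, y, timestamp)
    edges: set = field(default_factory=set)    # (label_a, relation, label_b)

    def observe(self, label, pos, t):
        # Self-refinement: keep only the newest position estimate per object.
        if label not in self.nodes or self.nodes[label][2] < t:
            self.nodes[label] = (pos[0], pos[1], t)

    def relate(self, a, relation, b):
        # Store an executable topological relation between two objects.
        self.edges.add((a, relation, b))

    def merge(self, other):
        # Distributed graph update: take the union of relations and
        # the freshest node estimate from either robot.
        for label, (x, y, t) in other.nodes.items():
            self.observe(label, (x, y), t)
        self.edges |= other.edges

    def retrieve(self, query_label):
        # Retrieval augmentation: serialize relations mentioning the
        # query object as text snippets for the VLM prompt.
        return [f"{a} {r} {b}" for (a, r, b) in sorted(self.edges)
                if query_label in (a, b)]
```

For example, two robots could each build a local graph, call `merge` to share memories, and then `retrieve("chair")` to ground a navigation query in the combined map.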