DyNaVLM: Zero-Shot Vision-Language Navigation System with Dynamic Viewpoints and Self-Refining Graph Memory

📅 2025-06-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper addresses the limitation of existing vision-language navigation (VLN) methods, which rely on task-specific training and generalize poorly to continuous, real-world environments. To this end, the authors propose a zero-shot VLN framework that enables end-to-end multimodal reasoning without fine-tuning. The method introduces a dynamic action space modeling mechanism, integrates cross-robot collaborative graph memory with retrieval-augmented, self-correcting decision-making, and stores object locations in an executable topological relation graph maintained through distributed graph updates, supporting flexible goal selection and dynamic viewpoint planning. Key innovations include: (i) a zero-training deployment paradigm for VLN; (ii) dynamic action space adaptation; and (iii) collaborative graph memory. The framework achieves zero-shot state-of-the-art performance on the GOAT and ObjectNav benchmarks and demonstrates strong generalization and robustness in real-world physical experiments.

📝 Abstract
We present DyNaVLM, an end-to-end vision-language navigation framework built on Vision-Language Models (VLMs). In contrast to prior methods constrained by fixed angular or distance intervals, our system empowers agents to freely select navigation targets via visual-language reasoning. At its core lies a self-refining graph memory that 1) stores object locations as executable topological relations, 2) enables cross-robot memory sharing through distributed graph updates, and 3) enhances the VLM's decision-making via retrieval augmentation. Operating without task-specific training or fine-tuning, DyNaVLM demonstrates high performance on the GOAT and ObjectNav benchmarks. Real-world tests further validate its robustness and generalization. The system's three innovations (dynamic action space formulation, collaborative graph memory, and training-free deployment) establish a new paradigm for scalable embodied robots, bridging the gap between discrete VLN tasks and continuous real-world navigation.
Problem

Research questions and friction points this paper is trying to address.

Enables free navigation target selection via visual-language reasoning
Uses self-refining graph memory for object location storage
Bridges gap between discrete tasks and continuous real-world navigation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic action space formulation for navigation
Collaborative graph memory for cross-robot sharing
Training-free deployment using Vision-Language Models
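The collaborative graph memory described above can be sketched as a small data structure: nodes hold observed object locations, edges hold topological relations, a merge operation models distributed cross-robot updates, and a nearest-neighbor query models retrieval augmentation for the VLM prompt. This is a minimal illustrative sketch; all class, method, and field names, and the position-averaging refinement rule, are assumptions rather than details from the paper.

```python
import math


class GraphMemory:
    """Illustrative sketch of a self-refining topological graph memory."""

    def __init__(self):
        self.nodes = {}    # object name -> (x, y) position estimate
        self.edges = set() # frozenset({a, b}): a topological relation between objects

    def observe(self, name, pos):
        # Self-refinement (assumed rule): re-observing an object averages the
        # stored position with the new estimate instead of overwriting it.
        if name in self.nodes:
            ox, oy = self.nodes[name]
            self.nodes[name] = ((ox + pos[0]) / 2, (oy + pos[1]) / 2)
        else:
            self.nodes[name] = tuple(pos)

    def connect(self, a, b):
        # Record an executable topological relation between two known objects.
        if a in self.nodes and b in self.nodes:
            self.edges.add(frozenset((a, b)))

    def merge(self, other):
        # Distributed update: fold another robot's graph into this one.
        for name, pos in other.nodes.items():
            self.observe(name, pos)
        self.edges |= other.edges

    def retrieve(self, pos, k=3):
        # Retrieval augmentation: return the k objects nearest to the agent,
        # e.g. to be serialized into the VLM's prompt as navigation context.
        def dist(name):
            nx, ny = self.nodes[name]
            return math.hypot(nx - pos[0], ny - pos[1])
        return sorted(self.nodes, key=dist)[:k]
```

A typical flow under these assumptions: each robot calls `observe`/`connect` as it explores, robots periodically `merge` each other's graphs, and `retrieve` supplies nearby known objects to the VLM when choosing the next navigation target.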
Zihe Ji
SJTU Paris Elite Institute of Technology, Shanghai Jiao Tong University, Shanghai, P.R. China
Huangxuan Lin
Shanghai Jiao Tong University
Yue Gao
MoE Key Lab of Artificial Intelligence and AI Institute, Shanghai Jiao Tong University, P.R. China