🤖 AI Summary
This work addresses the challenging problem of vision-language navigation in unknown, complex environments using only natural language instructions. Methodologically, we propose an end-to-end framework integrating active perception and historical memory: (i) a viewpoint-driven active localization mechanism—built upon off-the-shelf vision-language models (VLMs)—dynamically optimizes observation viewpoints; and (ii) a historical observation memory retrieval module mitigates vision-language ambiguity by selectively retaining and recalling past observations. Leveraging zero-shot inference, the framework generalizes to open-ended language commands without labeled data or fine-tuning. Our key contribution lies in departing from conventional passive perception paradigms: by enabling proactive viewpoint adjustment and selective memory retention, we significantly enhance the cross-scene robustness of vision-language grounding. Extensive evaluation on the Habitat-Matterport 3D (HM3D) dataset demonstrates substantial improvements over state-of-the-art methods. Furthermore, real-world deployment on a quadruped robot validates both its effectiveness and generalizability in physically complex, unstructured environments.
📝 Abstract
Visual navigation in unknown environments based solely on natural language descriptions is a key capability for intelligent robots. In this work, we propose a navigation framework built upon off-the-shelf Vision-Language Models (VLMs), enhanced with two human-inspired mechanisms: perspective-based active grounding, which dynamically adjusts the robot's viewpoint for improved visual inspection, and historical memory backtracking, which enables the system to retain and re-evaluate uncertain observations over time. Unlike existing approaches that passively rely on incidental visual inputs, our method actively optimizes perception and leverages memory to resolve ambiguity, significantly improving vision-language grounding in complex, unseen environments. Our framework operates in a zero-shot manner, generalizing to diverse and open-ended language descriptions without requiring labeled data or model fine-tuning. Experimental results on Habitat-Matterport 3D (HM3D) show that our method outperforms state-of-the-art approaches in language-driven object navigation. We further demonstrate its practicality through real-world deployment on a quadruped robot, achieving robust navigation performance.