MAG-Nav: Language-Driven Object Navigation Leveraging Memory-Reserved Active Grounding

📅 2025-08-07
📈 Citations: 0
Influential citations: 0
🤖 AI Summary
This work addresses the challenging problem of vision-language navigation in unknown, complex environments using only natural language instructions. Methodologically, we propose an end-to-end framework integrating active perception and historical memory: (i) a viewpoint-driven active localization mechanism—built upon off-the-shelf vision-language models (VLMs)—dynamically optimizes observation viewpoints; and (ii) a historical observation memory retrieval module mitigates vision-language ambiguity by selectively retaining and recalling past observations. Leveraging zero-shot inference, the framework achieves strong generalization to open-ended language commands. Our key contribution lies in departing from conventional passive perception paradigms: by enabling proactive viewpoint adjustment and selective memory retention, we significantly enhance cross-scene robustness of vision-language grounding. Extensive evaluation on the HM3D dataset demonstrates substantial improvements over state-of-the-art methods. Furthermore, real-world deployment on a quadruped robot validates both effectiveness and generalizability in physically complex, unstructured environments.
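
The summary describes the active grounding step only in prose. Below is a minimal sketch of how such a viewpoint-selection loop might look, assuming a planar robot and a VLM wrapper that returns a scalar confidence; the `Viewpoint` class, the `capture`/`score`/`move_to` callables, and the 0.8 threshold are illustrative assumptions rather than the authors' implementation.

```python
# Illustrative sketch of viewpoint-driven active grounding (not the authors' code).
# `capture`, `score`, and `move_to` stand in for the robot camera, an
# off-the-shelf VLM grounding query, and the motion controller.
from dataclasses import dataclass
from typing import Any, Callable, Sequence


@dataclass
class Viewpoint:
    x: float      # planar position in the map frame (assumed)
    y: float
    yaw: float    # heading in radians


def active_grounding(
    description: str,
    candidates: Sequence[Viewpoint],
    capture: Callable[[Viewpoint], Any],      # returns an image observation
    score: Callable[[Any, str], float],       # VLM confidence in [0, 1]
    move_to: Callable[[Viewpoint], None],
    threshold: float = 0.8,
) -> tuple[Viewpoint, float] | None:
    """Visit candidate viewpoints and stop once the VLM is confident enough."""
    best: tuple[Viewpoint, float] | None = None
    for vp in candidates:
        move_to(vp)                           # actively adjust the observation pose
        confidence = score(capture(vp), description)
        if best is None or confidence > best[1]:
            best = (vp, confidence)
        if confidence >= threshold:           # grounded with sufficient confidence
            return vp, confidence
    return best                               # otherwise fall back to the best pose seen
```

A loop like this trades a few extra motion steps for a less ambiguous observation, which is the core idea behind replacing passive perception with active viewpoint adjustment.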

📝 Abstract
Visual navigation in unknown environments based solely on natural language descriptions is a key capability for intelligent robots. In this work, we propose a navigation framework built upon off-the-shelf Vision-Language Models (VLMs), enhanced with two human-inspired mechanisms: perspective-based active grounding, which dynamically adjusts the robot's viewpoint for improved visual inspection, and historical memory backtracking, which enables the system to retain and re-evaluate uncertain observations over time. Unlike existing approaches that passively rely on incidental visual inputs, our method actively optimizes perception and leverages memory to resolve ambiguity, significantly improving vision-language grounding in complex, unseen environments. Our framework operates in a zero-shot manner, achieving strong generalization to diverse and open-ended language descriptions without requiring labeled data or model fine-tuning. Experimental results on Habitat-Matterport 3D (HM3D) show that our method outperforms state-of-the-art approaches in language-driven object navigation. We further demonstrate its practicality through real-world deployment on a quadruped robot, achieving robust and effective navigation performance.
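
The abstract does not specify how uncertain observations are retained; the sketch below shows one plausible memory structure, assuming ambiguity is defined by two confidence thresholds. The `ObservationMemory` class, its fields, and the re-scoring policy are hypothetical, not taken from the paper.

```python
# Hypothetical sketch of a historical-observation memory (not the authors' module).
# Observations whose grounding score is ambiguous are retained so they can be
# re-evaluated later, e.g. once exploration has ruled out better candidates.
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class MemoryEntry:
    image: Any           # stored observation
    viewpoint: Any       # pose it was captured from
    score: float         # VLM confidence at the time of capture


@dataclass
class ObservationMemory:
    low: float = 0.3     # below this the observation is discarded outright
    high: float = 0.8    # above this the target is considered grounded
    entries: list[MemoryEntry] = field(default_factory=list)

    def maybe_store(self, image: Any, viewpoint: Any, score: float) -> None:
        """Retain only ambiguous observations (neither clearly wrong nor confident)."""
        if self.low <= score < self.high:
            self.entries.append(MemoryEntry(image, viewpoint, score))

    def backtrack(self, description: str,
                  rescore: Callable[[Any, str], float]) -> MemoryEntry | None:
        """Re-score retained observations and return the most promising one, if any."""
        best: MemoryEntry | None = None
        for entry in self.entries:
            entry.score = rescore(entry.image, description)
            if best is None or entry.score > best.score:
                best = entry
        return best if best is not None and best.score >= self.high else None
```

Keeping only the ambiguous observations bounds the memory size while still letting the robot return to a previously seen candidate once exploration has ruled out better alternatives.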
Problem

Research questions and friction points this paper is trying to address.

Enables robots to navigate using natural language descriptions
Improves visual grounding via active viewpoint adjustment
Enhances navigation accuracy with memory-based ambiguity resolution
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses perspective-based active grounding
Implements historical memory backtracking
Operates in a zero-shot manner, querying an off-the-shelf VLM directly (a hypothetical prompt sketch follows below)
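
The paper does not publish its prompts, so the following is only a guess at how an off-the-shelf VLM could be queried zero-shot for a grounding score; the prompt wording and the 0-10 rating scale are assumptions for illustration.

```python
# Hypothetical zero-shot grounding query (illustrative; not the paper's actual prompt).
def build_grounding_prompt(description: str) -> str:
    """Ask the VLM to rate how well the current image matches the description."""
    return (
        "You are assisting a robot that must find an object described in natural "
        f"language.\nDescription: {description}\n"
        "On a scale from 0 to 10, how well does the attached image show this "
        "object? Answer with a single integer."
    )


def parse_score(reply: str) -> float:
    """Map the VLM's reply to a confidence in [0, 1]; malformed replies score 0."""
    try:
        value = float(reply.strip().split()[0])
    except (ValueError, IndexError):
        return 0.0
    return max(0.0, min(10.0, value)) / 10.0


# Example: parse_score("8") -> 0.8
```

Because the parser clamps any reply into [0, 1] and maps malformed answers to zero confidence, a noisy VLM response degrades gracefully instead of breaking the navigation loop.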