🤖 AI Summary
Traditional LLM-based visual navigation struggles to model spatial geometry and scene occupancy because it relies on textual representations. To address this, we propose a map-free embodied visual navigation framework that directly processes onboard RGB/RGB-D video streams and formulates navigation planning as a “scene-imagination-driven optimal viewpoint selection” problem. Our key contributions are: (1) a future-view imagination module enabling cross-temporal geometric reasoning; (2) a vision-language model (VLM) decision mechanism guided by visual prompts, bypassing the text bottleneck; (3) a selective foveation memory supporting sparse-to-dense long-term spatial modeling; and (4) a point-goal navigation decomposition framework. Evaluated on open-vocabulary object and instance navigation benchmarks in the map-free setting, our method achieves state-of-the-art performance, surpassing most map-based approaches.
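To make the decision loop concrete, the sketch below shows one way the four pieces could fit together: candidate future views are imagined from the current observation, offered to a VLM as visual prompts, the selected view becomes a point-goal sub-task, and observations are appended to memory. This is a minimal illustration, not the paper's actual implementation; all names (imagine_future_views, query_vlm_best_view, point_goal_controller, goal_reached) are hypothetical placeholders, and the learned imagination model and the VLM are stubbed with trivial logic.

```python
# Illustrative sketch of the imagination-driven navigation loop (not the authors' API).
import random
from dataclasses import dataclass


@dataclass
class CandidateView:
    waypoint: tuple          # (x, y, yaw) in the robot's local frame
    imagined_image: object   # predicted RGB image for this waypoint


def imagine_future_views(rgb_frame, num_candidates=8):
    """Stub: the real module predicts images at semantically meaningful
    viewpoints distilled from human navigation preferences."""
    return [CandidateView(waypoint=(random.uniform(-2, 2), random.uniform(-2, 2), 0.0),
                          imagined_image=rgb_frame)
            for _ in range(num_candidates)]


def query_vlm_best_view(goal_text, candidates, memory):
    """Stub: the real system shows the imagined images to a VLM as visual
    prompts and asks it to pick the most informative viewpoint."""
    return random.randrange(len(candidates))


def point_goal_controller(waypoint):
    """Stub: a local point-goal policy drives the robot to the waypoint."""
    pass


def goal_reached(goal_text, rgb_frame):
    """Stub: success check, e.g. a detector-based goal verification."""
    return False


def navigate(goal_text, camera, max_steps=50):
    memory = []                                    # keyframe memory (see abstract)
    for _ in range(max_steps):
        frame = camera()                           # current onboard RGB observation
        if goal_reached(goal_text, frame):
            return True
        candidates = imagine_future_views(frame)   # (1) imagine future views
        best = query_vlm_best_view(goal_text, candidates, memory)  # (2) VLM picks best view
        memory.append(frame)                       # (3) update memory
        point_goal_controller(candidates[best].waypoint)           # (4) point-goal sub-task
    return False
```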
📝 Abstract
Visual navigation is a fundamental capability for autonomous home-assistance robots, enabling long-horizon tasks such as object search. While recent methods have leveraged Large Language Models (LLMs) to incorporate commonsense reasoning and improve exploration efficiency, their planning remains constrained by textual representations, which cannot adequately capture spatial occupancy or scene geometry, both critical factors for navigation decisions. We explore whether Vision-Language Models (VLMs) can achieve mapless visual navigation using only onboard RGB/RGB-D streams, unlocking their potential for spatial perception and planning. We achieve this through an imagination-powered navigation framework, ImagineNav++, which imagines future observation images from candidate robot views and translates navigation planning into a simple best-view image selection problem for VLMs. First, a future-view imagination module distills human navigation preferences to generate semantically meaningful viewpoints with high exploration potential. These imagined views then serve as visual prompts for the VLM to identify the most informative viewpoint. To maintain spatial consistency, we develop a selective foveation memory mechanism, which hierarchically integrates keyframe observations via a sparse-to-dense framework, constructing a compact yet comprehensive memory for long-term spatial reasoning. This approach transforms goal-oriented navigation into a series of tractable point-goal navigation tasks. Extensive experiments on open-vocabulary object and instance navigation benchmarks show that ImagineNav++ achieves state-of-the-art performance in mapless settings, even surpassing most map-based methods, highlighting the importance of scene imagination and memory in VLM-based spatial reasoning.
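As a rough illustration of the sparse-to-dense idea behind the selective foveation memory, the sketch below keeps a small dense buffer of recent frames and promotes only sufficiently novel frames to a sparse long-term keyframe store. The class name, the embedding, the novelty test, and the threshold and window sizes are all assumptions made for illustration; the paper's hierarchical keyframe integration is not specified at this level of detail here.

```python
# Hedged sketch of a sparse-to-dense keyframe memory (illustrative assumptions only).
from collections import deque

import numpy as np


class SelectiveFoveationMemory:
    def __init__(self, dense_window=8, novelty_threshold=0.8):
        self.dense = deque(maxlen=dense_window)   # dense short-term buffer of recent frames
        self.sparse = []                          # sparse long-term keyframes (emb, (frame, pose))
        self.novelty_threshold = novelty_threshold

    @staticmethod
    def _embed(frame):
        # Placeholder embedding: flattened pixels truncated to a fixed length.
        # The real system would use a visual encoder; this is an assumption.
        return np.asarray(frame, dtype=np.float32).reshape(-1)[:512]

    def _is_novel(self, emb):
        # A frame is "novel" if its max cosine similarity to stored keyframes is low.
        if not self.sparse:
            return True
        sims = [float(np.dot(emb, e) / (np.linalg.norm(emb) * np.linalg.norm(e) + 1e-8))
                for e, _ in self.sparse]
        return max(sims) < self.novelty_threshold

    def add(self, frame, pose):
        """Every frame enters the dense buffer; only novel frames are promoted
        to the sparse long-term store, keeping the memory compact."""
        self.dense.append((frame, pose))
        emb = self._embed(frame)
        if self._is_novel(emb):
            self.sparse.append((emb, (frame, pose)))

    def context(self):
        """Observations handed to the VLM: sparse keyframes plus the dense recent window."""
        return [fp for _, fp in self.sparse] + list(self.dense)
```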