🤖 AI Summary
Existing LLM/VLM-based embodied navigation methods face two key bottlenecks in long-horizon navigation: (1) language-based global memory representations—e.g., semantic or topological maps converted to text—suffer from geometric information loss, impairing spatial reasoning; and (2) exclusive reliance on egocentric visual input leads to partial observability and suboptimal decisions. This paper proposes Global-to-Ego Adaptive VLM Navigation, the first framework to introduce dynamic global–egocentric alignment. It integrates a structured map encoder with a real-time egocentric visual encoder, enabling geometry-aware cross-modal attention for joint language–vision reasoning. By transcending the limitations of purely textual descriptions and single-view modeling, the approach achieves new state-of-the-art performance on standard object navigation benchmarks: significantly higher success rates, improved path efficiency, and enhanced robustness and scalability in complex indoor environments.
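The geometry-aware cross-modal attention described above can be sketched as scaled dot-product attention in which egocentric visual tokens query global-map tokens. The projection weights, dimensions, and token counts below are illustrative assumptions, not the paper's actual implementation:

```python
import numpy as np

def cross_modal_attention(ego_feats, map_feats, d_k=64, seed=0):
    """Illustrative cross-attention: egocentric tokens (queries) attend
    over global-map tokens (keys/values). Random projections stand in
    for learned weights; this is a sketch, not the paper's model."""
    rng = np.random.default_rng(seed)
    d_ego, d_map = ego_feats.shape[1], map_feats.shape[1]
    W_q = rng.standard_normal((d_ego, d_k)) / np.sqrt(d_ego)
    W_k = rng.standard_normal((d_map, d_k)) / np.sqrt(d_map)
    W_v = rng.standard_normal((d_map, d_k)) / np.sqrt(d_map)
    Q, K, V = ego_feats @ W_q, map_feats @ W_k, map_feats @ W_v
    scores = Q @ K.T / np.sqrt(d_k)           # (n_ego, n_map) affinities
    attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
    attn /= attn.sum(axis=-1, keepdims=True)  # softmax over map tokens
    return attn @ V                           # map context per ego token

# Toy example: 4 egocentric tokens attend over 9 map-cell tokens.
ego = np.random.default_rng(1).standard_normal((4, 32))
gmap = np.random.default_rng(2).standard_normal((9, 48))
fused = cross_modal_attention(ego, gmap)
print(fused.shape)  # (4, 64): each ego token carries retrieved map context
```

In a real model the fused features would be fed, together with language embeddings, into the VLM's decision head; here the point is simply that attention lets each egocentric view pull in geometrically relevant global context.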
📝 Abstract
Recent advances in Large Language Models (LLMs) and Vision-Language Models (VLMs) have made them powerful tools for embodied navigation, enabling agents to leverage commonsense and spatial reasoning for efficient exploration of unfamiliar environments. Existing LLM-based approaches convert global memory, such as semantic or topological maps, into language descriptions to guide navigation. While this improves efficiency and reduces redundant exploration, the loss of geometric information in language-based representations hinders spatial reasoning, especially in intricate environments. To address this, VLM-based approaches directly process egocentric visual inputs to select optimal directions for exploration. However, relying solely on a first-person perspective makes navigation a partially observed decision-making problem, leading to suboptimal decisions in complex environments. In this paper, we present a novel VLM-based navigation framework that addresses these challenges by adaptively retrieving task-relevant cues from a global memory module and integrating them with the agent's egocentric observations. By dynamically aligning global contextual information with local perception, our approach enhances spatial reasoning and decision-making in long-horizon tasks. Experimental results demonstrate that the proposed method surpasses previous state-of-the-art approaches on object navigation tasks, providing a more effective and scalable solution for embodied navigation.
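The adaptive retrieval of task-relevant cues from global memory can be illustrated as similarity-based lookup: score stored map entries against a task/goal embedding and keep the best matches. All names, the cosine-similarity scoring, and the toy memory contents below are hypothetical placeholders, not the paper's mechanism:

```python
import numpy as np

def retrieve_task_cues(query_vec, memory_keys, memory_vals, k=3):
    """Hedged sketch of adaptive memory retrieval: rank stored global-map
    entries by cosine similarity to a task embedding and return the top-k
    associated cues plus their scores."""
    q = query_vec / (np.linalg.norm(query_vec) + 1e-8)
    keys = memory_keys / (np.linalg.norm(memory_keys, axis=1, keepdims=True) + 1e-8)
    sims = keys @ q                      # cosine similarity per memory entry
    top = np.argsort(-sims)[:k]          # indices of the k best matches
    return [memory_vals[i] for i in top], sims[top]

# Toy memory: embeddings for labelled map regions (hypothetical cues).
rng = np.random.default_rng(0)
keys = rng.standard_normal((5, 16))
vals = ["kitchen", "hallway", "bedroom", "bathroom", "living room"]
goal = keys[2] + 0.1 * rng.standard_normal(16)   # goal resembles entry 2
cues, scores = retrieve_task_cues(goal, keys, vals)
print(cues[0])  # "bedroom" — entry 2 is the closest match
```

The retrieved cues would then be fused with the agent's current egocentric observation before action selection, which is the "dynamic alignment of global context with local perception" the abstract describes.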