🤖 AI Summary
Existing LLM/VLM-based embodied navigation methods face two key bottlenecks in long-horizon navigation: (1) language-based global memory representations—e.g., semantic or topological maps converted to text—suffer from geometric information loss, impairing spatial reasoning; and (2) exclusive reliance on egocentric visual input leads to partial observability and suboptimal decisions. This paper proposes Global-to-Ego Adaptive VLM Navigation, the first framework to introduce dynamic global–egocentric alignment. It integrates a structured map encoder with a real-time egocentric visual encoder, enabling geometry-aware cross-modal attention for joint language–vision reasoning. By transcending the limitations of purely textual descriptions and single-view modeling, the approach achieves new state-of-the-art performance on standard object navigation benchmarks: significantly higher success rates, improved path efficiency, and enhanced robustness and scalability in complex indoor environments.
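The geometry-aware cross-modal attention described above can be sketched as scaled dot-product attention in which egocentric visual tokens query global-map tokens. The projection weights, dimensions, and token counts below are illustrative assumptions, not the paper's actual implementation:

```python
import numpy as np

def cross_modal_attention(ego_feats, map_feats, d_k=64, seed=0):
    """Illustrative cross-attention: egocentric tokens (queries) attend
    over global-map tokens (keys/values). Random projections stand in
    for learned weights; this is a sketch, not the paper's model."""
    rng = np.random.default_rng(seed)
    d_ego, d_map = ego_feats.shape[1], map_feats.shape[1]
    W_q = rng.standard_normal((d_ego, d_k)) / np.sqrt(d_ego)
    W_k = rng.standard_normal((d_map, d_k)) / np.sqrt(d_map)
    W_v = rng.standard_normal((d_map, d_k)) / np.sqrt(d_map)
    Q, K, V = ego_feats @ W_q, map_feats @ W_k, map_feats @ W_v
    scores = Q @ K.T / np.sqrt(d_k)           # (n_ego, n_map) affinities
    attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
    attn /= attn.sum(axis=-1, keepdims=True)  # softmax over map tokens
    return attn @ V                           # map context per ego token

# Toy example: 4 egocentric tokens attend over 9 map-cell tokens.
ego = np.random.default_rng(1).standard_normal((4, 32))
gmap = np.random.default_rng(2).standard_normal((9, 48))
fused = cross_modal_attention(ego, gmap)
print(fused.shape)  # (4, 64): each ego token carries retrieved map context
```

In a real model the fused features would be fed, together with language embeddings, into the VLM's decision head; here the point is simply that attention lets each egocentric view pull in geometrically relevant global context.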
📝 Abstract
Recent advances in Large Language Models (LLMs) and Vision-Language Models (VLMs) have made them powerful tools for embodied navigation, enabling agents to leverage commonsense and spatial reasoning for efficient exploration of unfamiliar environments. Existing LLM-based approaches convert global memory, such as semantic or topological maps, into language descriptions to guide navigation. While this improves efficiency and reduces redundant exploration, the loss of geometric information in language-based representations hinders spatial reasoning, especially in intricate environments. To address this, VLM-based approaches directly process egocentric visual inputs to select optimal directions for exploration. However, relying solely on a first-person perspective makes navigation a partially observed decision-making problem, leading to suboptimal decisions in complex environments. In this paper, we present a novel VLM-based navigation framework that addresses these challenges by adaptively retrieving task-relevant cues from a global memory module and integrating them with the agent's egocentric observations. By dynamically aligning global contextual information with local perception, our approach enhances spatial reasoning and decision-making in long-horizon tasks. Experimental results demonstrate that the proposed method surpasses previous state-of-the-art approaches on object navigation tasks, providing a more effective and scalable solution for embodied navigation.
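The adaptive retrieval of task-relevant cues from global memory can be illustrated as similarity-based lookup: score stored map entries against a task/goal embedding and keep the best matches. All names, the cosine-similarity scoring, and the toy memory contents below are hypothetical placeholders, not the paper's mechanism:

```python
import numpy as np

def retrieve_task_cues(query_vec, memory_keys, memory_vals, k=3):
    """Hedged sketch of adaptive memory retrieval: rank stored global-map
    entries by cosine similarity to a task embedding and return the top-k
    associated cues plus their scores."""
    q = query_vec / (np.linalg.norm(query_vec) + 1e-8)
    keys = memory_keys / (np.linalg.norm(memory_keys, axis=1, keepdims=True) + 1e-8)
    sims = keys @ q                      # cosine similarity per memory entry
    top = np.argsort(-sims)[:k]          # indices of the k best matches
    return [memory_vals[i] for i in top], sims[top]

# Toy memory: embeddings for labelled map regions (hypothetical cues).
rng = np.random.default_rng(0)
keys = rng.standard_normal((5, 16))
vals = ["kitchen", "hallway", "bedroom", "bathroom", "living room"]
goal = keys[2] + 0.1 * rng.standard_normal(16)   # goal resembles entry 2
cues, scores = retrieve_task_cues(goal, keys, vals)
print(cues[0])  # "bedroom" — entry 2 is the closest match
```

The retrieved cues would then be fused with the agent's current egocentric observation before action selection, which is the "dynamic alignment of global context with local perception" the abstract describes.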