Mem2Ego: Empowering Vision-Language Models with Global-to-Ego Memory for Long-Horizon Embodied Navigation

📅 2025-02-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing LLM/VLM-based embodied navigation methods face two key bottlenecks in long-horizon settings: (1) language-based global memory representations (e.g., semantic or topological maps) suffer from geometric information loss, which impairs spatial reasoning; and (2) exclusive reliance on egocentric visual input leaves the agent only partially observed, leading to suboptimal decisions. This paper proposes Global-to-Ego Adaptive VLM Navigation, a framework for dynamic global–egocentric alignment. It integrates a structured map encoder with a real-time egocentric visual encoder, enabling geometry-aware cross-modal attention for joint language–vision reasoning. By moving beyond purely textual map descriptions and single-view modeling, the approach achieves new state-of-the-art performance on standard object navigation benchmarks: significantly higher success rates, improved path efficiency, and greater robustness and scalability in complex indoor environments.

📝 Abstract
Recent advancements in Large Language Models (LLMs) and Vision-Language Models (VLMs) have made them powerful tools in embodied navigation, enabling agents to leverage commonsense and spatial reasoning for efficient exploration in unfamiliar environments. Existing LLM-based approaches convert global memory, such as semantic or topological maps, into language descriptions to guide navigation. While this improves efficiency and reduces redundant exploration, the loss of geometric information in language-based representations hinders spatial reasoning, especially in intricate environments. To address this, VLM-based approaches directly process ego-centric visual inputs to select optimal directions for exploration. However, relying solely on a first-person perspective makes navigation a partially observed decision-making problem, leading to suboptimal decisions in complex environments. In this paper, we present a novel vision-language model (VLM)-based navigation framework that addresses these challenges by adaptively retrieving task-relevant cues from a global memory module and integrating them with the agent's egocentric observations. By dynamically aligning global contextual information with local perception, our approach enhances spatial reasoning and decision-making in long-horizon tasks. Experimental results demonstrate that the proposed method surpasses previous state-of-the-art approaches in object navigation tasks, providing a more effective and scalable solution for embodied navigation.
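The core loop the abstract describes, adaptively retrieving task-relevant cues from a global memory module and fusing them with the agent's egocentric observation before querying the VLM, can be sketched as below. This is a minimal illustrative sketch, not the paper's implementation: the names (`GlobalMemory`, `Landmark`, `navigation_prompt`), the label-plus-distance retrieval heuristic, and the text-only fusion are all assumptions standing in for the paper's learned encoders.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Landmark:
    # Hypothetical record in the global memory map: a label and a
    # world-frame (x, y) position.
    label: str
    position: Tuple[float, float]


class GlobalMemory:
    """Toy global memory: stores observed landmarks and retrieves the
    ones most relevant to the current goal (goal-matching labels first,
    ties broken by distance to the agent)."""

    def __init__(self) -> None:
        self.landmarks: List[Landmark] = []

    def add(self, label: str, position: Tuple[float, float]) -> None:
        self.landmarks.append(Landmark(label, position))

    def retrieve(self, goal: str, agent_pos: Tuple[float, float],
                 k: int = 2) -> List[Landmark]:
        def score(lm: Landmark):
            dx = lm.position[0] - agent_pos[0]
            dy = lm.position[1] - agent_pos[1]
            # False < True, so goal-matching landmarks sort first;
            # within each group, nearer landmarks come first.
            return (goal not in lm.label, (dx * dx + dy * dy) ** 0.5)
        return sorted(self.landmarks, key=score)[:k]


def navigation_prompt(goal: str, ego_view: str,
                      cues: List[Landmark]) -> str:
    # Fuse retrieved global cues with the egocentric observation into a
    # single query (the paper fuses visual features; text stands in here).
    cue_text = "; ".join(f"{lm.label} at {lm.position}" for lm in cues)
    return (f"Goal: find the {goal}. Egocentric view: {ego_view}. "
            f"Relevant map cues: {cue_text}.")


mem = GlobalMemory()
mem.add("sofa", (1.0, 2.0))
mem.add("kitchen table", (5.0, 1.0))
mem.add("table lamp", (2.0, 2.0))
cues = mem.retrieve("table", agent_pos=(0.0, 0.0))
print(navigation_prompt("table", "a hallway with two open doors", cues))
```

Separating retrieval (global memory) from fusion (prompt construction) mirrors the two-stage design the abstract implies: the memory filters the map down to task-relevant cues, so the downstream model reasons over local perception plus a compact global context rather than the full map.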
Problem

Research questions and friction points this paper is trying to address.

Enhancing spatial reasoning in navigation
Integrating global memory with egocentric views
Improving long-horizon embodied navigation efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Combines global memory with ego-centric views
Enhances spatial reasoning in navigation
Improves decision-making in complex environments
👥 Authors
Lingfeng Zhang, PhD student at Tsinghua University (Embodied AI)
Yuecheng Liu, Huawei Noah's Ark Lab
Zhanguang Zhang, Huawei Noah's Ark Lab
Matin Aghaei, Huawei Noah's Ark Lab
Yaochen Hu, Huawei Technologies Canada, University of Alberta (Large-scale machine learning, Optimization, Recommender systems, Approximation algorithms, Statistical machine learning)
Hongjian Gu, Huawei Noah's Ark Lab
Mohammad Ali Alomrani, University of Toronto (Machine Learning)
David Gamaliel Arcos Bravo, Huawei Noah's Ark Lab
Raika Karimi, Huawei Noah's Ark Lab
Atia Hamidizadeh, Huawei Noah's Ark Lab
Haoping Xu, Huawei Noah's Ark Lab
Guowei Huang, Huawei Noah's Ark Lab
Zhanpeng Zhang, Huawei Noah's Ark Lab
Tongtong Cao, Researcher, Huawei Noah's Ark Lab (Robotics, Embodied AI, Autonomous driving)
Weichao Qiu, Researcher, Huawei (Computer vision, Machine learning, Cognitive science)
Xingyue Quan, Huawei Noah's Ark Lab
Jianye Hao, Huawei Noah's Ark Lab / Tianjin University (Multiagent Systems, Embodied AI)
Yuzheng Zhuang, Senior Researcher, Huawei Noah's Ark Lab (Reinforcement Learning, Optimization, Autonomous Driving, Communication)
Yingxue Zhang, Huawei Noah's Ark Lab