🤖 AI Summary
This work addresses two core challenges in embodied intelligence for sequential indoor tasks: (1) subtask infeasibility (e.g., searching for a target object that is absent from the scene) and (2) the need to reuse spatial knowledge accumulated during earlier exploration. To this end, the authors propose an updateable spatial memory mechanism that explicitly incorporates 3D geometric information; to their knowledge, this is the first work to integrate explicit geometric cues into multimodal large language model (MLLM)-based spatial understanding and reasoning. The method jointly leverages visual features, relational semantics, and geometric priors to support sequential reasoning and adaptive exploration. Evaluated on the SEER-Bench benchmark, it achieves substantial gains on sequential embodied question answering and multimodal navigation tasks, demonstrating robust reasoning and spatial knowledge reuse in dynamic, complex indoor environments. The primary contributions are: (1) a geometry-enhanced 3D spatial memory framework designed specifically for sequential embodied tasks, and (2) addressing the poor cross-task transfer of spatial knowledge inherent in conventional single-task paradigms.
📝 Abstract
Existing research on indoor embodied tasks typically requires agents to actively explore unknown environments and reason about the scene to achieve a specific goal. However, when deployed in real-world settings, agents often face sequential tasks, where each new sub-task follows the completion of the previous one and certain sub-tasks may be infeasible, such as searching for a non-existent object. Compared with the single-task setting, the core challenge lies in reusing spatial knowledge accumulated from previous explorations to support subsequent reasoning and exploration. In this work, we investigate this underexplored yet practically significant embodied AI challenge. To evaluate it, we introduce SEER-Bench, a new Sequential Embodied Exploration and Reasoning Benchmark encompassing two classic embodied tasks: Embodied Question Answering (EQA) and Embodied Multi-modal Navigation (EMN). Building on SEER-Bench, we propose 3DSPMR, a 3D SPatial Memory Reasoning approach that exploits relational, visual, and geometric cues from explored regions to augment Multi-Modal Large Language Models (MLLMs) for reasoning and exploration in sequential embodied tasks. To the best of our knowledge, this is the first work to explicitly incorporate geometric information into MLLM-based spatial understanding and reasoning. Extensive experiments verify that 3DSPMR achieves substantial performance gains on both sequential EQA and EMN tasks.