🤖 AI Summary
Current vision-language models are constrained by limited context lengths, which makes it hard to maintain temporally consistent spatial understanding without architectural modifications or fine-tuning. This work proposes a plug-and-play multi-agent framework in which local and global agents collaborate to construct a structured cognitive map that serves as external spatial memory. The approach requires no training or changes to the model architecture, and it introduces atomic map commits and cross-agent verification, so it can be applied to arbitrary pretrained multimodal large language models. Experiments show that the framework achieves superior performance on spatial understanding tasks while remaining fully training-free.
📝 Abstract
Spatial intelligence is a key frontier for multimodal large language models (MLLMs), enabling them to reason about the physical world from visual experience. Inspired by human spatial cognition, recent approaches construct grid-based cognitive maps from multi-frame visual inputs to maintain coherent spatial representations over time. However, limited context lengths still challenge spatial understanding, and existing remedies, such as long-context modeling and external memory, often require architectural changes, memory modules, or fine-tuning, which limits their applicability to off-the-shelf pretrained MLLMs. This motivates a lightweight, model-agnostic method for preserving spatial information beyond the native context window. To this end, we propose a plug-and-play multi-agent framework that collaboratively constructs cognitive maps as structured spatial memory, enhancing the spatial understanding of arbitrary pretrained MLLMs without architectural modification or additional training. Our framework features local-global agent coordination, cognitive map construction with atomic commits, and cross-agent verification. Extensive experiments demonstrate that our method achieves superior performance on spatial understanding tasks while remaining fully training-free. Code will be released.
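The abstract does not spell out how atomic commits and cross-agent verification work, so the following is only a minimal illustrative sketch of one way the idea could look. It assumes a grid map stored as a dictionary, a batch of proposed updates from a local agent, and a verifier callback standing in for a second agent. The names `CognitiveMap`, `Update`, `commit`, and `global_verifier` are hypothetical and are not taken from the paper or its code.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

Cell = Tuple[int, int]


@dataclass(frozen=True)
class Update:
    """One proposed map edit: place `label` at grid cell `cell` (hypothetical schema)."""
    cell: Cell
    label: str


@dataclass
class CognitiveMap:
    """Grid-based spatial memory kept outside the model's context window (assumed layout)."""
    size: int = 10
    cells: Dict[Cell, str] = field(default_factory=dict)

    def commit(self, updates: List[Update], verify: Callable[[Update], bool]) -> bool:
        """Apply every update in the batch, or none of them."""
        staged = dict(self.cells)  # work on a copy so a failure leaves the map untouched
        for u in updates:
            row, col = u.cell
            in_bounds = 0 <= row < self.size and 0 <= col < self.size
            if not in_bounds or not verify(u):
                return False  # reject the whole batch
            staged[u.cell] = u.label
        self.cells = staged  # a single swap makes the whole batch visible at once
        return True


def global_verifier(update: Update) -> bool:
    """Stand-in for a second agent's check; a real system would query an MLLM here."""
    return update.label != "unknown"


cmap = CognitiveMap(size=10)
ok = cmap.commit([Update((2, 3), "chair"), Update((5, 5), "table")], global_verifier)
rejected = cmap.commit([Update((1, 1), "lamp"), Update((0, 0), "unknown")], global_verifier)
```

In this sketch the second batch is rejected as a whole because one of its updates fails verification, so the otherwise valid `lamp` entry is never written. That all-or-nothing behavior is what "atomic" is taken to mean here; the paper's actual mechanism may differ.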