AlloSpatial: Agentic Harness Framework for Spatial Reasoning in Foundation Models

📅 2026-06-07

🤖 AI Summary

This work addresses the difficulty multimodal foundation models have in converting egocentric observations into allocentric spatial representations for reasoning about the physical world. The authors propose AlloSpatial, an agentic framework with two components: World2Mind, a plug-and-play cognitive mapping sandbox, and a Spatial Reasoning Harness. The sandbox builds structured allocentric priors, including Allocentric-Spatial Trees (ASTs) and route maps. The Harness makes reasoning over these priors robust through tool-use judgment, modality-decoupled cue collection, and geometry-semantic arbitration, and the process is further internalized in Qwen3-VL via cold-start reinforcement learning. On VSI-Bench and MindCube, AlloSpatial improves proprietary models by 5%-18% without any training, and ASTs alone support strong spatial reasoning even when visual inputs are removed. After training, the AlloSpatial agents outperform larger general-purpose models and competitive spatial reasoning baselines.

📝 Abstract

Multimodal Foundation Models (MFMs) have made substantial progress, yet remain fragile in spatial reasoning over the physical world. A key bottleneck lies in their inability to transform local egocentric observations into a global allocentric spatial representation. To address this, we propose AlloSpatial, an agentic framework for allocentric spatial cognition in foundation models. AlloSpatial introduces World2Mind, a plug-and-play cognitive mapping sandbox that converts egocentric observations into structured allocentric priors, including Allocentric-Spatial Trees (ASTs) and route maps that support querying object topology, geometric relations, passability, and trajectories. To utilize these priors reliably under noisy reconstruction and ambiguous visual evidence, AlloSpatial introduces a Spatial Reasoning Harness for tool-use judgment, modality-decoupled cue collection, and geometry-semantic arbitration. We further internalize this process in Qwen3-VL through cold-start reinforcement learning with a harness-gated trajectory-level reward. Experiments on VSI-Bench and MindCube show that AlloSpatial improves proprietary models by 5%-18% in a training-free setting, while ASTs alone support strong spatial reasoning even when visual inputs are removed. The trained AlloSpatial agents further outperform larger general-purpose models and competitive spatial baselines, suggesting that structured allocentric representations, active tool use, and verifiable reasoning offer a promising route toward spatially capable foundation models.
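
To make the egocentric-to-allocentric step concrete, the sketch below shows the standard 2D rigid transform that places an observer-relative point into a shared world frame, followed by a simple geometric query over the result. It is only an illustration of the general idea and not World2Mind's implementation: the function names, the frame convention (+x forward, +y left), and the example objects are all assumptions introduced here.

```python
import math


def ego_to_allo(point_ego, cam_xy, cam_yaw):
    """Map a 2D point from the observer's frame into the world frame.

    Assumed convention (not from the paper): in the ego frame +x points
    forward and +y points left; cam_yaw is the observer's heading in
    radians, measured counter-clockwise from the world +x axis.
    """
    fx, fy = point_ego
    c, s = math.cos(cam_yaw), math.sin(cam_yaw)
    # Rotate by the heading, then translate by the observer position.
    return (cam_xy[0] + c * fx - s * fy,
            cam_xy[1] + s * fx + c * fy)


def bearing_deg(a, b):
    """World-frame direction from a to b, in degrees CCW from +x."""
    return math.degrees(math.atan2(b[1] - a[1], b[0] - a[0]))


# Hypothetical scene: two objects seen from two different viewpoints.
chair = ego_to_allo((2.0, 0.0), cam_xy=(0.0, 0.0), cam_yaw=0.0)
lamp = ego_to_allo((1.0, 0.0), cam_xy=(2.0, 1.0), cam_yaw=math.pi / 2)

# Once both objects share one frame, relations between them no longer
# depend on which view each was observed from.
separation = math.dist(chair, lamp)
direction = bearing_deg(chair, lamp)
```

Neither object's egocentric coordinates say anything about the other, since they come from different viewpoints. Distance and direction between them only become computable after both are expressed in the common frame, which is the kind of relation a structured allocentric prior is meant to expose.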

Problem

Research questions and friction points this paper is trying to address.

spatial reasoning

allocentric representation

egocentric observations

multimodal foundation models

cognitive mapping

Innovation

Methods, ideas, or system contributions that make the work stand out.

allocentric spatial representation

spatial reasoning harness

cognitive mapping