Reasmory: 3D Reconstruction as Explicit Memory for VLMs Spatial Reasoning

📅 2026-05-30

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing vision-language models struggle with spatial reasoning tasks such as perspective inference, directional comparison, and distance estimation, and often fail to integrate sparse spatial cues from multi-view images or videos. This work proposes the first approach that incorporates explicit 3D reconstruction as a memory mechanism within vision-language models, leveraging semantically grounded 3D object instances. To enable structured spatial reasoning, the authors design a lightweight domain-specific language (DSL) that guides the model in performing spatial queries, viewpoint transformations, and rendering operations. Crucially, instead of relying on unconstrained tool invocation, the method parses and validates generated programs before execution, which improves reasoning accuracy and robustness. Evaluated on benchmark datasets for spatial reasoning in multi-view imagery and video, the proposed approach outperforms strong baselines, including GPT-5-mini and Gemini-3-flash, by 6% to 18%.
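To make the kinds of spatial queries mentioned above concrete, here is a minimal sketch of a distance query and a viewpoint-dependent left/right comparison over object centroids. It is not the paper's implementation: the function names, the camera convention (x right, y down, z forward), and the pose format (a camera-to-world rotation `R` and a camera position `t`) are all assumptions made for illustration.

```python
import math

def centroid(points):
    """Mean of a list of 3D points, used as a stand-in for an object instance."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(3))

def distance(a, b):
    """Euclidean distance between two 3D points."""
    return math.dist(a, b)

def world_to_camera(p, R, t):
    """Express world point p in the camera frame.

    Assumed convention: R is the camera-to-world rotation (3x3 nested list)
    and t is the camera position in world coordinates, so the camera-frame
    point is R^T (p - t).
    """
    d = [p[i] - t[i] for i in range(3)]
    return tuple(sum(R[i][j] * d[i] for i in range(3)) for j in range(3))

def is_left_of(a, b, R, t):
    """True if a appears to the left of b from the given camera (x points right)."""
    return world_to_camera(a, R, t)[0] < world_to_camera(b, R, t)[0]

# Two hypothetical object centroids in world coordinates.
chair = (-1.0, 0.0, 2.0)
table = (1.0, 0.0, 2.0)

# Camera A: at the origin, looking along +z.
R_a = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
t_a = (0.0, 0.0, 0.0)

# Camera B: on the opposite side, rotated 180 degrees about the y axis.
R_b = [[-1, 0, 0], [0, 1, 0], [0, 0, -1]]
t_b = (0.0, 0.0, 4.0)

print(distance(chair, table))               # 2.0
print(is_left_of(chair, table, R_a, t_a))   # True
print(is_left_of(chair, table, R_b, t_b))   # False
```

The same pair of objects gives opposite answers from the two cameras, which is why a directional question has to be evaluated after an explicit viewpoint transformation and not from raw image positions alone.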

📝 Abstract

Vision-Language Models (VLMs) exhibit emerging spatial reasoning capabilities, yet they remain unreliable on tasks requiring precise spatial understanding, such as viewpoint reasoning, directional comparison, and distance estimation. In multi-view images and monocular videos, relevant spatial cues are often sparse and distributed across redundant observations, making them difficult to organize and exploit. Reconstruction-based Vision Foundation Models (VFMs) offer a natural way to aggregate such observations into explicit spatial memory, such as point clouds. However, simply exposing reconstruction models as free-form tools is brittle: VLMs may invoke tools incorrectly, skip required spatial transformations, or misuse intermediate results. We propose Reasmory, a framework that formulates spatial reasoning as structured program execution over reconstructed spatial memory. Reasmory constructs explicit 3D memory, augments it with semantically grounded 3D object instances, and introduces a lightweight Domain-Specific Language (DSL) that constrains how VLMs query objects and cameras, transform viewpoints, and render observations during reasoning. Generated programs are parsed and validated before execution, enabling more reliable interaction with spatial memory than unconstrained tool use. Experiments on multi-view image and video spatial reasoning benchmarks show consistent gains of 6% to 18% over strong baselines, including GPT-5-mini and Gemini-3-flash, indicating that explicit 3D memory is most useful when accessed through constrained, validated operations rather than free-form tool calls.
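The parse-then-validate step can be illustrated with a small sketch. The abstract does not show the DSL's syntax, so everything below is hypothetical: the operation names (`query_object`, `transform_view`, `distance`, `render`), the one-call-per-line grammar, and the argument kinds are assumptions chosen only to show how a generated program might be rejected before it touches spatial memory.

```python
import re
from dataclasses import dataclass

# Hypothetical operation table: operation name -> expected argument kinds.
# These names are illustrative and are not taken from the paper.
OPS = {
    "query_object": ("object",),
    "query_camera": ("camera",),
    "transform_view": ("camera",),
    "distance": ("object", "object"),
    "render": (),
}

# One call per line, of the form op(arg, arg, ...), with no nesting.
CALL_RE = re.compile(r"^\s*([a-z_]+)\(([^()]*)\)\s*$")

@dataclass
class Call:
    op: str
    args: tuple

class ValidationError(Exception):
    pass

def parse(program):
    """Syntactic check: turn program text into a list of Call records."""
    calls = []
    for lineno, line in enumerate(program.strip().splitlines(), start=1):
        if not line.strip():
            continue  # allow blank lines between calls
        m = CALL_RE.match(line)
        if m is None:
            raise ValidationError(f"line {lineno}: expected op(arg, ...)")
        args = tuple(a.strip() for a in m.group(2).split(",") if a.strip())
        calls.append(Call(m.group(1), args))
    return calls

def validate(calls, objects, cameras):
    """Semantic check: known operations, right arity, references that exist."""
    known = {"object": objects, "camera": cameras}
    for i, call in enumerate(calls, start=1):
        if call.op not in OPS:
            raise ValidationError(f"call {i}: unknown operation {call.op!r}")
        kinds = OPS[call.op]
        if len(call.args) != len(kinds):
            raise ValidationError(
                f"call {i}: {call.op} takes {len(kinds)} argument(s), "
                f"got {len(call.args)}"
            )
        for arg, kind in zip(call.args, kinds):
            if arg not in known[kind]:
                raise ValidationError(f"call {i}: unknown {kind} {arg!r}")

# Hypothetical contents of the reconstructed memory.
objects = {"chair_1", "table_1"}
cameras = {"cam_0"}

good = parse("transform_view(cam_0)\ndistance(chair_1, table_1)\nrender()")
validate(good, objects, cameras)  # passes silently

try:
    validate(parse("distance(chair_1, sofa_9)"), objects, cameras)
except ValidationError as err:
    print("rejected:", err)  # rejected: call 1: unknown object 'sofa_9'
```

Under these assumptions, a program that references an object absent from memory, calls an undefined operation, or passes the wrong number of arguments fails before execution, which is the contrast the abstract draws with free-form tool calls.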

Problem

Research questions and friction points this paper is trying to address.

spatial reasoning

Vision-Language Models

3D reconstruction

explicit memory

multi-view images

Innovation

Methods, ideas, or system contributions that make the work stand out.

3D reconstruction

spatial reasoning

Vision-Language Models