🤖 AI Summary
Large language models (LLMs) exhibit significant deficiencies in geometric and spatial reasoning, particularly in structured understanding of indoor environments. Method: We introduce PlanQA, the first benchmark featuring symbolic (JSON/XML) representations of indoor layouts, with diagnostic question-answering tasks spanning distance estimation, visibility reasoning, path planning, and furniture-plausibility verification, designed to systematically assess models' grasp of topological relations, metric constraints, and design principles. Contribution/Results: Experiments reveal that while mainstream LLMs perform reasonably on superficial queries, they fail systematically at modeling physical constraints, maintaining spatial consistency, and generalizing under layout perturbations. PlanQA provides a reproducible, modular evaluation framework that exposes real-world spatial-reasoning blind spots and supports progress in embodied intelligence and spatial semantic modeling.
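To make the symbolic-layout idea concrete, here is a minimal sketch of what a JSON-encoded scene and a metric query might look like. The schema and field names below are illustrative assumptions, not PlanQA's actual format.

```python
import json
import math

# Hypothetical kitchen layout in the spirit of PlanQA's symbolic JSON scenes.
# All field names (room, objects, x/y/w/d) are assumptions for illustration.
layout_json = """
{
  "room": {"type": "kitchen", "width": 4.0, "depth": 3.0},
  "objects": [
    {"id": "fridge", "x": 0.5, "y": 0.5, "w": 0.7, "d": 0.7},
    {"id": "stove",  "x": 3.2, "y": 0.5, "w": 0.6, "d": 0.6},
    {"id": "sink",   "x": 1.8, "y": 2.5, "w": 0.6, "d": 0.5}
  ]
}
"""

def center_distance(layout, a, b):
    """Euclidean distance between the centers of two objects in the scene."""
    objs = {o["id"]: o for o in layout["objects"]}
    oa, ob = objs[a], objs[b]
    ax, ay = oa["x"] + oa["w"] / 2, oa["y"] + oa["d"] / 2
    bx, by = ob["x"] + ob["w"] / 2, ob["y"] + ob["d"] / 2
    return math.hypot(bx - ax, by - ay)

layout = json.loads(layout_json)
# A distance-estimation question has an exact symbolic ground truth:
print(round(center_distance(layout, "fridge", "stove"), 2))  # → 2.65
```

Because the scene is symbolic, every question of this kind has a deterministic ground-truth answer, which is what makes the benchmark diagnostic rather than perceptual.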
📝 Abstract
We introduce PlanQA, a diagnostic benchmark for evaluating geometric and spatial reasoning in large language models (LLMs). PlanQA is grounded in structured representations of indoor scenes, such as kitchens, living rooms, and bedrooms, encoded in a symbolic format (e.g., JSON, XML layouts). The benchmark includes diverse question types that test not only metric and topological reasoning (e.g., distance, visibility, shortest paths) but also interior design constraints such as affordance, clearance, balance, and usability. Our results across a variety of frontier open-source and commercial LLMs show that while models may succeed in shallow queries, they often fail to simulate physical constraints, preserve spatial coherence, or generalize under layout perturbation. PlanQA uncovers a clear blind spot in today's LLMs: they do not consistently reason about real-world layouts. We hope that this benchmark inspires new work on language models that can accurately infer and manipulate spatial and geometric properties in practical settings.
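The perturbation probe mentioned above can be sketched as follows: rearrange the scene while keeping the question text fixed, and check whether a model's answer tracks the new geometry. The toy dict-based scene and helper names here are assumptions; PlanQA's actual perturbation protocol may differ.

```python
import copy
import math

# Toy living-room scene; coordinates are object centers (illustrative only).
scene = {
    "objects": {
        "sofa":  {"x": 0.5, "y": 1.0},
        "tv":    {"x": 3.5, "y": 1.0},
        "table": {"x": 2.0, "y": 2.5},
    }
}

def distance(scene, a, b):
    """Euclidean distance between two object centers."""
    oa, ob = scene["objects"][a], scene["objects"][b]
    return math.hypot(ob["x"] - oa["x"], ob["y"] - oa["y"])

def swap_positions(scene, a, b):
    """Perturb the layout by swapping two objects' positions."""
    s = copy.deepcopy(scene)
    s["objects"][a], s["objects"][b] = s["objects"][b], s["objects"][a]
    return s

perturbed = swap_positions(scene, "sofa", "table")
# The question ("how far is the sofa from the tv?") is unchanged,
# but the correct answer must follow the perturbed geometry.
print(round(distance(scene, "sofa", "tv"), 2))      # → 3.0
print(round(distance(perturbed, "sofa", "tv"), 2))  # → 2.12
```

A model that has merely memorized surface patterns will tend to repeat the original answer after perturbation, which is exactly the failure mode the benchmark is designed to surface.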