🤖 AI Summary
Large language models (LLMs) exhibit significant deficiencies in geometric and spatial reasoning, particularly in structured understanding of indoor environments. Method: We introduce PlanQA, the first benchmark featuring symbolic (JSON/XML) representations of indoor layouts, with diagnostic question-answering tasks spanning distance estimation, visibility reasoning, path planning, and furniture-plausibility verification, designed to systematically assess models' grasp of topological relations, metric constraints, and design principles. Contribution/Results: Experiments reveal that while mainstream LLMs perform reasonably on superficial queries, they fail systematically at modeling physical constraints, maintaining spatial consistency, and generalizing under layout perturbations. PlanQA provides a reproducible, modular evaluation framework that exposes real-world spatial-reasoning blind spots and supports progress in embodied intelligence and spatial semantic modeling.
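To make the symbolic-layout idea concrete, here is a minimal sketch of what a JSON-encoded scene and a metric query might look like. The schema and field names below are illustrative assumptions, not PlanQA's actual format.

```python
import json
import math

# Hypothetical kitchen layout in the spirit of PlanQA's symbolic JSON scenes.
# All field names (room, objects, x/y/w/d) are assumptions for illustration.
layout_json = """
{
  "room": {"type": "kitchen", "width": 4.0, "depth": 3.0},
  "objects": [
    {"id": "fridge", "x": 0.5, "y": 0.5, "w": 0.7, "d": 0.7},
    {"id": "stove",  "x": 3.2, "y": 0.5, "w": 0.6, "d": 0.6},
    {"id": "sink",   "x": 1.8, "y": 2.5, "w": 0.6, "d": 0.5}
  ]
}
"""

def center_distance(layout, a, b):
    """Euclidean distance between the centers of two objects in the scene."""
    objs = {o["id"]: o for o in layout["objects"]}
    oa, ob = objs[a], objs[b]
    ax, ay = oa["x"] + oa["w"] / 2, oa["y"] + oa["d"] / 2
    bx, by = ob["x"] + ob["w"] / 2, ob["y"] + ob["d"] / 2
    return math.hypot(bx - ax, by - ay)

layout = json.loads(layout_json)
# A distance-estimation question has an exact symbolic ground truth:
print(round(center_distance(layout, "fridge", "stove"), 2))  # → 2.65
```

Because the scene is symbolic, every question of this kind has a deterministic ground-truth answer, which is what makes the benchmark diagnostic rather than perceptual.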
📝 Abstract
We introduce PlanQA, a diagnostic benchmark for evaluating geometric and spatial reasoning in large language models (LLMs). PlanQA is grounded in structured representations of indoor scenes, such as kitchens, living rooms, and bedrooms, encoded in a symbolic format (e.g., JSON, XML layouts). The benchmark includes diverse question types that test not only metric and topological reasoning (e.g., distance, visibility, shortest paths) but also interior design constraints such as affordance, clearance, balance, and usability. Our results across a variety of frontier open-source and commercial LLMs show that while models may succeed in shallow queries, they often fail to simulate physical constraints, preserve spatial coherence, or generalize under layout perturbation. PlanQA uncovers a clear blind spot in today's LLMs: they do not consistently reason about real-world layouts. We hope that this benchmark inspires new work on language models that can accurately infer and manipulate spatial and geometric properties in practical settings.
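The perturbation probe mentioned above can be sketched as follows: rearrange the scene while keeping the question text fixed, and check whether a model's answer tracks the new geometry. The toy dict-based scene and helper names here are assumptions; PlanQA's actual perturbation protocol may differ.

```python
import copy
import math

# Toy living-room scene; coordinates are object centers (illustrative only).
scene = {
    "objects": {
        "sofa":  {"x": 0.5, "y": 1.0},
        "tv":    {"x": 3.5, "y": 1.0},
        "table": {"x": 2.0, "y": 2.5},
    }
}

def distance(scene, a, b):
    """Euclidean distance between two object centers."""
    oa, ob = scene["objects"][a], scene["objects"][b]
    return math.hypot(ob["x"] - oa["x"], ob["y"] - oa["y"])

def swap_positions(scene, a, b):
    """Perturb the layout by swapping two objects' positions."""
    s = copy.deepcopy(scene)
    s["objects"][a], s["objects"][b] = s["objects"][b], s["objects"][a]
    return s

perturbed = swap_positions(scene, "sofa", "table")
# The question ("how far is the sofa from the tv?") is unchanged,
# but the correct answer must follow the perturbed geometry.
print(round(distance(scene, "sofa", "tv"), 2))      # → 3.0
print(round(distance(perturbed, "sofa", "tv"), 2))  # → 2.12
```

A model that has merely memorized surface patterns will tend to repeat the original answer after perturbation, which is exactly the failure mode the benchmark is designed to surface.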