Can AI Reason Like an Urban Planner? Benchmarking Large Language Models Against Professional Judgment

📅 2026-06-10

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study asks whether large language models (LLMs) have the core reasoning capabilities that urban planning requires, namely contextual sensitivity, value awareness, and institutional literacy. It introduces Urban Planning Bench (UPBench), a domain-specific evaluation framework built on a 4x5 matrix of four knowledge pillars and five cognitive levels adapted from Bloom's revised taxonomy, and uses automated scoring together with expert review to assess 25 LLMs. The analysis identifies four epistemic limitations: regulatory hallucination, conceptual conflation, wickedness paralysis, and phronetic deficit. Notably, the models perform better on higher-order analytical tasks than on factual recall and integrative judgment, yielding a non-monotonic cognitive performance curve. LLMs show promise in cross-disciplinary synthesis, literature review, scenario generation, and preliminary policy analysis, but they remain unreliable for interpreting jurisdiction-specific regulations, resolving normative conflicts, and handling context-sensitive procedure.

📝 Abstract

Problem, Research Strategy, and Findings: The rise of large language models (LLMs) raises a key question for urban planning: which forms of professional planning knowledge can AI replicate, and which still require human judgment? Although AI tools are increasingly used in planning practice, there is still no systematic framework for testing whether they can reason with the contextual sensitivity, value awareness, and institutional literacy central to planning expertise. This paper introduces Urban Planning Bench (UPBench), a domain-specific evaluation framework that assesses LLM reasoning through a 4x5 matrix of four knowledge pillars and five cognitive levels adapted from Bloom's revised taxonomy. Evaluating 25 LLMs with automated scoring and expert review, we find a non-monotonic cognitive curve: models perform better on higher-order analytical tasks than on factual recall and integrative judgment. This suggests that planning knowledge often treated as lower-order is deeply shaped by institutional, jurisdictional, and temporal context, making it hard for LLMs to generalize. We summarize these limits as four epistemic diagnostics: regulatory hallucination, conceptual conflation, wickedness paralysis, and phronetic deficit.

Takeaway for Practice: The findings support differential delegation in planning. LLMs can assist with cross-disciplinary synthesis, literature review, scenario generation, and preliminary policy analysis. However, they remain unreliable for jurisdiction-specific regulation, normative conflict resolution, and context-sensitive procedure. Agencies should require verification for AI-assisted regulatory analysis, while planning education should emphasize institutional literacy, normative judgment, and contextual sensitivity.
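To make the 4x5 matrix and the "non-monotonic cognitive curve" concrete, here is a minimal Python sketch. It is not the authors' code. The abstract names neither the four pillars nor the five levels, so `PILLARS` and `LEVELS` hold placeholder labels. The scores are invented to show the shape of the curve and are not results from the paper. The helpers `build_cells` and `is_monotonic` are hypothetical names.

```python
from itertools import product

# Placeholder labels (assumption): the abstract does not name the
# four knowledge pillars or the five cognitive levels.
PILLARS = [f"pillar_{i}" for i in range(1, 5)]
LEVELS = [f"level_{j}" for j in range(1, 6)]


def build_cells():
    """Return the 20 (pillar, level) cells of the 4x5 evaluation matrix."""
    return list(product(PILLARS, LEVELS))


def is_monotonic(scores):
    """True if scores never decrease, or never increase, across levels."""
    pairs = list(zip(scores, scores[1:]))
    return all(a <= b for a, b in pairs) or all(a >= b for a, b in pairs)


# Invented scores (NOT from the paper), ordered from the lowest to the
# highest cognitive level: weaker at both ends, stronger in the middle.
illustrative_scores = [0.55, 0.70, 0.80, 0.75, 0.60]

print(len(build_cells()))
print(is_monotonic(illustrative_scores))
```

A curve that rises from factual recall to analytical tasks and then falls at integrative judgment fails the monotonicity check, which is the pattern the abstract describes.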

Problem

Research questions and friction points this paper is trying to address.

urban planning

large language models

professional judgment

AI reasoning

contextual sensitivity

Innovation

Methods, ideas, or system contributions that make the work stand out.

Urban Planning Bench

large language models

professional judgment