SPaRC: A Spatial Pathfinding Reasoning Challenge

📅 2025-05-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
Problem: Existing reasoning benchmarks exhibit saturation and poor performance on abstract, multi-step spatial and symbolic reasoning, such as path planning and geometric/arithmetic constraint satisfaction.

Method: We introduce SPaRC, the first systematic benchmark for evaluating large language models' (LLMs') capabilities in 2D grid-based spatial path optimization and complex constraint satisfaction, comprising 1,000 puzzles that emphasize stepwise planning and rule-governed reasoning. Puzzles are generated via human-designed formal rules; evaluation integrates human baselines, path validity verification, and reasoning-token tracing for diagnostic analysis.

Contribution/Results: Human accuracy reaches 98.0% (94.5% on hard instances), while the strongest model, o4-mini, achieves only 15.8% (1.1% on hard instances), with over 50% of generated paths invalid. Multi-attempt decoding substantially improves performance. SPaRC uncovers fundamental deficiencies in LLMs' navigational logic, spatial modeling, and adaptive computational depth, establishing a new paradigm for rigorous reasoning evaluation.
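The summary mentions path validity verification as part of the evaluation pipeline. SPaRC's actual puzzle representation and rule set are not given here; as a minimal illustrative sketch, a structural validity check for a proposed grid path might look like the following (the coordinate scheme, function name, and orthogonal-adjacency rule are assumptions, not the paper's specification):

```python
from typing import List, Tuple

Cell = Tuple[int, int]  # (row, col); a hypothetical grid-cell encoding

def is_valid_path(path: List[Cell], rows: int, cols: int,
                  start: Cell, goal: Cell) -> bool:
    """Check that a proposed path is structurally valid on a rows x cols grid:
    it starts at `start`, ends at `goal`, stays in bounds, never revisits a
    cell, and moves only between orthogonally adjacent cells."""
    if not path or path[0] != start or path[-1] != goal:
        return False
    if len(set(path)) != len(path):  # reject any revisited cell
        return False
    for r, c in path:
        if not (0 <= r < rows and 0 <= c < cols):  # stay inside the grid
            return False
    for (r1, c1), (r2, c2) in zip(path, path[1:]):
        if abs(r1 - r2) + abs(c1 - c2) != 1:  # each step must be one orthogonal move
            return False
    return True
```

A check of this kind only establishes that a path is well-formed; puzzle-specific arithmetic and geometric constraints would have to be verified separately.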

📝 Abstract
Existing reasoning datasets saturate and fail to test abstract, multi-step problems, especially pathfinding and complex rule constraint satisfaction. We introduce SPaRC (Spatial Pathfinding Reasoning Challenge), a dataset of 1,000 2D grid pathfinding puzzles to evaluate spatial and symbolic reasoning, requiring step-by-step planning with arithmetic and geometric rules. Humans achieve near-perfect accuracy (98.0%; 94.5% on hard puzzles), while the best reasoning models, such as o4-mini, struggle (15.8%; 1.1% on hard puzzles). Models often generate invalid paths (>50% of puzzles for o4-mini), and reasoning tokens reveal they make errors in navigation and spatial logic. Unlike humans, who take longer on hard puzzles, models fail to scale test-time compute with difficulty. Allowing models to make multiple solution attempts improves accuracy, suggesting potential for better spatial reasoning with improved training and efficient test-time scaling methods. SPaRC can be used as a window into models' spatial reasoning limitations and drive research toward new methods that excel in abstract, multi-step problem-solving.
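The abstract reports that allowing multiple solution attempts improves model accuracy. The paper's exact decoding setup is not described here; a generic best-of-n retry loop, which is one plausible reading of "multiple solution attempts," can be sketched as follows (function names and the attempt budget are illustrative assumptions):

```python
from typing import Callable, List, Optional, Tuple

Cell = Tuple[int, int]  # (row, col); a hypothetical grid-cell encoding

def solve_with_retries(generate: Callable[[], List[Cell]],
                       validate: Callable[[List[Cell]], bool],
                       max_attempts: int = 8) -> Optional[List[Cell]]:
    """Sample up to `max_attempts` candidate paths from `generate` and
    return the first one that passes `validate`; None if all fail."""
    for _ in range(max_attempts):
        candidate = generate()  # e.g. one stochastic decode from a model
        if validate(candidate):
            return candidate
    return None
```

In this scheme the validator acts as a cheap external filter, so accuracy can improve with more attempts even if the underlying generator is unchanged.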
Problem

Research questions and friction points this paper is trying to address.

Evaluating spatial and symbolic reasoning via 2D grid pathfinding puzzles
Diagnosing why models produce invalid paths and err in navigation and spatial logic
Driving progress on abstract, multi-step problem-solving in language models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces SPaRC, a 1,000-puzzle dataset for spatial pathfinding evaluation
Puzzles require step-by-step planning under arithmetic and geometric rules
Diagnostic evaluation combining human baselines, path validity checks, and reasoning-token analysis