SPaRC: A Spatial Pathfinding Reasoning Challenge

📅 2025-05-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
Problem: Existing reasoning benchmarks exhibit saturation and poor performance on abstract, multi-step spatial and symbolic reasoning, such as path planning and geometric/arithmetic constraint satisfaction.

Method: We introduce SPaRC, the first systematic benchmark for evaluating large language models' (LLMs') capabilities in 2D grid-based spatial path optimization and complex constraint satisfaction, comprising 1,000 puzzles that emphasize stepwise planning and rule-governed reasoning. Puzzles are generated via human-designed formal rules; evaluation integrates human baselines, path validity verification, and reasoning-token tracing for diagnostic analysis.

Contribution/Results: Human accuracy reaches 98.0% (94.5% on hard instances), while the strongest model, o4-mini, achieves only 15.8% (1.1% on hard instances), with over 50% of generated paths invalid. Multi-attempt decoding substantially improves performance. SPaRC uncovers fundamental deficiencies in LLMs' navigational logic, spatial modeling, and adaptive computational depth, establishing a new paradigm for rigorous reasoning evaluation.
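The summary mentions path validity verification as part of the evaluation pipeline. SPaRC's actual puzzle representation and rule set are not given here; as a minimal illustrative sketch, a structural validity check for a proposed grid path might look like the following (the coordinate scheme, function name, and orthogonal-adjacency rule are assumptions, not the paper's specification):

```python
from typing import List, Tuple

Cell = Tuple[int, int]  # (row, col); a hypothetical grid-cell encoding

def is_valid_path(path: List[Cell], rows: int, cols: int,
                  start: Cell, goal: Cell) -> bool:
    """Check that a proposed path is structurally valid on a rows x cols grid:
    it starts at `start`, ends at `goal`, stays in bounds, never revisits a
    cell, and moves only between orthogonally adjacent cells."""
    if not path or path[0] != start or path[-1] != goal:
        return False
    if len(set(path)) != len(path):  # reject any revisited cell
        return False
    for r, c in path:
        if not (0 <= r < rows and 0 <= c < cols):  # stay inside the grid
            return False
    for (r1, c1), (r2, c2) in zip(path, path[1:]):
        if abs(r1 - r2) + abs(c1 - c2) != 1:  # each step must be one orthogonal move
            return False
    return True
```

A check of this kind only establishes that a path is well-formed; puzzle-specific arithmetic and geometric constraints would have to be verified separately.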

📝 Abstract
Existing reasoning datasets saturate and fail to test abstract, multi-step problems, especially pathfinding and complex rule constraint satisfaction. We introduce SPaRC (Spatial Pathfinding Reasoning Challenge), a dataset of 1,000 2D grid pathfinding puzzles to evaluate spatial and symbolic reasoning, requiring step-by-step planning with arithmetic and geometric rules. Humans achieve near-perfect accuracy (98.0%; 94.5% on hard puzzles), while the best reasoning models, such as o4-mini, struggle (15.8%; 1.1% on hard puzzles). Models often generate invalid paths (>50% of puzzles for o4-mini), and reasoning tokens reveal they make errors in navigation and spatial logic. Unlike humans, who take longer on hard puzzles, models fail to scale test-time compute with difficulty. Allowing models to make multiple solution attempts improves accuracy, suggesting potential for better spatial reasoning with improved training and efficient test-time scaling methods. SPaRC can be used as a window into models' spatial reasoning limitations and drive research toward new methods that excel in abstract, multi-step problem-solving.
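The abstract reports that allowing multiple solution attempts improves model accuracy. The paper's exact decoding setup is not described here; a generic best-of-n retry loop, which is one plausible reading of "multiple solution attempts," can be sketched as follows (function names and the attempt budget are illustrative assumptions):

```python
from typing import Callable, List, Optional, Tuple

Cell = Tuple[int, int]  # (row, col); a hypothetical grid-cell encoding

def solve_with_retries(generate: Callable[[], List[Cell]],
                       validate: Callable[[List[Cell]], bool],
                       max_attempts: int = 8) -> Optional[List[Cell]]:
    """Sample up to `max_attempts` candidate paths from `generate` and
    return the first one that passes `validate`; None if all fail."""
    for _ in range(max_attempts):
        candidate = generate()  # e.g. one stochastic decode from a model
        if validate(candidate):
            return candidate
    return None
```

In this scheme the validator acts as a cheap external filter, so accuracy can improve with more attempts even if the underlying generator is unchanged.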
Problem

Research questions and friction points this paper is trying to address.

Evaluating spatial and symbolic reasoning via 2D grid pathfinding puzzles
Diagnosing why models produce invalid paths and err in navigation and spatial logic
Driving progress on abstract, multi-step problem-solving in language models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces SPaRC, a 1,000-puzzle dataset for spatial pathfinding evaluation
Puzzles require step-by-step planning under arithmetic and geometric rules
Diagnostic evaluation combining human baselines, path validity checks, and reasoning-token analysis