🤖 AI Summary
Existing planning benchmarks suffer from two critical limitations: either ambiguous task definitions (e.g., travel planning) or excessive tailoring to the weaknesses of classical planners (e.g., IPC domains), both of which hinder objective evaluation of large language models' long-horizon planning capabilities. To address this, we propose a novel planning benchmark grounded in the NP-complete Countdown game, where tasks are generated as natural-language arithmetic goals, ensuring formal verifiability, high computational complexity, and rich instance diversity. We are the first to systematically model Countdown as a planning evaluation task, to design a scalable, memorization-resistant dynamic instance generation method, and to establish a theoretical complexity analysis framework based on the arithmetic operator space. Experiments demonstrate that state-of-the-art LLMs perform significantly below human-level accuracy, confirming the benchmark's rigor and sensitivity and exposing fundamental limitations of current AI in complex, multi-step deductive planning.
📝 Abstract
There is broad consensus that the inability to form long-term plans is one of the key limitations of current foundation models and agents. However, existing planning benchmarks remain woefully inadequate for truly measuring their planning capabilities. Most either focus on loosely defined tasks like travel planning or leverage existing domains and problems from the International Planning Competitions. While the former tasks are hard to formalize and verify, the latter were specifically designed to test and challenge the weaknesses of existing automated planners. To address these shortcomings, we propose a procedure for creating a planning benchmark centered on the game Countdown, in which a player must form a target number from a list of input numbers through arithmetic operations. We discuss how this problem meets many of the desiderata for an ideal benchmark of planning capabilities: the domain allows an intuitive natural-language description of each problem instance, it is computationally challenging (NP-complete), and the instance space is rich enough that memorization is not a concern. We perform an extensive theoretical analysis, establishing the computational complexity result, and demonstrate the advantage of our instance generation procedure over public benchmarks. We evaluate a variety of existing LLM-assisted planning methods on instances generated using our procedure. Our results show that, unlike simpler domains such as the 24 Game (a special case of Countdown), our proposed dynamic benchmark remains extremely challenging for existing LLM-based approaches.
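To make the task concrete, here is a minimal brute-force Countdown solver sketch in Python. It assumes common Countdown conventions (each input number used at most once, subtraction must yield a positive result, division must be exact); these rules, and the `solve` function itself, are illustrative assumptions for this summary, not the paper's actual instance generator or evaluation harness.

```python
def solve(nums, target):
    """Try to build `target` from `nums` via +, -, *, / (exact division only).

    Returns an expression string on success, or None if no expression exists.
    Illustrative brute force: repeatedly pick two remaining values, combine
    them with one operator, and recurse on the shrunken list.
    """
    def rec(items):
        # items: list of (value, expression-string) pairs still available.
        for value, expr in items:
            if value == target:
                return expr
        n = len(items)
        for i in range(n):
            for j in range(n):
                if i == j:
                    continue
                a, ea = items[i]
                b, eb = items[j]
                rest = [items[k] for k in range(n) if k not in (i, j)]
                # Candidate combinations under the assumed Countdown rules.
                candidates = [(a + b, f"({ea}+{eb})"), (a * b, f"({ea}*{eb})")]
                if a - b > 0:                 # keep intermediates positive
                    candidates.append((a - b, f"({ea}-{eb})"))
                if b != 0 and a % b == 0:     # only exact division
                    candidates.append((a // b, f"({ea}/{eb})"))
                for value, expr in candidates:
                    found = rec(rest + [(value, expr)])
                    if found:
                        return found
        return None

    return rec([(v, str(v)) for v in nums])
```

This exhaustive search makes the source of difficulty visible: the branching factor grows with the number of inputs and operators, which is exactly the operator-space explosion the benchmark's complexity analysis targets.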