Seemingly Simple Planning Problems are Computationally Challenging: The Countdown Game

📅 2025-08-04
🤖 AI Summary
Existing planning benchmarks suffer from two limitations: they either rely on loosely defined tasks (e.g., travel planning) that are hard to formalize and verify, or they reuse domains from the International Planning Competition (IPC) that were designed to challenge classical automated planners. Both make it difficult to evaluate the long-horizon planning capabilities of large language models objectively. To address this, the paper proposes a procedure for building a planning benchmark around Countdown, a game in which a player must form a target number from a list of input numbers using arithmetic operations. Each instance has an intuitive natural-language description, solutions are formally verifiable, the problem is NP-complete, and the instance space is rich enough to resist memorization. The authors establish the computational complexity result, show the advantage of their dynamic instance generation procedure over public benchmarks, and evaluate a variety of LLM-assisted planning methods. Unlike the 24 Game (a special case of Countdown), the resulting dynamic benchmark remains extremely challenging for existing LLM-based approaches.

📝 Abstract
There is a broad consensus that the inability to form long-term plans is one of the key limitations of current foundational models and agents. However, existing planning benchmarks remain woefully inadequate for measuring their planning capabilities. Most existing benchmarks either focus on loosely defined tasks like travel planning or reuse existing domains and problems from international planning competitions. While the former tasks are hard to formalize and verify, the latter were specifically designed to test and challenge the weaknesses of existing automated planners. To address these shortcomings, we propose a procedure for creating a planning benchmark centered around the game called Countdown, where a player is expected to form a target number from a list of input numbers through arithmetic operations. We discuss how this problem meets many of the desiderata associated with an ideal benchmark for evaluating planning capabilities. Specifically, the domain allows for an intuitive, natural language description of each problem instance, it is computationally challenging (NP-complete), and the instance space is rich enough that we do not have to worry about memorization. We perform an extensive theoretical analysis, establish the computational complexity result, and demonstrate the advantage of our instance generation procedure over public benchmarks. We evaluate a variety of existing LLM-assisted planning methods on instances generated using our procedure. Our results show that, unlike other domains such as the 24 Game (a special case of Countdown), our proposed dynamic benchmark remains extremely challenging for existing LLM-based approaches.
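To make the task concrete, here is a minimal brute-force sketch of a Countdown solver in Python. It is not the paper's method. It assumes that every input number must be used exactly once (as in the 24 Game), that the operators are +, -, * and /, and that fractional intermediate results are allowed; the paper's exact rules may differ. The names `solve` and `_search` are hypothetical.

```python
from fractions import Fraction
from itertools import combinations


def solve(numbers, target):
    """Return one expression string that evaluates to target, or None."""
    # Each item pairs an exact value with the expression that produced it.
    start = [(Fraction(n), str(n)) for n in numbers]
    return _search(start, Fraction(target))


def _search(items, target):
    # Base case: one value left, so every input number has been consumed.
    if len(items) == 1:
        value, expr = items[0]
        return expr if value == target else None
    # Pick any two remaining values and replace them with one combined value.
    for i, j in combinations(range(len(items)), 2):
        (a, ea), (b, eb) = items[i], items[j]
        rest = [items[k] for k in range(len(items)) if k not in (i, j)]
        candidates = [
            (a + b, f"({ea} + {eb})"),
            (a * b, f"({ea} * {eb})"),
            (a - b, f"({ea} - {eb})"),
            (b - a, f"({eb} - {ea})"),
        ]
        # Division is only tried when the divisor is nonzero.
        if b != 0:
            candidates.append((a / b, f"({ea} / {eb})"))
        if a != 0:
            candidates.append((b / a, f"({eb} / {ea})"))
        for value, expr in candidates:
            found = _search(rest + [(value, expr)], target)
            if found is not None:
                return found
    return None
```

Calling `solve([4, 7, 8, 8], 24)` returns some expression that evaluates to 24, while `solve([1, 1, 1, 1], 24)` returns `None`. The search tree grows very quickly with the number of inputs, which is consistent with the hardness the abstract describes.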

Problem

Research questions and friction points this paper is trying to address.

Creating a planning benchmark using the Countdown game
Evaluating the computational complexity of planning problems
Assessing LLM-based planning methods on dynamic benchmarks
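Part of what makes such an assessment tractable is that a proposed answer can be checked mechanically. The sketch below shows one way a checker might look; it is an illustration, not the paper's evaluation code. It assumes an answer is a single arithmetic expression over non-negative integer literals, that every input number must appear exactly once, and the name `check_answer` is hypothetical.

```python
import ast
from collections import Counter
from fractions import Fraction

# Only the four basic binary operators are accepted.
_OPS = {
    ast.Add: lambda a, b: a + b,
    ast.Sub: lambda a, b: a - b,
    ast.Mult: lambda a, b: a * b,
    ast.Div: lambda a, b: a / b,
}


def _evaluate(node, used):
    """Evaluate an expression tree exactly, recording each literal in `used`."""
    if isinstance(node, ast.Constant) and type(node.value) is int:
        used.append(node.value)
        return Fraction(node.value)
    if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
        left = _evaluate(node.left, used)
        right = _evaluate(node.right, used)
        return _OPS[type(node.op)](left, right)
    # Anything else (names, calls, unary minus, ...) is rejected.
    raise ValueError("unsupported syntax")


def check_answer(expression, numbers, target):
    """Return True if `expression` uses each number once and equals `target`."""
    used = []
    try:
        tree = ast.parse(expression, mode="eval")
        value = _evaluate(tree.body, used)
    except (SyntaxError, ValueError, ZeroDivisionError):
        return False
    return Counter(used) == Counter(numbers) and value == target
```

Parsing with `ast` instead of calling `eval` means a model's output is never executed, and exact `Fraction` arithmetic avoids floating-point false negatives.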

Innovation

Methods, ideas, or system contributions that make the work stand out.

Proposes the Countdown game as a planning benchmark
Establishes that the problem is NP-complete
Shows that the dynamic benchmark remains challenging for LLM-based approaches
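The summary does not describe the paper's instance generation procedure, so the following is only an illustrative sketch of what generating fresh instances on demand could look like. It assumes positive integer inputs, builds the target by randomly combining the inputs so that every instance is solvable by construction, and keeps all intermediate values as positive integers. The function name `generate_instance` and its parameters are hypothetical.

```python
import random


def generate_instance(n_numbers=6, low=1, high=50, seed=None):
    """Return (numbers, target, witness) where witness evaluates to target."""
    rng = random.Random(seed)
    numbers = [rng.randint(low, high) for _ in range(n_numbers)]
    # Pool of (value, expression) pairs; merged pairwise until one remains.
    pool = [(n, str(n)) for n in numbers]
    while len(pool) > 1:
        a, ea = pool.pop(rng.randrange(len(pool)))
        b, eb = pool.pop(rng.randrange(len(pool)))
        options = [(a + b, f"({ea} + {eb})"), (a * b, f"({ea} * {eb})")]
        # Subtraction and division are offered only when the result
        # stays a positive integer.
        if a > b:
            options.append((a - b, f"({ea} - {eb})"))
        if a % b == 0:
            options.append((a // b, f"({ea} / {eb})"))
        pool.append(rng.choice(options))
    target, witness = pool[0]
    return numbers, target, witness
```

Because instances are sampled rather than drawn from a fixed public list, a model cannot rely on having seen them before. A real generator would likely also control target magnitude and difficulty, which this sketch does not attempt.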