🤖 AI Summary
This study systematically evaluates the zero-shot ability of 20 mainstream large language models (LLMs), spanning seven major families and including both commercial and open-source variants, to understand, parse, and generate Planning Domain Definition Language (PDDL) for the automated formalization of planning tasks. Method: We introduce the first standardized, multi-dimensional zero-shot PDDL benchmark, integrating syntactic structure analysis, semantic constraint modeling, logical consistency verification, plan executability testing, and expert human validation. Contribution/Results: While LLMs achieve moderate performance on simple PDDL tasks, their accuracy drops markedly on complex planning reasoning, such as causal-chain derivation and constraint satisfaction, reaching a maximum accuracy of only 68.3%. This reveals fundamental deficiencies in formal logical reasoning and domain-specific knowledge representation. Our work establishes the first empirical baseline and capability map for leveraging LLMs in symbolic AI planning.
📝 Abstract
Recent advances have shown that large language models (LLMs) are proficient in code generation and chain-of-thought reasoning, laying the groundwork for tackling automatic formal planning tasks. This study evaluates the potential of LLMs to understand and generate Planning Domain Definition Language (PDDL), an essential representation in artificial intelligence planning. We conduct an extensive analysis across 20 distinct models spanning 7 major LLM families, both commercial and open-source. Our comprehensive evaluation sheds light on the zero-shot capabilities of LLMs in parsing, generating, and reasoning with PDDL. Our findings indicate that while some models demonstrate notable effectiveness in handling PDDL, others show clear limitations in more complex scenarios requiring nuanced planning knowledge. These results highlight both the promise and the current limitations of LLMs in formal planning tasks, offering insights into their application and guiding future efforts in AI-driven planning paradigms.
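To give a concrete sense of the formal representation the models are asked to produce, a minimal PDDL domain fragment is sketched below. This is an illustrative, hypothetical example in standard STRIPS-style PDDL (a simple pick action in a one-gripper world), not a task drawn from the study's benchmark:

```pddl
;; Hypothetical illustrative domain -- not from the evaluated benchmark.
(define (domain simple-gripper)
  (:requirements :strips)
  (:predicates
    (robot-at ?room)      ; the robot is in ?room
    (ball-at ?ball ?room) ; ?ball lies in ?room
    (holding ?ball))      ; the gripper holds ?ball
  (:action pick
    :parameters (?ball ?room)
    :precondition (and (ball-at ?ball ?room) (robot-at ?room))
    :effect (and (holding ?ball) (not (ball-at ?ball ?room)))))
```

Generating such a domain requires the model to get both the Lisp-like syntax and the logical semantics right: a misplaced parenthesis breaks parsing, while an incorrect precondition or effect silently yields unexecutable plans, which is why the benchmark checks syntax, semantics, and plan executability separately.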