🤖 AI Summary
This study systematically evaluates the zero-shot ability of 20 mainstream large language models (LLMs), spanning seven major families and including both commercial and open-source variants, to understand, parse, and generate Planning Domain Definition Language (PDDL) for the automated formalization of planning tasks. Method: We introduce the first standardized, multi-dimensional zero-shot PDDL benchmark, integrating syntactic structure analysis, semantic constraint modeling, logical consistency verification, plan executability testing, and expert human validation. Contribution/Results: While LLMs achieve moderate performance on simple PDDL tasks, their accuracy drops markedly on complex planning reasoning, such as causal-chain derivation and constraint satisfaction, reaching a maximum accuracy of only 68.3%. This reveals fundamental deficiencies in formal logical reasoning and domain-specific knowledge representation. Our work establishes the first empirical baseline and capability map for leveraging LLMs in symbolic AI planning.
📝 Abstract
Recent advances have shown that large language models (LLMs) are proficient in code generation and chain-of-thought reasoning, laying the groundwork for tackling automatic formal planning tasks. This study evaluates the potential of LLMs to understand and generate Planning Domain Definition Language (PDDL), an essential representation in artificial intelligence planning. We conduct an extensive analysis across 20 distinct models spanning 7 major LLM families, both commercial and open-source. Our comprehensive evaluation sheds light on the zero-shot capabilities of LLMs in parsing, generating, and reasoning with PDDL. Our findings indicate that while some models demonstrate notable effectiveness in handling PDDL, others show clear limitations in more complex scenarios requiring nuanced planning knowledge. These results highlight both the promise and the current limitations of LLMs in formal planning tasks, offering insights into their application and guiding future efforts in AI-driven planning paradigms.
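To give a concrete sense of the formal representation the models are asked to produce, a minimal PDDL domain fragment is sketched below. This is an illustrative, hypothetical example in standard STRIPS-style PDDL (a simple pick action in a one-gripper world), not a task drawn from the study's benchmark:

```pddl
;; Hypothetical illustrative domain -- not from the evaluated benchmark.
(define (domain simple-gripper)
  (:requirements :strips)
  (:predicates
    (robot-at ?room)      ; the robot is in ?room
    (ball-at ?ball ?room) ; ?ball lies in ?room
    (holding ?ball))      ; the gripper holds ?ball
  (:action pick
    :parameters (?ball ?room)
    :precondition (and (ball-at ?ball ?room) (robot-at ?room))
    :effect (and (holding ?ball) (not (ball-at ?ball ?room)))))
```

Generating such a domain requires the model to get both the Lisp-like syntax and the logical semantics right: a misplaced parenthesis breaks parsing, while an incorrect precondition or effect silently yields unexecutable plans, which is why the benchmark checks syntax, semantics, and plan executability separately.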