🤖 AI Summary
This work addresses the challenge of generating provably correct generalized policies—Python programs with formal correctness guarantees—as executable strategies within PDDL-defined world models, without requiring external verifiers. We propose LMPlan, a framework that uses prompt engineering to guide language models (LMs) to directly synthesize formally verifiable policy programs, tightly integrating PDDL domain modeling with built-in formal correctness guarantees. Our key contribution is the first demonstration of provably correct, programmatic policy generation that requires no external verification. Empirically, we find that LMs achieve superior performance when processing symbolic PDDL inputs—a result that challenges the conventional assumption that LM success relies primarily on semantic understanding and on memorization of training data. Under fixed computational resources, LMPlan significantly outperforms both classical PDDL planners and state-of-the-art LM-based approaches, scaling effectively to problems involving hundreds of objects.
📝 Abstract
We study the use of language models (LMs) for planning over world models specified in the Planning Domain Definition Language (PDDL). We prompt LMs to generate Python programs that serve as generalised policies for solving PDDL problems from a given domain. Notably, our approach synthesises policies that are provably sound relative to the PDDL domain without reliance on external verifiers. Experiments on competition benchmarks show that our policies solve more PDDL problems than PDDL planners and recent LM approaches under fixed time and memory constraints. Our approach manifests in the LMPlan planner, which can solve planning problems with several hundred relevant objects. Surprisingly, we observe that LMs used in our framework sometimes plan more effectively over PDDL problems written with meaningless symbols in place of natural language, e.g. rewriting (at dog kitchen) as (p2 o1 o3). This finding challenges hypotheses that LMs reason over word semantics and memorise solutions from their training corpora, and is worth further exploration.
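The symbol-rewriting experiment described above can be sketched as a simple token substitution over PDDL atoms. This is a minimal illustration assuming a precomputed mapping from natural-language symbols to opaque names (the `p*`/`o*` naming scheme follows the example in the abstract; the actual rewriting used in the paper may differ):

```python
def obfuscate(pddl_atom: str, mapping: dict) -> str:
    """Replace natural-language PDDL symbols with meaningless tokens.

    `mapping` assigns each predicate and object an opaque name,
    e.g. predicates -> p1, p2, ... and objects -> o1, o2, ...
    Unmapped tokens pass through unchanged.
    """
    tokens = pddl_atom.strip("()").split()
    return "(" + " ".join(mapping.get(t, t) for t in tokens) + ")"

# Assumed mapping reproducing the abstract's example:
mapping = {"at": "p2", "dog": "o1", "kitchen": "o3"}
print(obfuscate("(at dog kitchen)", mapping))  # -> (p2 o1 o3)
```

Applying the same mapping consistently across a domain and its problems preserves the planning task's structure exactly while removing all lexical cues, which is what isolates structural reasoning from semantic priming or memorisation.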