🤖 AI Summary
To address the labor-intensive, error-prone, and poorly transferable nature of manual prompt engineering across LLMs and tasks, this paper proposes the first framework to formalize prompt optimization as a structured AutoML problem. Methodologically: (1) it jointly searches over high-level prompting patterns (e.g., Chain-of-Thought, ReAct, ReWOO) and concrete prompt content; (2) it performs source-to-source optimization via PDL, a human-readable, editable, and reusable prompt description language; and (3) it combines successive halving with a standardized library of prompting patterns, enabling human-in-the-loop iterative refinement. Evaluated on three diverse tasks across six LLMs (8B–70B parameters), the approach achieves an average accuracy gain of 9.5±17.5 percentage points, with a maximum improvement of 68.9 pp, demonstrating that optimal prompting strategies are strongly model- and task-specific.
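The successive-halving search mentioned above can be sketched as follows. This is a minimal illustration, not AutoPDL's actual implementation: the candidate configurations, the toy quality values, and the `evaluate` function are all hypothetical stand-ins for evaluating a (prompting pattern, demonstrations) combination on a validation subset.

```python
import random

def successive_halving(candidates, evaluate, initial_budget=8):
    """Keep the best-scoring half of the candidates each round,
    doubling the per-candidate evaluation budget until one remains."""
    budget = initial_budget
    while len(candidates) > 1:
        scored = [(evaluate(c, budget), c) for c in candidates]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        candidates = [c for _, c in scored[: max(1, len(scored) // 2)]]
        budget *= 2  # survivors get evaluated on more examples
    return candidates[0]

# Hypothetical search space: (prompting pattern, number of demos) pairs.
space = [("zero-shot", 0), ("cot", 3), ("react", 3), ("rewoo", 5)]
# Made-up "true" accuracies, unknown to the search.
true_quality = dict(zip(space, [0.55, 0.72, 0.80, 0.65]))

def evaluate(config, budget):
    # Noisy accuracy estimate; noise shrinks as the budget grows,
    # mimicking evaluation on a larger validation sample.
    rng = random.Random((space.index(config), budget))
    return true_quality[config] + rng.uniform(-0.5, 0.5) / budget

best = successive_halving(list(space), evaluate)
```

Early rounds cheaply prune weak configurations on small samples; only the survivors receive larger, more reliable evaluations, which is what makes the combinatorial pattern-plus-content space tractable.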
📝 Abstract
The performance of large language models (LLMs) depends on how they are prompted, with choices spanning both the high-level prompting pattern (e.g., Zero-Shot, CoT, ReAct, ReWOO) and the specific prompt content (instructions and few-shot demonstrations). Manually tuning this combination is tedious, error-prone, and non-transferable across LLMs or tasks. Therefore, this paper proposes AutoPDL, an automated approach to discover good LLM agent configurations. Our method frames this as a structured AutoML problem over a combinatorial space of agentic and non-agentic prompting patterns and demonstrations, using successive halving to efficiently navigate this space. We introduce a library implementing common prompting patterns using the PDL prompt programming language. AutoPDL solutions are human-readable, editable, and executable PDL programs that use this library. This approach also enables source-to-source optimization, allowing human-in-the-loop refinement and reuse. Evaluations across three tasks and six LLMs (ranging from 8B to 70B parameters) show consistent accuracy gains ($9.5\pm17.5$ percentage points), up to 68.9 pp, and reveal that selected prompting strategies vary across models and tasks.