🤖 AI Summary
This work addresses a limitation of existing evaluations for large language model (LLM) agents: they rely predominantly on end-to-end success rates and therefore cannot show whether a failure stems from planning or from execution. To this end, we introduce the Agent Planning Benchmark (APB), the first fine-grained diagnostic benchmark specifically designed to assess LLM agents’ planning capabilities, comprising 4,209 multimodal tasks across 22 domains and 5 experimental settings. APB systematically evaluates critical planning skills, including goal decomposition, tool selection, constraint reasoning, and recognition of infeasible tasks. It features a hierarchical planning evaluation protocol that assesses long-horizon planning, robustness to extraneous and broken tools, calibrated refusal of unsolvable tasks, and inference-time refinement. Experiments across 12 multimodal LLMs (MLLMs) reveal systematic planning deficiencies in these areas. In an external validation on 200 ToolSandbox tasks and 200 τ²-bench tasks, APB-guided refinement consistently improves plan correctness, plan grade, and downstream execution metrics for three representative models.
📝 Abstract
Planning is central to LLM agents: before acting, an agent must decompose goals, select tools, reason over constraints, and decide when a task is infeasible. Yet existing agent evaluations often report only end-to-end success, making it difficult to determine whether failures stem from planning or execution. We introduce the Agent Planning Benchmark (APB), a planning-specific diagnostic benchmark with 4,209 multimodal cases across 22 domains and five settings, covering holistic planning, feedback-conditioned step-wise planning, and robustness under extraneous tools, broken tools, and unsolvable tasks. Across 12 multimodal LLMs (MLLMs), APB reveals systematic weaknesses in long-horizon planning, tool-noise robustness, calibrated refusal, and inference-time refinement. We further validate APB on 200 ToolSandbox tasks and 200 τ²-bench tasks, where APB-guided refinement consistently improves plan correctness, plan grade, and downstream execution metrics across three representative models. APB thus serves as an upstream diagnostic complement to execution benchmarks.