Agent Planning Benchmark: A Diagnostic Framework for Planning Capabilities in LLM Agents

📅 2026-06-03

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing evaluations of large language model (LLM) agents rely mostly on end-to-end success rates, which makes it hard to tell whether a failure stems from planning or from execution. This work introduces APB, a planning-specific diagnostic benchmark with 4,209 multimodal tasks across 22 domains and 5 settings. APB evaluates core planning skills: goal decomposition, tool selection, constraint reasoning, and recognition of infeasible tasks. Its settings cover holistic planning, feedback-conditioned step-wise planning, and robustness to extraneous tools, broken tools, and unsolvable tasks. Experiments on 12 multimodal LLMs reveal systematic weaknesses in long-horizon planning, tool-noise robustness, calibrated refusal, and inference-time refinement. On 200 ToolSandbox tasks and 200 τ²-bench tasks, APB-guided refinement consistently improves plan correctness, plan grade, and downstream execution metrics across three representative models.

📝 Abstract

Planning is central to LLM agents: before acting, an agent must decompose goals, select tools, reason over constraints, and decide when a task is infeasible. Yet existing agent evaluations often report only end-to-end success, making it difficult to determine whether failures stem from planning or execution. We introduce Agent Planning Benchmark (APB), a planning-specific diagnostic benchmark with 4,209 multimodal cases across 22 domains and five settings, covering holistic planning, feedback-conditioned step-wise planning, and robustness under extraneous tools, broken tools, and unsolvable tasks. Across 12 MLLMs, APB reveals systematic weaknesses in long-horizon planning, tool-noise robustness, calibrated refusal, and inference-time refinement. We further validate APB on 200 ToolSandbox tasks and 200 τ²-bench tasks, where APB-guided refinement consistently improves plan correctness, plan grade, and downstream execution metrics across three representative models. APB thus serves as an upstream diagnostic complement to execution benchmarks.
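
To make the idea of plan-level (rather than end-to-end) evaluation concrete, here is a minimal Python sketch of how a plan could be checked without executing it. It is not APB's actual protocol or data format: the `Setting`, `Case`, and `Plan` types, their fields, and the exact-match rule in `plan_is_correct` are all hypothetical, chosen only to mirror the five settings named in the abstract.

```python
from dataclasses import dataclass, field
from enum import Enum


class Setting(Enum):
    # The five settings named in the abstract; identifiers are illustrative.
    HOLISTIC = "holistic"
    STEPWISE_FEEDBACK = "stepwise_feedback"
    EXTRANEOUS_TOOLS = "extraneous_tools"
    BROKEN_TOOLS = "broken_tools"
    UNSOLVABLE = "unsolvable"


@dataclass
class Case:
    # Hypothetical task record, not APB's real schema.
    setting: Setting
    required_tools: list[str]  # tools a correct plan calls, in order
    broken_tools: set[str] = field(default_factory=set)


@dataclass
class Plan:
    steps: list[str]       # tool names in planned order
    refused: bool = False  # agent declared the task infeasible


def plan_is_correct(case: Case, plan: Plan) -> bool:
    """Toy plan-level check; nothing is executed."""
    if case.setting is Setting.UNSOLVABLE:
        # Calibrated refusal: the only correct plan is to decline.
        return plan.refused
    if plan.refused:
        # Refusing a solvable task counts as a planning failure.
        return False
    if any(step in case.broken_tools for step in plan.steps):
        # A plan that relies on a known-broken tool is rejected.
        return False
    # Assumption: exact match on the tool sequence, so extraneous
    # (distractor) tools in the plan also make it incorrect.
    return plan.steps == case.required_tools
```

For example, under these assumptions a plan that inserts a distractor tool fails, while refusing an unsolvable task passes:

```python
case = Case(Setting.EXTRANEOUS_TOOLS, ["search", "book"])
plan_is_correct(case, Plan(["search", "weather", "book"]))               # False
plan_is_correct(Case(Setting.UNSOLVABLE, []), Plan([], refused=True))    # True
```

Because the check looks only at the plan, a failure here can be attributed to planning rather than to execution, which is the distinction the abstract draws.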

Problem

Research questions and friction points this paper is trying to address.

agent planning

planning evaluation

LLM agents

diagnostic benchmark

plan robustness

Innovation

Methods, ideas, or system contributions that make the work stand out.

Agent Planning Benchmark

planning diagnostics

multimodal LLM agents