AdaPlanBench: Evaluating Adaptive Planning in Large Language Model Agents under World and User Constraints

📅 2026-06-03

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing benchmarks struggle to evaluate how well large language models plan adaptively when constraints from both the world and the user are revealed dynamically. To address this gap, this work proposes AdaPlanBench, a dynamic, interactive evaluation benchmark built upon 307 household tasks. Through a multi-turn interaction protocol, a hidden constraint is disclosed only when the agent proposes a plan that violates it, which forces continuous reasoning and iterative replanning. The framework introduces, for the first time, systematic support for dual-constraint revelation, featuring an extensible constraint-generation pipeline, an interactive simulation environment, an automated constraint-injection mechanism, and a multi-round plan validation protocol. Experiments across ten mainstream large language models show that even the best-performing model achieves only 67.75% accuracy, and that performance degrades markedly as constraints accumulate, particularly under user-imposed constraints.

📝 Abstract

Planning for real-world problems by language models often involves both world and user constraints, which may not be fully specified upfront and are progressively disclosed through interaction. However, existing benchmarks still underexplore adaptive planning under such progressively revealed dual constraints. To address this gap, we introduce AdaPlanBench, a dynamic interactive benchmark for evaluating whether Large Language Model (LLM) agents can adaptively plan and re-plan under progressively revealed world and user constraints. AdaPlanBench is built on 307 household tasks, with a scalable constraint construction pipeline that augments each task with dual constraints. At runtime, agents interact with the environment in a multi-turn protocol where hidden constraints are revealed only when the agent proposes a plan that violates them, requiring iterative plan revision under accumulating feedback. This makes planning challenging, as agents must infer and track constraints from feedback while re-planning effectively. Experiments on ten leading LLMs show that adaptive planning under dual constraints remains challenging, with the best model reaching only 67.75% accuracy. We further observe that performance degrades as more constraints accumulate, with user constraints posing a particularly large challenge and failures often stemming from weaker physical grounding and reduced effectiveness. These results establish AdaPlanBench as a testbed for dual-constrained interactive planning and highlight the challenge of reliable adaptation to dynamically revealed constraints in LLM agents.
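The reveal-on-violation protocol described in the abstract can be illustrated with a small sketch. This is not the benchmark's actual code: the `Constraint` class, the `run_episode` function, the `max_turns` budget, the string-matching violation checks, and the toy task are all hypothetical, chosen only to show how hidden world and user constraints might surface as feedback when a proposed plan violates them.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Plan = List[str]  # assumption: a plan is an ordered list of action strings


@dataclass(frozen=True)
class Constraint:
    kind: str                            # "world" or "user"
    description: str                     # feedback text shown once revealed
    violated_by: Callable[[Plan], bool]  # hypothetical violation checker


def run_episode(
    planner: Callable[[str, List[str]], Plan],
    task: str,
    hidden: List[Constraint],
    max_turns: int = 5,                  # assumed turn budget, not from the paper
) -> Tuple[bool, int, List[Constraint]]:
    """Run one multi-turn episode; constraints are revealed only on violation."""
    revealed: List[Constraint] = []
    for turn in range(1, max_turns + 1):
        # The planner sees the task plus all feedback accumulated so far.
        plan = planner(task, [c.description for c in revealed])
        violated = [c for c in hidden if c.violated_by(plan)]
        if not violated:
            return True, turn, revealed  # plan satisfies every constraint
        for c in violated:
            if c not in revealed:
                revealed.append(c)       # disclose newly violated constraints
    return False, max_turns, revealed


# Toy task with one world constraint and one user constraint (both invented).
hidden = [
    Constraint("world", "The kettle is broken.",
               lambda p: any("kettle" in step for step in p)),
    Constraint("user", "The user does not want the microwave used.",
               lambda p: any("microwave" in step for step in p)),
]


def toy_planner(task: str, feedback: List[str]) -> Plan:
    # A scripted stand-in for an LLM agent: it changes approach per feedback item.
    if not feedback:
        return ["boil water in kettle", "pour water into cup"]
    if len(feedback) == 1:
        return ["heat water in microwave", "pour water into cup"]
    return ["heat water in pot on stove", "pour water into cup"]


success, turns, revealed = run_episode(toy_planner, "make tea", hidden)
print(success, turns, [c.kind for c in revealed])
```

In this toy run the scripted planner needs three turns: the first plan reveals the world constraint, the second reveals the user constraint, and the third satisfies both. A real agent would have to infer a valid alternative from the accumulated feedback rather than follow a fixed script, which is where the abstract locates the difficulty.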

Problem

Research questions and friction points this paper is trying to address.

adaptive planning

large language models

world constraints

user constraints

interactive planning

Innovation

Methods, ideas, or system contributions that make the work stand out.

adaptive planning

large language models

interactive benchmark