When Tools Fail: Benchmarking Dynamic Replanning and Anomaly Recovery in LLM Agents

📅 2026-06-04

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Current large language model (LLM) benchmarks largely overlook how agents handle and recover from tool failures in real-world scenarios. This work proposes ToolMaze, a benchmark that combines directed acyclic graph (DAG)-structured task topologies with four types of tool perturbations (explicit/implicit × transient/permanent) and introduces metrics such as Perturbation Recovery Rate (PRR) to distinguish systematic replanning from blind trial-and-error behavior in agents. Experiments show that nearly all models degrade under perturbations, with implicit semantic failures reducing PRR by around 37%. Moreover, fault tolerance improves with model scale 3.66× more slowly than base task performance, pointing to dynamic replanning as a distinct robustness bottleneck for current LLMs in complex, dynamic environments.

📝 Abstract

Existing benchmarks evaluate Tool-Integrated Reasoning (TIR) in LLMs on idealized "happy paths", largely overlooking real-world tool failures. We introduce ToolMaze, a benchmark for dynamic path discovery and error recovery in TIR agents. To separate systematic replanning from blind trial-and-error, ToolMaze adopts a two-dimensional design: DAG-based topological complexity and a 2 × 2 taxonomy of tool perturbations (explicit/implicit, transient/permanent). Evaluations show that perturbations degrade performance across nearly all models, with the sharpest drops under implicit semantic failures. Driven by systemic over-trust in corrupted outputs, Perturbation Recovery Rate (PRR) plummets by around 37% in these scenarios, while complex topologies trap agents in futile trial-and-error loops. Crucially, agentic fault-tolerance improves with model scale 3.66× slower than basic task execution, highlighting dynamic replanning as a distinct bottleneck unaddressed by model scaling or prompting. Data and code are available at https://github.com/Zhudongsheng75/ToolMaze.
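
To make the 2 × 2 perturbation taxonomy in the abstract concrete, the following is a minimal sketch of how the four cells might be represented and how a per-cell recovery rate could be tallied. It is not taken from the ToolMaze repository. The names `Visibility`, `Persistence`, `Episode`, and `recovery_rate_by_cell` are hypothetical, the one-line glosses on each perturbation type are an interpretation of the abstract, and the rate computed here (the fraction of perturbed episodes the agent eventually completed) is only an assumed stand-in, since the abstract does not give the formal definition of PRR.

```python
from dataclasses import dataclass
from enum import Enum
from itertools import product


class Visibility(Enum):
    # Assumed reading: an explicit failure is signaled by the tool itself.
    EXPLICIT = "explicit"
    # Assumed reading: an implicit failure returns plausible but corrupted output.
    IMPLICIT = "implicit"


class Persistence(Enum):
    # Assumed reading: a transient failure may clear on a later call.
    TRANSIENT = "transient"
    # Assumed reading: a permanent failure forces the agent onto another path.
    PERMANENT = "permanent"


@dataclass(frozen=True)
class Episode:
    """One perturbed task run (hypothetical record layout)."""
    visibility: Visibility
    persistence: Persistence
    recovered: bool  # True if the agent still completed the task


def recovery_rate_by_cell(episodes):
    """Return the fraction of recovered episodes for each of the four cells.

    This is an illustrative stand-in for PRR, not the paper's definition.
    Cells with no episodes map to None.
    """
    rates = {}
    for vis, per in product(Visibility, Persistence):
        outcomes = [
            e.recovered
            for e in episodes
            if e.visibility is vis and e.persistence is per
        ]
        rates[(vis, per)] = sum(outcomes) / len(outcomes) if outcomes else None
    return rates


if __name__ == "__main__":
    demo = [
        Episode(Visibility.EXPLICIT, Persistence.TRANSIENT, True),
        Episode(Visibility.EXPLICIT, Persistence.TRANSIENT, False),
        Episode(Visibility.IMPLICIT, Persistence.PERMANENT, False),
    ]
    for (vis, per), rate in recovery_rate_by_cell(demo).items():
        print(vis.value, per.value, rate)
```

Keeping the two axes as separate enums mirrors the two-dimensional design described above: results can be sliced by either axis alone (for example, all implicit failures) without redefining the four combined categories.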

Problem

Research questions and friction points this paper is trying to address.

Tool Failure

Dynamic Replanning

Anomaly Recovery

LLM Agents

Perturbation Robustness

Innovation

Methods, ideas, or system contributions that make the work stand out.

dynamic replanning

anomaly recovery

tool-integrated reasoning