🤖 AI Summary
Current large language model (LLM) benchmarks overlook how agents handle and recover from tool failures in real-world scenarios. This work proposes ToolMaze, a benchmark that combines directed acyclic graph (DAG)-structured task topologies with four types of tool perturbation (explicit/implicit × transient/permanent) and introduces metrics such as Perturbation Recovery Rate (PRR) to separate systematic replanning from blind trial-and-error in agents. Experiments show that nearly all models degrade under perturbations, with implicit semantic failures alone reducing PRR by around 37%. Moreover, fault tolerance improves with model scale 3.66× more slowly than base task performance, pointing to dynamic replanning as a robustness bottleneck for current LLMs in complex, dynamic environments.
📝 Abstract
Existing benchmarks evaluate Tool-Integrated Reasoning (TIR) in LLMs on idealized "happy paths", largely overlooking real-world tool failures. We introduce ToolMaze, a benchmark for dynamic path discovery and error recovery in TIR agents. To separate systematic replanning from blind trial-and-error, ToolMaze adopts a two-dimensional design: DAG-based topological complexity and a 2 × 2 taxonomy of tool perturbations (explicit/implicit, transient/permanent). Evaluations show that perturbations degrade performance across nearly all models, with the sharpest drops under implicit semantic failures. Driven by systemic over-trust in corrupted outputs, Perturbation Recovery Rate (PRR) plummets by around 37% in these scenarios, while complex topologies trap agents in futile trial-and-error loops. Crucially, agentic fault-tolerance improves with model scale 3.66× slower than basic task execution, highlighting dynamic replanning as a distinct bottleneck unaddressed by model scaling or prompting. Data and code are available at https://github.com/Zhudongsheng75/ToolMaze.
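
To make the 2 × 2 perturbation taxonomy concrete, the following is a minimal Python sketch of a tool wrapper that injects each of the four failure types. It is not taken from the ToolMaze code base: the names `Visibility`, `Duration`, `ToolError`, `PerturbedTool`, and `transient_failures` are hypothetical, and the readings of the four categories are assumptions (explicit = the tool raises an error, implicit = the tool silently returns a wrong value, transient = the fault clears after a fixed number of calls, permanent = the fault never clears).

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable


class Visibility(Enum):
    EXPLICIT = "explicit"  # assumed: failure is signalled via an exception
    IMPLICIT = "implicit"  # assumed: failure is a silently corrupted result


class Duration(Enum):
    TRANSIENT = "transient"  # assumed: fault clears after a few calls
    PERMANENT = "permanent"  # assumed: fault never clears


class ToolError(RuntimeError):
    """Hypothetical error type for an explicit tool failure."""


@dataclass
class PerturbedTool:
    """Wraps an integer tool and injects one of the four perturbation types."""
    fn: Callable[[int], int]
    visibility: Visibility
    duration: Duration
    transient_failures: int = 1  # hypothetical: failing calls before recovery
    calls: int = 0

    def __call__(self, x: int) -> int:
        self.calls += 1
        failing = (
            self.duration is Duration.PERMANENT
            or self.calls <= self.transient_failures
        )
        if not failing:
            return self.fn(x)
        if self.visibility is Visibility.EXPLICIT:
            raise ToolError("tool unavailable")
        # Implicit failure: return a plausible but wrong value with no signal.
        return self.fn(x) + 1


def double(x: int) -> int:
    return 2 * x


# Explicit + transient: the first call raises, a retry succeeds.
flaky = PerturbedTool(double, Visibility.EXPLICIT, Duration.TRANSIENT)
try:
    flaky(3)
    first_call_raised = False
except ToolError:
    first_call_raised = True
retry_result = flaky(3)

# Implicit + permanent: every call returns a corrupted value without an error.
corrupt = PerturbedTool(double, Visibility.IMPLICIT, Duration.PERMANENT)
corrupt_results = [corrupt(3), corrupt(3)]
```

Under these assumptions, a simple retry is enough for the explicit, transient case, whereas the implicit, permanent case can only be handled by an agent that checks the output and routes around the tool, which is the kind of replanning the abstract contrasts with blind trial-and-error.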