FALAT: Tracing Failures in LLM Agent Trajectories via Dependency-Guided Search

📅 2026-05-30

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses error attribution in large language model–based multi-agent systems, where the step that caused a task failure is often obscured by errors propagating along the execution trajectory. The authors propose FALAT, a framework that brings dependency-aware reasoning to failure diagnosis. FALAT first constructs an expected execution trajectory and uses it to localize suspicious regions. It then explicitly models dependencies among decisions, tool outputs, and inter-agent messages to separate the steps that introduce an error from the steps that merely propagate it, and checks whether correcting a candidate step would be sufficient to recover the expected outcome. Unlike approaches that classify each step in isolation, FALAT traces a failure back to its root cause. On the Who&When benchmark, FALAT outperforms specialized attribution baselines and direct prompting of standalone LLMs, with its best configurations reaching step-level attribution accuracy of 46.0% on algorithm-generated trajectories and 29.1% on hand-crafted ones.
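
To make the reported numbers concrete, the sketch below shows one plausible way to score agent-level and step-level attribution accuracy. It assumes each trajectory has a single ground-truth (agent, step) pair and that a prediction counts as correct on exact match; the function name `attribution_accuracy` and the example data are hypothetical and are not taken from the paper or the benchmark's code.

```python
from typing import List, Tuple

# Each label is a (responsible_agent, decisive_step_index) pair.
Label = Tuple[str, int]


def attribution_accuracy(predictions: List[Label],
                         ground_truth: List[Label]) -> Tuple[float, float]:
    """Return (agent_accuracy, step_accuracy) as fractions in [0, 1].

    Assumption: exact-match scoring, one labeled failure per trajectory.
    """
    if len(predictions) != len(ground_truth):
        raise ValueError("predictions and ground_truth must align")
    if not ground_truth:
        raise ValueError("need at least one trajectory")

    n = len(ground_truth)
    agent_hits = sum(p[0] == g[0] for p, g in zip(predictions, ground_truth))
    step_hits = sum(p[1] == g[1] for p, g in zip(predictions, ground_truth))
    return agent_hits / n, step_hits / n


# Made-up predictions for four failed trajectories.
preds = [("searcher", 1), ("coder", 4), ("planner", 0), ("coder", 2)]
truth = [("searcher", 1), ("coder", 3), ("verifier", 5), ("coder", 2)]

agent_acc, step_acc = attribution_accuracy(preds, truth)
print(f"agent-level: {agent_acc:.2f}, step-level: {step_acc:.2f}")
```

Under this scoring, step-level accuracy is the stricter of the two, since naming the right agent does not guarantee naming the right step.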

📝 Abstract

LLM-based agents increasingly solve complex tasks through long trajectories involving reasoning steps, tool calls, and inter-agent communication. However, when these agents fail, it is often unclear which agent caused the failure and which step introduced the decisive error. This attribution problem is challenging because mistakes can propagate across the trajectory: later actions may appear incorrect, but only because they depend on an earlier corrupted state. Therefore, failure attribution cannot be treated as independent step-level classification. We propose FALAT, a diagnostic framework for failure attribution in LLM agent trajectories. FALAT frames attribution as a dependency-guided search problem. It first constructs an expectation of how the task should be solved and uses this expectation to identify suspicious regions in the trajectory. It then traces dependencies among decisions, tool outputs, and agent messages to distinguish error-introducing steps from steps that merely inherit or propagate prior mistakes. Finally, FALAT evaluates whether correcting a candidate step would be sufficient to recover the expected outcome, allowing it to identify both the responsible agent and the decisive failure step. We evaluate FALAT on the Who&When benchmark, which includes both algorithm-generated and hand-crafted multi-agent failure trajectories. The results show that FALAT consistently improves responsible-agent and decisive-step attribution. Its best configurations achieve 46.0% step-level accuracy on algorithm-generated trajectories and 29.1% on the more challenging hand-crafted trajectories, outperforming specialized attribution baselines and direct prompting with standalone LLMs. These findings suggest that dependency-aware reasoning is essential for reliable failure diagnosis in LLM agent systems.
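
The three stages described in the abstract (flagging suspicious steps, tracing dependencies, and checking whether a correction recovers the outcome) can be illustrated with a minimal sketch. This is not the authors' implementation: the `Step` structure, the `suspicious` flag, the explicit `depends_on` edges, and the `correction_recovers` callback are all hypothetical stand-ins for components that FALAT would presumably derive with an LLM.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Step:
    index: int
    agent: str
    # Assumed to be set by comparing the step with an expected trajectory.
    suspicious: bool
    # Indices of earlier steps whose decisions, tool outputs, or messages
    # this step consumes (hypothetical explicit dependency edges).
    depends_on: List[int] = field(default_factory=list)


def has_suspicious_ancestor(steps: Dict[int, Step], start: int) -> bool:
    """True if any transitive dependency of `start` is flagged suspicious."""
    stack = list(steps[start].depends_on)
    seen = set()
    while stack:
        i = stack.pop()
        if i in seen:
            continue
        seen.add(i)
        if steps[i].suspicious:
            return True
        stack.extend(steps[i].depends_on)
    return False


def attribute_failure(
    trajectory: List[Step],
    correction_recovers: Callable[[Step], bool],
) -> Optional[Step]:
    """Return the earliest error-introducing step whose correction would
    recover the expected outcome, or None if no candidate qualifies."""
    steps = {s.index: s for s in trajectory}
    for step in sorted(trajectory, key=lambda s: s.index):
        if not step.suspicious:
            continue
        if has_suspicious_ancestor(steps, step.index):
            continue  # inherits an earlier mistake: a propagation point
        if correction_recovers(step):
            return step
    return None


# Toy trajectory: the searcher's bad result at step 1 corrupts steps 2 and 3.
trajectory = [
    Step(0, "planner", suspicious=False),
    Step(1, "searcher", suspicious=True, depends_on=[0]),
    Step(2, "coder", suspicious=True, depends_on=[1]),
    Step(3, "verifier", suspicious=True, depends_on=[2]),
]

# Stand-in for the counterfactual check; here only fixing step 1 helps.
blamed = attribute_failure(trajectory, lambda s: s.index == 1)
print(blamed.agent, blamed.index)
```

In this toy example, steps 2 and 3 look wrong but are skipped because each depends on an already-suspicious step, which is the distinction between introducing and propagating an error that the abstract emphasizes. A per-step classifier with no dependency information would have no basis for preferring step 1 over steps 2 or 3.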

Problem

Research questions and friction points this paper is trying to address.

failure attribution

LLM agents

error propagation

dependency tracing

agent trajectories

Innovation

Methods, ideas, or system contributions that make the work stand out.

failure attribution

dependency-guided search

LLM agents