🤖 AI Summary
This work addresses error localization in the long-horizon trajectories of deep-research agents, where evaluation based only on final answers hides the steps at which failures occur. The authors collect 2,790 real trajectories, convert the raw logs into semantically coherent spans, and annotate harmful error spans through LLM-assisted expert review. From these annotations they build TELBench, a 1,000-instance benchmark with span-level labels that supports fine-grained error localization. They also propose DRIFT, a claim-centric auditing framework that tracks agent claims, checks them against evidence in the trajectory, and marks spans where unsupported or conflicting claims affect the answer path. In experiments across model families and auditing frameworks, DRIFT improves span-level error localization and first-error accuracy by up to 30 percentage points.
📝 Abstract
Deep-research agents solve tasks through long trajectories of search, tool use, evidence inspection, and answer synthesis. Evaluation based on final answers shows whether an agent succeeds, but not which parts of the trajectory make the answer unreliable. We study span-level error localization for deep-research agents. We collect 2,790 real trajectories from two agent frameworks, three backbone models, and three benchmarks, convert raw logs into semantic spans, and annotate harmful error spans through LLM-assisted expert review. From these annotations, we build TELBench, a 1,000-instance benchmark for identifying error spans among normal exploration, failed searches, tentative hypotheses, and harmless noise. We further propose DRIFT, a claim-centric auditing framework that tracks agent claims, checks their support in trajectory evidence, and marks spans where unsupported or conflicting claims affect the answer path. Experiments across model families and auditing frameworks show that DRIFT improves span-level error localization and first-error accuracy by up to 30 percentage points. Our work provides a process-level view of reliability in deep-research agents.
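To make the claim-centric idea in the abstract concrete, here is a minimal Python sketch of auditing a trajectory span by span. It is an illustration under stated assumptions, not the authors' implementation: the `Span` schema, the `audit` function, and the exact-match support check are all hypothetical simplifications, and a real auditor would presumably need a learned or LLM-based judgment of whether evidence supports a claim.

```python
from __future__ import annotations
from dataclasses import dataclass, field


@dataclass
class Span:
    """One semantic span of a trajectory (hypothetical schema, not from the paper)."""
    index: int
    kind: str                                        # e.g. "search", "reasoning"
    evidence: set[str] = field(default_factory=set)  # statements supported by observations in this span
    refuted: set[str] = field(default_factory=set)   # statements contradicted by observations in this span
    claims: set[str] = field(default_factory=set)    # statements the agent asserts in this span


def audit(spans: list[Span], answer_claims: set[str]) -> list[int]:
    """Return indices of spans that assert unsupported or conflicting claims
    which also appear in the final answer."""
    supported: set[str] = set()
    refuted: set[str] = set()
    flagged: list[int] = []
    for span in spans:
        supported |= span.evidence
        refuted |= span.refuted
        # A claim is suspect if no evidence so far supports it,
        # or if evidence so far contradicts it.
        suspect = {c for c in span.claims if c not in supported or c in refuted}
        # Only suspect claims that reach the answer are treated as harmful;
        # abandoned hypotheses are left unflagged.
        if suspect & answer_claims:
            flagged.append(span.index)
    return flagged


# Toy trajectory with made-up claim identifiers.
spans = [
    Span(0, "search", evidence={"founded_1990"}),
    Span(1, "reasoning", claims={"founded_1990"}),  # supported by span 0
    Span(2, "reasoning", claims={"hq_paris"}),      # tentative, never reaches the answer
    Span(3, "reasoning", claims={"ceo_is_x"}),      # unsupported and used in the answer
]
answer_claims = {"founded_1990", "ceo_is_x"}

flagged = audit(spans, answer_claims)
first_error = flagged[0] if flagged else None
print(flagged, first_error)  # prints [3] 3
```

In this toy run, span 2 holds an unsupported hypothesis that never reaches the answer, so it is left unflagged. This mirrors the abstract's distinction between harmful error spans and tentative hypotheses or harmless noise. The earliest flagged index stands in for the first-error prediction.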