TraceGraph: Shared Decision Landscapes for Diagnosing and Improving Agent Trajectories

📅 2026-05-29

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Agent evaluation currently relies mostly on pass rates or reward scores, which rarely reveal how models' decisions differ or why rollouts fail. This work proposes a shared, cross-model decision landscape: released multi-model agent trajectories for each task are pooled into a single state graph, and each rollout is summarized by three events (Access, Trap exposure, and Repair). Overlaying outcomes on the graph identifies productive cores and trap regions. Building on this representation, the authors profile how models navigate the landscape and introduce a lightweight trap-aware recovery pipeline for SWE-bench, in which a runtime detector fires on states matching historical trap regions and continuation policies are then evaluated from the same prefix. On fired states, the best pooled single-factor policy raises the official resolved rate from 40.4% to 43.5% on the per-provider fired subset and from 41.0% to 44.8% on common-fired instances.
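To make the representation concrete, here is a minimal Python sketch of a pooled state graph and the three rollout events the abstract names (Access, Trap exposure, Repair). It is an illustration under stated assumptions, not the authors' implementation: the `Rollout` type, the state labels, the success-rate thresholds used to mark core and trap states, and the exact event definitions are all hypothetical, since the page does not specify them.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Rollout:
    model: str      # kept on the side; ignored while the graph is built
    states: tuple   # hashable keys for observable action-observation states
    resolved: bool  # final task outcome


def build_graph(rollouts):
    """Pool all rollouts for one task into a directed graph (model-agnostic)."""
    edges = defaultdict(int)      # (state, next_state) -> transition count
    visitors = defaultdict(set)   # state -> indices of rollouts that visit it
    for idx, r in enumerate(rollouts):
        for s in r.states:
            visitors[s].add(idx)
        for a, b in zip(r.states, r.states[1:]):
            edges[(a, b)] += 1
    return edges, visitors


def label_regions(rollouts, visitors, min_visits=2, core_rate=0.75, trap_rate=0.25):
    """ASSUMPTION: mark states by the success rate of the rollouts visiting them."""
    core, trap = set(), set()
    for s, idxs in visitors.items():
        if len(idxs) < min_visits:
            continue
        rate = sum(rollouts[i].resolved for i in idxs) / len(idxs)
        if rate >= core_rate:
            core.add(s)
        elif rate <= trap_rate:
            trap.add(s)
    return core, trap


def summarize(rollout, core, trap):
    """ASSUMPTION: Repair means reaching a core state after the first trap state."""
    first_trap = next((i for i, s in enumerate(rollout.states) if s in trap), None)
    exposure = first_trap is not None
    access = any(s in core for s in rollout.states)
    repair = exposure and any(s in core for s in rollout.states[first_trap + 1:])
    return {"access": access, "trap_exposure": exposure, "repair": repair}


# Toy data, invented for illustration only.
direct = ("start", "read_tests", "edit_fix", "tests_pass")
stuck = ("start", "blind_edit", "retry_loop")
rollouts = [
    Rollout("model_a", direct, True),
    Rollout("model_b", direct, True),
    Rollout("model_a", stuck, False),
    Rollout("model_b", stuck + direct[1:], True),  # enters the trap, then recovers
    Rollout("model_c", stuck, False),
    Rollout("model_c", stuck, False),
]

edges, visitors = build_graph(rollouts)
core, trap = label_regions(rollouts, visitors)
profiles = [summarize(r, core, trap) for r in rollouts]
```

In this toy task, two rollouts with the same final outcome (the first and the fourth both resolve) get different profiles: only the fourth shows Trap exposure followed by Repair. That is the kind of process difference a pass rate alone would hide.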

📝 Abstract

Agent benchmarks increasingly record rich interaction trajectories, yet evaluation often reduces each rollout to a pass rate or reward score. We introduce TraceGraph, a graph-based framework that turns released multi-model agent trajectories into shared decision landscapes. For each task, TraceGraph builds a graph over observable action-observation states from pooled rollouts before model identity is introduced. It then overlays outcome-informed productive cores and trap regions, and summarizes each rollout with three events: Access, Trap exposure, and Repair. Across trajectories spanning five benchmark splits, TraceGraph profiles reveal navigation differences hidden by aggregate scores and show that splits differ in whether they reward avoiding traps or recovering from them. The same TraceGraph landscape also motivates a trap-aware recovery pipeline for SWE-bench: a runtime detector fires on states matching historical trap regions, then lightweight continuation policies are evaluated from the same prefix. On fired states, the best pooled single-factor policy raises the official resolved rate from 40.4% to 43.5% on the per-provider fired subset and from 41.0% to 44.8% on common-fired instances, with provider-specific active components. Overall, TraceGraph provides a process vocabulary for asking what agent benchmarks test, where models diverge on a shared landscape, and how failure regions can guide downstream improvement.
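The recovery pipeline described above (a detector that fires on trap-like states, followed by continuation policies evaluated from the same prefix) can be sketched roughly as follows. This is a hedged illustration, not the paper's method: the Jaccard similarity, the 0.6 threshold, the policy names, and the `toy_run` stand-in for resuming an agent are all assumptions introduced here, and the numbers it produces are toy values unrelated to the reported SWE-bench results.

```python
def state_similarity(a, b):
    """ASSUMPTION: states are token sets compared by Jaccard overlap."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def detector_fires(state, trap_states, threshold=0.6):
    """Fire when the current state is close enough to any historical trap state."""
    return any(state_similarity(state, t) >= threshold for t in trap_states)


def best_pooled_policy(fired_prefixes, policies, run_continuation):
    """Evaluate every policy from the same prefixes; pool resolved rates."""
    if not fired_prefixes:
        return None, {}
    resolved = {name: 0 for name in policies}
    for prefix in fired_prefixes:
        for name, policy in policies.items():
            resolved[name] += bool(run_continuation(prefix, policy))
    rates = {name: resolved[name] / len(fired_prefixes) for name in policies}
    return max(rates, key=rates.get), rates


# Toy inputs, invented for illustration only.
trap_states = [frozenset({"edit", "tests_fail", "same_file"})]
prefixes = [
    [{"read"}, {"edit", "tests_fail", "same_file", "retry"}],  # overlap 3/4
    [{"read"}, {"read", "tests_fail"}],                        # overlap 1/4
]
policies = {
    "baseline": None,  # continue unchanged
    "reread_tests": "Re-read the failing test before editing again.",
}


def toy_run(prefix, hint):
    """Hypothetical stand-in for resuming an agent; returns a fake outcome."""
    return hint is not None and "tests_fail" in prefix[-1]


fired = [p for p in prefixes if detector_fires(p[-1], trap_states)]
best, rates = best_pooled_policy(fired, policies, toy_run)
```

Starting every policy from the same prefix keeps the comparison paired: any difference in resolved rate is attributable to the continuation rather than to how the rollout reached the trap.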

Problem

Research questions and friction points this paper is trying to address.

agent evaluation

interaction trajectories

decision landscapes

failure analysis

benchmark interpretation

Innovation

Methods, ideas, or system contributions that make the work stand out.

TraceGraph

decision landscape

trap-aware recovery