AI Summary
Existing monitoring approaches struggle to detect malicious intent in the long-horizon behavior trajectories of large language model agents, particularly when that intent emerges only through multi-step sequences that appear innocuous in isolation and can be recognized only by correlating distant pieces of evidence. To address this challenge, this work proposes TRACE, a framework built around an iterative Triage-Inspect-Judge (TIJ) mechanism. TRACE enables trajectory-level assessment by adaptively segmenting agent behavior, performing context-aware fine-grained inspection, and dynamically aggregating cross-step evidence. Evaluated across ten task domains in the SHADE-Arena benchmark, TRACE achieves an aggregate F1 score of 0.713 and a recall of 0.844, with the largest gains on tasks that demand long-range evidential reasoning.
Abstract
Autonomous LLM agents can pursue hidden malicious objectives through sequences of individually benign actions, making sabotage difficult to detect using standard trajectory-level monitoring. Existing approaches either evaluate complete trajectories in a single pass or partition them into independently scored windows, limiting their ability to connect evidence across temporally distant actions. We propose TRACE, a monitoring framework for long-horizon LLM agent trajectories. TRACE operates through a TIJ (Triage-Inspect-Judge) loop that identifies high-signal regions, performs targeted inspection while maintaining accumulated evidence across reasoning steps, and synthesizes a trajectory-level verdict. We evaluate TRACE on ten task domains from SHADE-Arena against state-of-the-art baselines. TRACE achieves an aggregate F1 of 0.713 and recall of 0.844, with the largest gains on tasks requiring long-range evidence linking.
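To make the control flow of the TIJ loop concrete, the following is a minimal sketch written for illustration only; it is not the authors' implementation. It assumes a trajectory is a list of action strings and replaces the model calls with toy keyword rules. The names `ToyMonitor`, `triage`, `inspect`, `judge`, and `monitor`, the keywords, and the batch size are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ToyMonitor:
    """Illustrative Triage-Inspect-Judge loop; every rule here is an assumption."""
    evidence: list = field(default_factory=list)  # accumulated (step, tag) pairs
    inspected: set = field(default_factory=set)   # step indices already examined

    def triage(self, trajectory, batch_size=2):
        # Select un-inspected steps that look high-signal (toy keyword rule).
        hits = [i for i, action in enumerate(trajectory)
                if i not in self.inspected
                and ("api_key" in action or "http" in action)]
        return hits[:batch_size]

    def inspect(self, trajectory, idx):
        # Examine one step in the context of the evidence gathered so far.
        action = trajectory[idx]
        self.inspected.add(idx)
        if "read" in action and "api_key" in action:
            self.evidence.append((idx, "secret_read"))
        elif "http" in action and any(tag == "secret_read"
                                      for _, tag in self.evidence):
            # Harmless alone, but linked to an earlier, distant step.
            self.evidence.append((idx, "send_after_secret_read"))

    def judge(self):
        # Return a verdict once the evidence suffices, otherwise None.
        tags = {tag for _, tag in self.evidence}
        return "suspicious" if "send_after_secret_read" in tags else None

    def run(self, trajectory, max_iters=5):
        for _ in range(max_iters):
            batch = self.triage(trajectory)
            if not batch:
                break  # nothing left worth inspecting
            for idx in batch:
                self.inspect(trajectory, idx)
            verdict = self.judge()
            if verdict is not None:
                return verdict
        return "benign"


def monitor(trajectory):
    """Run a fresh toy monitor over one trajectory."""
    return ToyMonitor().run(trajectory)
```

In this sketch the outbound request is flagged only because the accumulated evidence already contains a secret read from an earlier step, which is the kind of cross-step linking the abstract describes. A real monitor would presumably replace each keyword rule with a model call.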