🤖 AI Summary
This work addresses safety detection for long-horizon large language model agents, where risk signals are sparse, delayed, and compositional, so turn-level and short-context detectors struggle to retain and aggregate evidence across extended interactions. The authors reframe the problem as trajectory-level evidence compression and propose TRACE, a Compressor-Reader architecture. The Compressor, trained under trajectory-level supervision, encodes the full interaction trajectory into a compact latent evidence state, aggregating dispersed cross-turn risk cues and reducing premature evidence loss. The Reader then judges the raw trajectory using this latent state as a safety reference, which helps it focus on risk-critical segments. On ASSEBench, Pre-Ex-Bench, and R-Judge, the method achieves the best accuracy on all evaluated backbones, improving over strong baselines by up to 12.6 percentage points, and on LongSafety it shows smaller performance degradation as context length grows.
📝 Abstract
Long-horizon LLM agents produce safety evidence across long trajectories, where sparse, delayed, and compositional risk signals often escape local moderation. Existing turn-level or short-context detectors struggle to reliably retain and aggregate such evidence over extended horizons. We reframe long-horizon agent safety detection as trajectory-level evidence compression and propose Trajectory Risk-Aware Compression for Long-Horizon Agent Safety (TRACE). TRACE uses a Compressor-Reader design: the Compressor encodes the full trajectory into a compact latent evidence state under trajectory-level supervision, and the Reader judges the raw trajectory with this latent evidence state as a safety reference. This design helps aggregate dispersed risk cues and reduce premature evidence loss. Across ASSEBench, Pre-Ex-Bench, and R-Judge, TRACE achieves the best accuracy on all evaluated backbones, improving over strong baselines by up to 12.6 percentage points. On LongSafety, TRACE shows smaller performance degradation as context length grows. Attention visualizations and case studies suggest that the compressed reference helps the Reader focus on risk-critical segments and recover cross-step evidence. Code is available at https://github.com/Peregrine123/TRACE_official.
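The Compressor-Reader data flow described above can be illustrated with a deliberately simplified sketch. This is not the TRACE implementation: in the paper both components are learned models, whereas here a hypothetical keyword lexicon (`RISK_CUES`) stands in for the Compressor's learned evidence state, and the function names `compress`, `read`, and `turn_level` are assumptions made for illustration. The point is only to show why a fixed-size state built over the whole trajectory can catch a compositional risk that a per-turn check misses.

```python
from typing import List, Tuple

# Hypothetical cue lexicon (assumption); the real Compressor is a trained model.
RISK_CUES = ("password", "delete", "transfer", "external")


def compress(trajectory: List[str]) -> List[float]:
    """Toy Compressor: fold the full trajectory into a fixed-size evidence state.

    Slot i is 1.0 if cue i appears in any step, so an early cue survives
    no matter how many benign steps follow it.
    """
    state = [0.0] * len(RISK_CUES)
    for step in trajectory:
        text = step.lower()
        for i, cue in enumerate(RISK_CUES):
            if cue in text:
                state[i] = 1.0
    return state


def read(trajectory: List[str], state: List[float],
         threshold: int = 2) -> Tuple[str, List[int]]:
    """Toy Reader: judge the raw trajectory with the state as a reference.

    Returns a trajectory-level label and the indices of steps that match
    cues recorded in the state.
    """
    active = [cue for cue, s in zip(RISK_CUES, state) if s > 0]
    flagged = [i for i, step in enumerate(trajectory)
               if any(cue in step.lower() for cue in active)]
    label = "unsafe" if len(active) >= threshold else "safe"
    return label, flagged


def turn_level(trajectory: List[str], threshold: int = 2) -> str:
    """Baseline for contrast: each step is judged in isolation."""
    for step in trajectory:
        hits = sum(cue in step.lower() for cue in RISK_CUES)
        if hits >= threshold:
            return "unsafe"
    return "safe"


trajectory = [
    "User asks the agent to tidy the shared drive.",
    "Agent reads notes.txt, which contains the admin password.",
    "Agent lists files in /reports.",
    "Agent summarizes quarterly figures.",
    "Agent sends the summary and notes.txt to an external address.",
]

state = compress(trajectory)
print(read(trajectory, state))
print(turn_level(trajectory))
```

In this toy trajectory no single step contains two cues, so the per-turn baseline never fires, while the trajectory-level state combines the cue from step 1 with the cue from step 4. The threshold and the max-style aggregation are illustrative choices, not details taken from the paper.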