🤖 AI Summary
This work addresses a critical gap in large language model (LLM) safety by introducing the novel task of *predictive monitoring*, which aims to proactively identify whether an LLM will engage in unethical behavior based on partial action trajectories. To facilitate research in this direction, the authors construct PreActBench, a benchmark dataset comprising 1,000 paired ethical and unethical trajectories across diverse domains, and propose the Prefix Foresight F1 metric to evaluate predictive performance. Experimental results reveal that humans excel at this foresight task, whereas even state-of-the-art LLMs and dedicated safety guardrail models lag significantly behind, highlighting their current limitations in anticipating risks before harmful actions occur. This study thus establishes a foundational framework for pre-emptive safety interventions in LLMs, filling a key void in prior research that has predominantly focused on post-hoc detection mechanisms.
📝 Abstract
Large language models (LLMs) are increasingly deployed as autonomous agents capable of executing multi-step action trajectories toward a given objective. While existing safety research has focused on detecting unethical behavior from complete trajectories, this paradigm is fundamentally retrospective: it identifies harm only after it has already occurred. In this work, we study a critical yet overlooked safety task, which we term Predictive Monitoring: given only a partial action trajectory, can a model infer whether it will culminate in an unethical action before the overt action is executed? To support this task, we present PreActBench, a benchmark of 1,000 paired ethical and unethical action trajectories spanning five domains. We evaluate a range of LLMs, safety guardrail models, and latent probing methods across varying fractions of the action trajectory using our Prefix Foresight F1 metric. Results show that while humans achieve promising performance, predictive monitoring remains challenging even for strong models, highlighting the need for future-oriented risk reasoning in LLM safety.