🤖 AI Summary
Existing safety monitors score one agent context at a time, so they are structurally blind to distributed misuse, in which an attacker splits a harmful task across many user accounts so that each individual transcript looks benign. This work builds what is, to the authors' knowledge, the first distributed agent attack: a multi-agent scaffold that completes hard cybersecurity tasks while hiding the harmful objective across subagents with limited contexts, and which a standard monitor catches only a fifth as often as prior agent attacks. As a defense, it proposes an online stateful monitor that clusters weak suspiciousness signals from many agent transcripts in real time and escalates only rarely to a language model that judges misuse across user accounts. In evaluations on large-scale simulated datacenter traffic, the monitor Pareto-dominates standard monitors, catches distributed attacks 30% earlier, and adds negligible latency for roughly 99% of user traffic, although its advantage narrows as benign background traffic grows very large. After red-teaming, the improved defense also catches standard jailbreaks, because adaptive attackers reuse attack variants across accounts.
📝 Abstract
Language models can find thousands of severe software vulnerabilities, and agents are increasingly being misused for cyberattacks. To avoid detection, attackers frequently distribute their misuse, splitting a harmful task across many user accounts so each individual transcript looks benign. Because safety monitors score only one agent context at a time, they are structurally blind to misuse that is only visible in aggregate, across many accounts. We show this gap is real by building, to our knowledge, the first distributed agent attack, a multi-agent scaffold that completes hard cybersecurity tasks while hiding the harmful objective across subagents with limited contexts, evading a standard monitor that catches it only a fifth as often as prior agent attacks. Towards a defense, we develop an online stateful monitor that uses real-time clustering to collect weak suspiciousness signals across many agent transcripts, and escalates only rarely to a language model that flags misuse across user accounts. In evaluations with large-scale simulated datacenter traffic, our monitor Pareto dominates standard monitors, catching distributed attacks 30% earlier and flagging cyber misuse before it reaches the most harmful stages. Crucially, this comes at negligible additional latency for ~99% of user traffic. This detection advantage persists but narrows as the benign background traffic grows very large. After an extensive red-teaming exercise, we improve the defense and surprisingly also find that it catches standard jailbreaks, since adaptive attackers reuse attack variants across accounts. Our results point toward a new class of safety monitors which reason over groups of users rather than isolated transcripts.
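To make the monitoring idea concrete, the following is a minimal illustrative sketch and not the authors' implementation. It assumes each transcript arrives with a feature vector and a weak suspiciousness score in [0, 1]. The names `StatefulMonitor`, `Cluster`, and `observe`, the cosine-similarity clustering rule, and all thresholds are hypothetical choices made for illustration. The escalation step to a language model is represented only by the cluster that `observe` returns.

```python
import math
from dataclasses import dataclass, field


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


@dataclass
class Cluster:
    """Running state for one group of similar transcripts (hypothetical)."""
    centroid: list
    count: int = 0
    score_sum: float = 0.0            # accumulated weak suspiciousness
    accounts: set = field(default_factory=set)
    escalated: bool = False

    def add(self, vec, score, account):
        self.count += 1
        # Incremental update of the running-mean centroid.
        self.centroid = [c + (v - c) / self.count
                         for c, v in zip(self.centroid, vec)]
        self.score_sum += score
        self.accounts.add(account)


class StatefulMonitor:
    """Online monitor that aggregates weak signals across accounts.

    All thresholds are assumed values, not taken from the paper.
    """

    def __init__(self, sim_threshold=0.9, score_threshold=1.0, min_accounts=2):
        self.sim_threshold = sim_threshold
        self.score_threshold = score_threshold
        self.min_accounts = min_accounts
        self.clusters = []

    def observe(self, account, vec, score):
        """Ingest one transcript; return its cluster if it should be
        escalated (for example to a language model), else None."""
        best, best_sim = None, self.sim_threshold
        for cluster in self.clusters:
            sim = cosine(cluster.centroid, vec)
            if sim >= best_sim:
                best, best_sim = cluster, sim
        if best is None:
            # No sufficiently similar cluster exists, so start a new one.
            best = Cluster(centroid=list(vec))
            self.clusters.append(best)
        best.add(vec, score, account)
        # Escalate at most once per cluster, and only when the aggregate
        # signal is strong and spans several distinct accounts.
        if (not best.escalated
                and best.score_sum >= self.score_threshold
                and len(best.accounts) >= self.min_accounts):
            best.escalated = True
            return best
        return None
```

In this sketch, no single transcript with a score of 0.4 would trigger escalation, but three similar transcripts from different accounts would, which mirrors the aggregate-only visibility the abstract describes. Most traffic never reaches the escalation branch, so the expensive model call stays rare.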