🤖 AI Summary
This work addresses the vulnerability of autonomous AI web agents to social engineering attacks, which can cause them to leak users' personally identifiable information (PII). To this end, the authors introduce Scammer4U, the first pre-registered, multi-dimensional benchmark for social engineering attacks, comprising 91 adversarial scenarios and 10 benign controls organized by an 8-axis taxonomy. An independent LLM judge assesses whether the agent's reasoning flagged a site as suspicious. Without privacy guidance, critical-tier PII disclosure rates reach 54%–93% in attack scenarios, compared to 0% in benign controls. Notably, in sessions where the agent's reasoning flags the site as suspicious, critical PII is still submitted 35.9% of the time (versus 66.1% when no suspicion is voiced), exposing a “detection-to-action gap” that shows why defenses relying on the agent's own recognition of an attack are inadequate. The study further finds that escalating prompt-level mitigations yields model-dependent reductions that remain insufficient to reliably prevent critical PII submission.
📝 Abstract
Deceptive web content, widespread across the internet and commonly known as \textit{social-engineering attacks}, manipulates autonomous web agents into submitting users' personally identifiable information (PII) to attacker-controlled endpoints. In this paper, we show that social-engineering attacks are highly effective at extracting critical-tier PII from frontier web agents, posing a severe risk to deployed agentic systems. To quantify this risk, we introduce \textbf{\textsc{Scammer4U}}, a pre-registered benchmark of 91 attacker-controlled environments and 10 benign-twin baselines, spanning 8 attack vectors and 16 site categories and organized by an 8-axis factorial taxonomy that isolates the causal contribution of individual attack design factors. Across frontier agents, we find that critical-tier PII leakage reaches 54--93\% under no privacy guidance, compared to 0\% on benign-twin baselines, confirming that leakage is attack-attributable rather than incidental form-filling. Escalating prompt-level mitigation yields sharply model-dependent reductions across the four model families evaluated and remains insufficient to reliably prevent critical PII submission at the pooled level. Most critically, we identify a detection--action gap: in sessions where an independent LLM judge confirms that the agent's reasoning flagged the site as suspicious, agents still submit critical PII 35.9\% of the time, versus 66.1\% when no suspicion is verbalized, a gap of 30.2 percentage points that is robust across all four model families. Our findings reveal that defenses conditioned on the agent's own recognition of an attack are gating on the wrong signal, motivating output-level interception of outbound submissions that operates independently of the agent's reasoning loop.
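The output-level interception that the abstract argues for can be illustrated with a minimal sketch. This is not the authors' defense: the PII patterns, the trusted-host allow-list, and the function names `find_critical_pii` and `allow_submission` are all hypothetical, chosen only to show a check that inspects the outbound payload and its destination without consulting the agent's reasoning.

```python
import re
from urllib.parse import urlparse

# Hypothetical patterns for "critical-tier" PII. The benchmark's actual tier
# definitions are not given in the abstract, so these are assumptions.
CRITICAL_PII_PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "card_number": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
}

# Hypothetical allow-list of hosts the user has explicitly approved.
TRUSTED_HOSTS = {"bank.example.com"}


def find_critical_pii(form_data):
    """Return the sorted names of critical PII types found in the form values."""
    hits = set()
    for value in form_data.values():
        for name, pattern in CRITICAL_PII_PATTERNS.items():
            if pattern.search(str(value)):
                hits.add(name)
    return sorted(hits)


def allow_submission(url, form_data):
    """Decide whether an outbound form submission may proceed.

    The decision depends only on the destination host and the payload,
    never on whether the agent itself reported suspicion.
    """
    host = urlparse(url).hostname or ""
    if host in TRUSTED_HOSTS:
        return True
    return not find_critical_pii(form_data)
```

Under these assumptions, a submission carrying an SSN-shaped value to an unapproved host is blocked even if the agent's reasoning never mentioned suspicion, while the same payload to an approved host passes. A real deployment would need far more robust PII detection than two regular expressions.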