Who Pays the Price? Stakeholder-Centric Prompt Injection Benchmarking for Real-world Web Agents

📅 2026-06-11

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Current evaluations of prompt injection attacks focus mainly on attack feasibility and overlook the heterogeneous harms inflicted on different stakeholders, so they fail to capture the actual risk landscape of real-world web agents. This work proposes a stakeholder-centric evaluation framework that distinguishes affected entities (e.g., user, seller, platform), decomposes attacks into concrete objectives, and combines outcome-level and process-level metrics to enable attributable, interpretable vulnerability assessment. The framework surfaces three distinct harm patterns: stealthy parasitism, misaligned disruption, and compounded failure. Empirical results show that current web agents do not reliably resist any of the attack objectives, and that their failures span qualitatively different modes that conventional evaluation misses, underscoring the need for stakeholder-aware assessment.

📝 Abstract

Web agents driven by large language models (LLMs) are increasingly deployed in real-world environments, where they operate over untrusted web content and execute actions with direct consequences. This makes them vulnerable to prompt-injection attacks, in which seemingly benign content embeds adversarial instructions that manipulate agent behaviour. Existing security benchmarks adopt an attack-centric perspective, focusing on the technical feasibility of injections while overlooking how the resulting harms are distributed. In practice, however, prompt-injection risk is victim-dependent: a single exploit can produce asymmetric consequences for different stakeholders, and the same attack pattern may exhibit substantially different effectiveness depending on whom it targets. To capture these properties, we introduce \sysname, a stakeholder-centric benchmark that systematically categorizes and attributes harm in real-world web agent systems. It distinguishes between affected entities (e.g., user, seller, platform), decomposes attacks into concrete objectives, and evaluates each case with complementary outcome- and process-level metrics. Our results reveal substantial and heterogeneous vulnerabilities: not a single attack objective is reliably resisted by current agents, and failures are distributed across qualitatively distinct modes, ranging from stealthy parasitism (the attack succeeds without disrupting the user's delegated task) to misaligned disruption (the task is disrupted without attack success) and compounded failure (both the adversarial objective and task integrity are violated simultaneously). These patterns are missed by conventional evaluation, highlighting the need for stakeholder-aware assessment of LLM-based agents in real-world deployments. The benchmark is available at https://github.com/StakeBench/SBC.
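The three failure modes named in the abstract can be read as combinations of two per-episode outcomes: whether the adversarial objective was achieved and whether the user's delegated task was still completed. The following is a minimal sketch of that reading, not the benchmark's actual scoring code. The names `FailureMode`, `classify`, and `tally`, the boolean inputs, and the `RESISTED` label for the no-failure case are all assumptions introduced here for illustration.

```python
from collections import Counter
from enum import Enum


class FailureMode(Enum):
    # RESISTED is a hypothetical label for the case the abstract does not name.
    RESISTED = "resisted"
    STEALTHY_PARASITISM = "stealthy parasitism"
    MISALIGNED_DISRUPTION = "misaligned disruption"
    COMPOUNDED_FAILURE = "compounded failure"


def classify(attack_succeeded: bool, task_completed: bool) -> FailureMode:
    """Map one injected episode to a failure mode (illustrative only)."""
    if attack_succeeded and task_completed:
        # Attack succeeds without disrupting the user's delegated task.
        return FailureMode.STEALTHY_PARASITISM
    if attack_succeeded and not task_completed:
        # Adversarial objective achieved and task integrity violated.
        return FailureMode.COMPOUNDED_FAILURE
    if not attack_succeeded and not task_completed:
        # Task disrupted even though the attack itself did not succeed.
        return FailureMode.MISALIGNED_DISRUPTION
    return FailureMode.RESISTED


def tally(episodes: list[tuple[bool, bool]]) -> Counter:
    """Count failure modes over (attack_succeeded, task_completed) pairs."""
    return Counter(classify(a, t) for a, t in episodes)
```

An attack-success rate alone would merge stealthy parasitism with compounded failure and would not register misaligned disruption at all, which is the gap the abstract attributes to conventional evaluation.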

Problem

Research questions and friction points this paper is trying to address.

prompt injection

stakeholder-centric

web agents

LLM security

harm attribution

Innovation

Methods, ideas, or system contributions that make the work stand out.

stakeholder-centric

prompt injection

web agents