A New Framework for Cybersecurity Refusals in AI Agents

📅 2026-05-31

🤖 AI Summary
This study addresses a critical gap: current AI agents often fail to refuse harmful requests in offensive cybersecurity tasks, where an overemphasis on task completion overrides safety considerations. We formally define the refusal boundary in this context, propose evaluable refusal criteria and a task taxonomy, and introduce the first evaluation framework designed specifically to assess AI refusal behavior in offensive security scenarios. Using LLM-based agents, we evaluate eight state-of-the-art models under both benign and adversarial conditions across a range of web-based offensive security scenarios. Only GPT-5.2 and GPT-5.1 Codex demonstrate meaningful refusal behavior; the remaining six models show near-zero refusal rates, which points to a severe deficiency in current safety alignment for offensive cybersecurity applications.
📝 Abstract
Agentic scaffolds have dramatically improved LLM performance on complex, long-horizon tasks, yielding both broad benefits and amplified risks in domains like cybersecurity. Existing benchmarks for AI agents in cybersecurity focus mainly on measuring proficiency (how effectively agents can complete offensive security tasks) but neglect a critical question: when and how should agents refuse harmful requests? We present the first framework for establishing refusal boundaries in offensive security contexts. Our framework defines (1) principled criteria for when tasks should be refused, (2) categories of tasks that warrant refusal, and (3) an evaluation methodology for measuring agent robustness under both benign and adversarial conditions. We apply this framework to assess how well current LLM-powered agents adhere to appropriate refusal boundaries across a range of web-based offensive security scenarios, finding that 6 of 8 frontier models tested show near-zero refusal rates, with only 2 models (GPT-5.2 and GPT-5.1 Codex) demonstrating any meaningful refusal behavior.
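To make the "refusal rate" metric concrete, here is a minimal sketch of how per-model refusal rates might be tallied under benign and adversarial conditions. It is not the authors' evaluation code: the `Trial` record, the `refusal_rates` helper, the model labels, and the sample records are all hypothetical and exist only for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Trial:
    """One agent run on a task that should be refused (hypothetical schema)."""
    model: str       # placeholder model label, not a model from the paper
    condition: str   # "benign" or "adversarial" prompting condition
    refused: bool    # True if the agent declined the task


def refusal_rates(trials):
    """Return {(model, condition): fraction of trials in which the agent refused}."""
    counts = defaultdict(lambda: [0, 0])  # key -> [refusals, total trials]
    for t in trials:
        key = (t.model, t.condition)
        counts[key][0] += int(t.refused)
        counts[key][1] += 1
    return {key: refused / total for key, (refused, total) in counts.items()}


# Made-up records for illustration only; these are NOT results from the paper.
trials = [
    Trial("model-a", "benign", True),
    Trial("model-a", "benign", True),
    Trial("model-a", "adversarial", True),
    Trial("model-a", "adversarial", False),
    Trial("model-b", "benign", False),
    Trial("model-b", "adversarial", False),
]

rates = refusal_rates(trials)
for (model, condition), rate in sorted(rates.items()):
    print(f"{model:8s} {condition:12s} refusal rate = {rate:.2f}")
```

Keeping the two conditions as separate keys makes it easy to see whether a model that refuses under plain prompting still refuses when the request is adversarially reframed.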
Problem

Research questions and friction points this paper is trying to address.

AI agents
cybersecurity
refusal boundaries
harmful requests
offensive security
Innovation

Methods, ideas, or system contributions that make the work stand out.

refusal framework
AI agent safety
offensive cybersecurity
LLM alignment
adversarial robustness
Eliot Krzysztof Jones
Gray Swan AI
Mateusz Dziemian
Gray Swan AI
Matt Fredrikson
Carnegie Mellon University
Security and Privacy, Fair & Trustworthy AI, Formal Methods
J Zico Kolter
Gray Swan AI, Carnegie Mellon University