A New Framework for Cybersecurity Refusals in AI Agents

📅 2026-05-31

📈 Citations: 0

✨ Influential: 0

career value

207K/year

🤖 AI Summary

This study addresses the critical gap in current AI agents’ inability to appropriately refuse harmful requests in offensive cybersecurity tasks, where an overemphasis on task completion often overrides safety considerations. We formally define the refusal boundary in this context for the first time, propose evaluable refusal criteria and a taxonomy, and introduce the first evaluation framework specifically designed to assess AI refusal behavior in offensive security scenarios. Leveraging large language model–based agent architectures, we conduct adversarial testing and robustness evaluations across diverse cyberattack settings on eight state-of-the-art models. Our findings reveal that only GPT-5.2 and GPT-5.1 Codex demonstrate meaningful refusal capabilities, while the remaining six models exhibit virtually no refusal behavior, underscoring a severe deficiency in current models’ safety alignment for offensive cybersecurity applications.

📝 Abstract

Agentic scaffolds have dramatically improved LLM performance on complex, long-horizon tasks, yielding both broad benefits and amplified risks in domains like cybersecurity. Existing benchmarks for AI agents in cybersecurity focus mainly on measuring proficiency--how effectively agents can complete offensive security tasks--but neglect a critical question: when and how should agents refuse harmful requests? We present the first framework for establishing refusal boundaries in offensive security contexts. Our framework defines (1) principled criteria for when tasks should be refused, (2) categories of tasks that warrant refusal, and (3) evaluation methodology for measuring agent robustness under both benign and adversarial conditions. We apply this framework to assess how current LLM-powered agents adhere to appropriate refusal boundaries across a range of web-based offensive security scenarios, finding that 6 of 8 frontier models tested show near-zero refusal rates, with only 2 models (GPT-5.2 and GPT-5.1 Codex) demonstrating any meaningful refusal behavior.

Problem

Research questions and friction points this paper is trying to address.

AI agents

cybersecurity

refusal boundaries

harmful requests

offensive security

Innovation

Methods, ideas, or system contributions that make the work stand out.

refusal framework

AI agent safety

offensive cybersecurity