CyberGym-E2E: Scalable Real-World Benchmark for AI Agents' End-to-End Cybersecurity Capabilities

📅 2026-06-03

🤖 AI Summary
Existing evaluations of AI systems in cybersecurity are limited in scale or scope and do not cover the full lifecycle of real-world vulnerability discovery and remediation. This work proposes CyberGym-E2E, a scalable and realistic end-to-end benchmark for AI agents' cybersecurity capabilities. An automated, agent-enhanced pipeline turns open-source vulnerability data into evaluation environments; the benchmark currently covers 920 real-world vulnerabilities across 139 open-source projects. It evaluates agents across three stages: vulnerability discovery, proof-of-concept (PoC) generation, and patch generation.
📝 Abstract
AI has the potential to transform cybersecurity by enabling systems that can autonomously detect, analyze, and remediate software vulnerabilities. However, existing cybersecurity evaluations of AI systems are limited in scale or scope, and fail to capture the end-to-end lifecycle of real-world software vulnerability discovery and remediation. To address this gap, we propose CyberGym-E2E, a large-scale and realistic end-to-end cybersecurity benchmark that comprehensively evaluates AI agents' abilities across the full lifecycle of vulnerability discovery, PoC generation, and patch generation. CyberGym-E2E is comprehensive and scalable, as we build an automated, agent-enhanced pipeline for transforming open-source vulnerability data into realistic evaluation environments. Currently, the benchmark consists of 920 real-world vulnerabilities across 139 different open-source projects.
Problem

Research questions and friction points this paper is trying to address.

cybersecurity
AI agents
vulnerability discovery
end-to-end evaluation
real-world benchmark
Innovation

Methods, ideas, or system contributions that make the work stand out.

end-to-end cybersecurity
AI agents
vulnerability discovery
automated benchmarking
real-world evaluation
Authors

Tianneng Shi, UC Berkeley
Robin Rheem, UC Berkeley
Dongwei Jiang, Johns Hopkins University
Mona Wang, UC Berkeley
Francisco De La Riega, UC Berkeley
Zhun Wang, Graduate Student, UC Berkeley
Jingzhi Jiang, UC Berkeley
Alexander Cheung, UC Berkeley
Sean Tai, UC Berkeley
Jonah Cha, UC Berkeley
Jianhong Tu, UC Santa Cruz
Gabriel Han, UC Berkeley
Chenguang Wang, UC Santa Cruz
Jingxuan He, UC Berkeley (Security, Machine Learning, Programming Languages)
Wenbo Guo, UC Santa Barbara (Machine Learning, Security)
Dawn Song, Professor of Computer Science, UC Berkeley (Computer Security and Privacy)