🤖 AI Summary
Current evaluations of AI systems in cybersecurity are limited in scale or scope and do not cover the full lifecycle of real-world vulnerability discovery and remediation. This work proposes CyberGym-E2E, the first scalable, operationally realistic end-to-end benchmark for assessing AI-driven cybersecurity capabilities. Using an automated pipeline, the benchmark dynamically constructs realistic environments for vulnerability reproduction and patching from 920 real-world vulnerabilities across 139 open-source projects. It enables systematic, reproducible evaluation of AI agents across three stages: vulnerability discovery, proof-of-concept (PoC) generation, and patch generation.
📝 Abstract
AI has the potential to transform cybersecurity by enabling systems that autonomously detect, analyze, and remediate software vulnerabilities. However, existing cybersecurity evaluations of AI systems are limited in scale or scope and fail to capture the end-to-end lifecycle of real-world software vulnerability discovery and remediation. To address this gap, we propose CyberGym-E2E, a large-scale, realistic end-to-end cybersecurity benchmark that evaluates AI agents across the full lifecycle of vulnerability discovery, proof-of-concept (PoC) generation, and patch generation. To make the benchmark scalable, we build an automated, agent-enhanced pipeline that transforms open-source vulnerability data into realistic evaluation environments. The benchmark currently consists of 920 real-world vulnerabilities across 139 open-source projects.