CyberGym: Evaluating AI Agents' Cybersecurity Capabilities with Real-World Vulnerabilities at Scale

📅 2025-06-03
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing LLM-based agent security evaluation frameworks lack coverage of real-world vulnerabilities and end-to-end reproducibility. Method: We introduce the first large-scale AI security benchmark grounded in authentic vulnerabilities—comprising 1,507 patched vulnerabilities across 188 open-source projects—with the core task of generating executable proof-of-concept (PoC) exploits from natural-language descriptions and source code. Our framework systematically transforms historical vulnerabilities into reproducible, scalable evaluation instances, leveraging a multi-agent architecture (e.g., OpenHands), state-of-the-art LLMs (e.g., Claude-3.7-Sonnet), and cross-file code reasoning capabilities to automate the full pipeline from vulnerability comprehension to PoC generation. Contribution/Results: Under optimal configuration, current LLM agents achieve only 11.9% vulnerability reproduction rate, revealing fundamental limitations in complex security reasoning. Notably, our agents independently discovered 15 zero-day vulnerabilities affecting up-to-date software versions—demonstrating both the benchmark’s rigor and unintended discovery potential.

Technology Category

Application Category

📝 Abstract
Large language model (LLM) agents are becoming increasingly skilled at handling cybersecurity tasks autonomously. Thoroughly assessing their cybersecurity capabilities is critical and urgent, given the high stakes in this domain. However, existing benchmarks fall short, often failing to capture real-world scenarios or being limited in scope. To address this gap, we introduce CyberGym, a large-scale and high-quality cybersecurity evaluation framework featuring 1,507 real-world vulnerabilities found and patched across 188 large software projects. While it includes tasks of various settings, CyberGym primarily focuses on the generation of proof-of-concept (PoC) tests for vulnerability reproduction, based on text descriptions and corresponding source repositories. Solving this task is particularly challenging, as it requires comprehensive reasoning across entire codebases to locate relevant code fragments and produce effective PoCs that accurately trigger the target vulnerability starting from the program's entry point. Our evaluation across 4 state-of-the-art agent frameworks and 9 LLMs reveals that even the best combination (OpenHands and Claude-3.7-Sonnet) achieves only a 11.9% reproduction success rate, mainly on simpler cases. Beyond reproducing historical vulnerabilities, we find that PoCs generated by LLM agents can reveal new vulnerabilities, identifying 15 zero-days affecting the latest versions of the software projects.
Problem

Research questions and friction points this paper is trying to address.

Assessing LLM agents' cybersecurity capabilities with real-world vulnerabilities
Generating proof-of-concept tests for vulnerability reproduction from descriptions
Evaluating agent frameworks and LLMs on large-scale cybersecurity tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Large-scale framework with 1,507 real vulnerabilities
Focuses on PoC generation from text descriptions
Evaluates 4 agent frameworks and 9 LLMs