SecureAgentBench: Benchmarking Secure Code Generation under Realistic Vulnerability Scenarios

📅 2025-09-26
🤖 AI Summary
Existing evaluations of large language model (LLM)-based code agents lack realism and comprehensiveness in assessing secure code generation: mainstream benchmarks ignore multi-file vulnerability contexts, the risk of introducing new vulnerabilities, and joint verification of functional correctness and security. Method: SecureAgentBench is a benchmark of 105 tasks grounded in real-world open-source vulnerability scenarios. Each task requires multi-file editing, is aligned with the precisely identified point at which the vulnerability was introduced, and is evaluated with functional tests plus rigorous security validation, including proof-of-concept (PoC) exploit execution and static analysis to detect newly introduced vulnerabilities. Contribution/Results: The benchmark provides a unified framework that jointly measures functional correctness and security robustness. A systematic evaluation of SWE-agent, OpenHands, and Aider, each paired with Claude 3.7 Sonnet, GPT-4.1, and DeepSeek-V3.1, shows that even the best configuration (SWE-agent with DeepSeek-V3.1) produces fixes that are both functionally correct and free of new vulnerabilities only 15.2% of the time; some agents generate functionally correct code that nonetheless introduces vulnerabilities, including previously unrecorded ones; and adding explicit security instructions yields negligible security improvement.
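The evaluation protocol described above can be sketched as a single verdict function: a patch counts as a success only if the repository's functional tests pass, the proof-of-concept exploit no longer succeeds, and static analysis reports no newly introduced vulnerabilities. This is a minimal illustration of that three-way conjunction; the `PatchResult` fields and function names are assumptions for exposition, not the paper's actual harness.

```python
from dataclasses import dataclass

@dataclass
class PatchResult:
    functional_tests_passed: bool  # repository test suite passes after the edit
    poc_exploit_blocked: bool      # the PoC exploit no longer succeeds
    new_vulnerabilities: int       # count flagged by static analysis on the diff

def is_correct_and_secure(r: PatchResult) -> bool:
    """A patch succeeds only if all three checks pass jointly."""
    return (r.functional_tests_passed
            and r.poc_exploit_blocked
            and r.new_vulnerabilities == 0)

# Illustrative outcomes mirroring the failure modes the paper reports:
results = [
    PatchResult(True, True, 0),   # correct and secure
    PatchResult(True, False, 0),  # functional, but still exploitable
    PatchResult(True, True, 2),   # functional, but introduces new flaws
]
rate = sum(is_correct_and_secure(r) for r in results) / len(results)
print(f"correct-and-secure rate: {rate:.1%}")
```

Requiring the conjunction rather than any single check is what distinguishes this benchmark's metric from functionality-only or security-only protocols: a patch that passes tests but leaves the exploit live, or that fixes the exploit while adding a new flaw, is scored as a failure.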

📝 Abstract
Large language model (LLM) powered code agents are rapidly transforming software engineering by automating tasks such as testing, debugging, and repairing, yet the security risks of their generated code have become a critical concern. Existing benchmarks have offered valuable insights but remain insufficient: they often overlook the genuine context in which vulnerabilities were introduced or adopt narrow evaluation protocols that fail to capture either functional correctness or newly introduced vulnerabilities. We therefore introduce SecureAgentBench, a benchmark of 105 coding tasks designed to rigorously evaluate code agents' capabilities in secure code generation. Each task includes (i) realistic task settings that require multi-file edits in large repositories, (ii) aligned contexts based on real-world open-source vulnerabilities with precisely identified introduction points, and (iii) comprehensive evaluation that combines functionality testing, vulnerability checking through proof-of-concept exploits, and detection of newly introduced vulnerabilities using static analysis. We evaluate three representative agents (SWE-agent, OpenHands, and Aider) with three state-of-the-art LLMs (Claude 3.7 Sonnet, GPT-4.1, and DeepSeek-V3.1). Results show that (i) current agents struggle to produce secure code, as even the best-performing one, SWE-agent supported by DeepSeek-V3.1, achieves merely 15.2% correct-and-secure solutions, (ii) some agents produce functionally correct code but still introduce vulnerabilities, including new ones not previously recorded, and (iii) adding explicit security instructions for agents does not significantly improve secure coding, underscoring the need for further research. These findings establish SecureAgentBench as a rigorous benchmark for secure code generation and a step toward more reliable software development with LLMs.
Problem

Research questions and friction points this paper is trying to address.

Benchmarking secure code generation under realistic vulnerability scenarios
Evaluating code agents' capabilities in producing secure code solutions
Assessing vulnerability introduction during automated code generation tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Realistic multi-file editing in large repositories
Aligned contexts from real-world open-source vulnerabilities
Comprehensive evaluation combining functionality and vulnerability testing