SecRepoBench: Benchmarking LLMs for Secure Code Generation in Real-World Repositories

📅 2025-04-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limited capability of large language models (LLMs) to generate secure code in realistic, multi-file repository contexts. To this end, we introduce SecRepoBench—the first repository-level, context-sensitive benchmark for secure code generation—comprising 318 C/C++ tasks across 27 real-world open-source repositories and covering 15 CWE vulnerability categories. Methodologically, we extend security evaluation from single-file to repository-level for the first time, proposing a repository-aware agent-based generation paradigm that integrates program slicing, vulnerability context modeling, and CWE-driven task construction, alongside a multi-model zero-/few-shot evaluation framework. Experimental results reveal: (1) 19 state-of-the-art models achieve only 31.7% average secure correctness; (2) performance on conventional benchmarks shows no significant correlation with repository-level secure code generation ability; and (3) SecRepoBench emerges as the most challenging benchmark for secure coding evaluation to date.
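The headline metric pairs functional correctness with security: a model's completion counts only if it both passes the task's tests and avoids the vulnerability (e.g. no sanitizer crash on a proof-of-vulnerability input). A minimal sketch of such a scoring rule is below; the `TaskResult` record and its field names are illustrative assumptions, not SecRepoBench's actual schema.

```python
from dataclasses import dataclass

# Hypothetical per-task result record; field names are assumptions,
# not SecRepoBench's actual data format.
@dataclass
class TaskResult:
    task_id: str
    cwe: str                      # e.g. "CWE-787" (out-of-bounds write)
    passes_functional_tests: bool
    passes_security_tests: bool   # e.g. no sanitizer crash on the PoV input

def secure_correct_rate(results: list[TaskResult]) -> float:
    """Fraction of tasks where the completion is both correct AND secure."""
    if not results:
        return 0.0
    ok = sum(
        1 for r in results
        if r.passes_functional_tests and r.passes_security_tests
    )
    return ok / len(results)

results = [
    TaskResult("repo-a/buffer_fill", "CWE-787", True, True),
    TaskResult("repo-b/parse_header", "CWE-125", True, False),  # correct but vulnerable
    TaskResult("repo-c/free_node", "CWE-416", False, False),    # incorrect
]
print(f"secure-correct rate: {secure_correct_rate(results):.1%}")  # → 33.3%
```

Under this rule, code that compiles and passes tests but remains exploitable scores zero, which is why repository-level secure-correctness numbers sit well below plain pass rates.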

📝 Abstract
This paper introduces SecRepoBench, a benchmark for evaluating LLMs on secure code generation in real-world repositories. SecRepoBench contains 318 code generation tasks drawn from 27 C/C++ repositories, covering 15 CWEs. We evaluate 19 state-of-the-art LLMs on our benchmark and find that the models struggle to generate code that is both correct and secure. In addition, the performance of LLMs at generating self-contained programs, as measured by prior benchmarks, does not translate into comparable performance at generating secure and correct code at the repository level in SecRepoBench. We show that state-of-the-art prompt engineering techniques become less effective when applied to repository-level secure code generation. We conduct extensive experiments, including an agentic technique for generating secure code, to demonstrate that our benchmark is currently the most difficult secure coding benchmark compared to previous state-of-the-art benchmarks. Finally, our comprehensive analysis provides insights into potential directions for enhancing the ability of LLMs to generate correct and secure code in real-world repositories.
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLMs for secure code generation in real repositories
Assessing LLM struggles with correct and secure repository-level coding
Testing prompt engineering limits in repository-level secure code generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Benchmarking LLMs for secure code generation
Evaluating 19 LLMs on 318 C/C++ tasks
Agentic technique for repository-level security
Connor Dilgren
University of Maryland
Purva Chiniya
Amazon
Luke Griffith
University of Maryland
Yu Ding
Google DeepMind
Yizheng Chen
University of Maryland
AI Security · Large Language Models · Vulnerability Detection