Benchmarking Prompt Engineering Techniques for Secure Code Generation with GPT Models

📅 2025-02-09

🤖 AI Summary
Problem: How prompt engineering techniques affect the security of code generated by large language models (LLMs) has seen little quantitative evaluation. Method: The authors build an automated security benchmark that combines two peer-reviewed prompt datasets with static analysis tools (Semgrep, CodeQL) and use it to evaluate GPT-3.5-turbo, GPT-4o, and GPT-4o-mini. They also introduce a “prompt agent” that combines a security-focused prompt prefix with multi-turn iterative repair. Contribution/Results: A security-focused prompt prefix reduces the occurrence of vulnerabilities in GPT-4o and GPT-4o-mini output by up to 56%. With iterative prompting, all tested models detect and repair between 41.9% and 68.7% of the vulnerabilities in previously generated code. The prompt agent shows how the most effective techniques can be applied in real-world development workflows.

📝 Abstract
Prompt engineering reduces reasoning mistakes in Large Language Models (LLMs). However, its effectiveness in mitigating vulnerabilities in LLM-generated code remains underexplored. To address this gap, we implemented a benchmark to automatically assess the impact of various prompt engineering strategies on code security. Our benchmark leverages two peer-reviewed prompt datasets and employs static scanners to evaluate code security at scale. We tested multiple prompt engineering techniques on GPT-3.5-turbo, GPT-4o, and GPT-4o-mini. Our results show that for GPT-4o and GPT-4o-mini, a security-focused prompt prefix can reduce the occurrence of security vulnerabilities by up to 56%. Additionally, all tested models demonstrated the ability to detect and repair between 41.9% and 68.7% of vulnerabilities in previously generated code when using iterative prompting techniques. Finally, we introduce a "prompt agent" that demonstrates how the most effective techniques can be applied in real-world development workflows.
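The two techniques named in the abstract, a security-focused prefix and iterative repair prompting, can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the prefix wording, the repair prompt wording, and the names `with_security_prefix`, `iterative_repair`, `fake_generate`, and `fake_scan` are all hypothetical, and the model and the static scanner are replaced by small stand-in functions so the example runs offline.

```python
from typing import Callable, List, Tuple

# Hypothetical prefix wording; the exact prompt text used in the paper is not given here.
SECURITY_PREFIX = "You are a security-aware developer. Write secure code.\n\n"


def with_security_prefix(task: str) -> str:
    """Prepend a security-focused instruction to the task prompt."""
    return SECURITY_PREFIX + task


def build_repair_prompt(code: str, findings: List[str]) -> str:
    """Ask the model to fix the issues a scanner reported (hypothetical wording)."""
    listed = "\n".join(f"- {finding}" for finding in findings)
    return (
        "The following code has security issues.\n"
        f"Reported findings:\n{listed}\n\n"
        f"Code:\n{code}\n\n"
        "Return a corrected version of the code."
    )


def iterative_repair(
    task: str,
    generate: Callable[[str], str],    # stand-in for a call to an LLM
    scan: Callable[[str], List[str]],  # stand-in for a static scanner
    max_rounds: int = 3,
) -> Tuple[str, List[str]]:
    """Generate code, then re-prompt with scanner findings until clean or out of rounds."""
    code = generate(with_security_prefix(task))
    findings = scan(code)
    rounds = 0
    while findings and rounds < max_rounds:
        code = generate(build_repair_prompt(code, findings))
        findings = scan(code)
        rounds += 1
    return code, findings


# Toy stand-ins so the sketch runs without network access or real tools.
def fake_generate(prompt: str) -> str:
    if "Reported findings" in prompt:
        return "result = int(user_input)"
    return "result = eval(user_input)"


def fake_scan(code: str) -> List[str]:
    return ["use of eval on untrusted input"] if "eval(" in code else []


if __name__ == "__main__":
    final_code, remaining = iterative_repair(
        "Parse a number from user input.", fake_generate, fake_scan
    )
    print(final_code)
    print(remaining)
```

Passing the model and the scanner in as callables keeps the loop independent of any particular API client or analysis tool, so either could be swapped for a real one without changing the control flow.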

Problem

Research questions and friction points this paper is trying to address.

Assessing prompt engineering for secure code generation
Mitigating vulnerabilities in LLM-generated code
Evaluating code security with static analysis tools

Innovation

Methods, ideas, or system contributions that make the work stand out.

Automated benchmark for prompt engineering
Static scanners for code security
Iterative prompting for vulnerability repair
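
As a rough illustration of how a benchmark of this kind might turn scanner output into a headline number, the sketch below computes a vulnerability rate per prompting condition and the relative reduction between two conditions. It assumes the reported reduction is relative to a baseline rate, which the summary above does not state explicitly. The function names and the finding counts are made up for illustration and are not taken from the paper.

```python
from typing import Sequence


def vulnerability_rate(findings_per_sample: Sequence[int]) -> float:
    """Fraction of generated samples with at least one scanner finding."""
    if not findings_per_sample:
        raise ValueError("need at least one sample")
    vulnerable = sum(1 for count in findings_per_sample if count > 0)
    return vulnerable / len(findings_per_sample)


def relative_reduction(baseline_rate: float, treated_rate: float) -> float:
    """Relative drop in vulnerability rate compared with the baseline."""
    if baseline_rate == 0:
        raise ValueError("baseline rate is zero; reduction is undefined")
    return (baseline_rate - treated_rate) / baseline_rate


# Made-up finding counts per generated sample, for illustration only.
baseline_counts = [1, 0, 2, 0, 1, 0, 0, 3]  # plain prompt
prefixed_counts = [0, 0, 1, 0, 0, 0, 0, 1]  # security-focused prefix

baseline = vulnerability_rate(baseline_counts)
prefixed = vulnerability_rate(prefixed_counts)
print(f"baseline rate: {baseline:.2f}")
print(f"prefixed rate: {prefixed:.2f}")
print(f"relative reduction: {relative_reduction(baseline, prefixed):.0%}")
```

Counting a sample as vulnerable when it has any finding is one possible design choice; a benchmark could instead count individual findings or weight them by severity.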