🤖 AI Summary
Quantitative evaluation of how prompt engineering techniques affect the security of code generated by large language models (LLMs) remains underexplored. Method: We introduce an automated security benchmark integrating two peer-reviewed prompt datasets with static analysis tools (Semgrep, CodeQL) to systematically evaluate GPT-3.5-turbo, GPT-4o, and GPT-4o-mini. We further propose the “Prompt Agent” framework, combining security-aware prefix prompting and multi-turn iterative repair. Contribution/Results: Security-aware prefix prompts reduce vulnerability rates in GPT-4o and GPT-4o-mini by up to 56%. Under iterative prompting, all tested models detect and repair 41.9%–68.7% of the vulnerabilities in their previously generated code, and the Prompt Agent illustrates how these techniques can be integrated into real-world development workflows. This work addresses a gap in empirical, security-focused LLM evaluation and advances prompt-based mitigation strategies for insecure code generation.
📝 Abstract
Prompt engineering reduces reasoning mistakes in Large Language Models (LLMs). However, its effectiveness in mitigating vulnerabilities in LLM-generated code remains underexplored. To address this gap, we implemented a benchmark to automatically assess the impact of various prompt engineering strategies on code security. Our benchmark leverages two peer-reviewed prompt datasets and employs static scanners to evaluate code security at scale. We tested multiple prompt engineering techniques on GPT-3.5-turbo, GPT-4o, and GPT-4o-mini. Our results show that for GPT-4o and GPT-4o-mini, a security-focused prompt prefix can reduce the occurrence of security vulnerabilities by up to 56%. Additionally, all tested models demonstrated the ability to detect and repair between 41.9% and 68.7% of vulnerabilities in previously generated code when using iterative prompting techniques. Finally, we introduce a "prompt agent" that demonstrates how the most effective techniques can be applied in real-world development workflows.
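The two techniques named above, a security-focused prefix and iterative repair, can be illustrated with a minimal sketch. This is not the paper's implementation: the prefix wording, the function name `generate_with_repair`, and the choice to feed scanner findings back into the repair prompt are all assumptions made for illustration. The `generate` and `scan` callables are hypothetical stand-ins for an LLM call and a static scanner, injected so the loop itself stays independent of any particular API.

```python
from typing import Callable, List, Tuple

# Hypothetical prefix wording; the abstract does not give the actual prompt text.
SECURITY_PREFIX = (
    "You are a security-aware developer. Write code free of vulnerabilities.\n"
)


def generate_with_repair(
    task: str,
    generate: Callable[[str], str],      # assumed: prompt -> generated code
    scan: Callable[[str], List[str]],    # assumed: code -> list of findings
    max_rounds: int = 3,
) -> Tuple[str, List[str]]:
    """Generate code with a security prefix, then re-prompt while findings remain."""
    code = generate(SECURITY_PREFIX + task)
    findings = scan(code)

    for _ in range(max_rounds):
        if not findings:
            break  # scanner reports nothing, stop iterating
        # Assumption: the repair prompt lists the findings and includes the old code.
        repair_prompt = (
            SECURITY_PREFIX
            + "The following code has these issues:\n"
            + "\n".join(f"- {finding}" for finding in findings)
            + "\nRewrite it to fix them.\n\n"
            + code
        )
        code = generate(repair_prompt)
        findings = scan(code)

    # Remaining findings are returned so the caller can see what was not repaired.
    return code, findings
```

In practice `generate` would wrap a chat-completion request and `scan` would wrap a tool such as Semgrep or CodeQL; both wrappers are left out here because their exact invocation is not described in the text.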