🤖 AI Summary
This study uncovers a paradox in iterative code generation by large language models (LLMs): security does not improve steadily with successive feedback rounds, and vulnerability risk can instead intensify. Method: In a controlled experiment covering 400 code samples across 40 rounds of iterative refinement under four distinct prompting strategies, we scanned the generated code with the static analyzers Semgrep and Bandit and grouped the findings by hierarchical vulnerability clustering. Contribution/Results: We report empirical evidence that critical vulnerabilities increase by 37.6% after just five iterations, with vulnerability patterns that differ by prompting strategy. Based on these findings, we propose a “human-in-the-loop verification” framework that requires human security validation between LLM iterations. Our work challenges the prevailing assumption that iterative refinement inherently improves security and offers actionable, process-integrated guidelines for LLM-assisted software development.
📝 Abstract
The rapid adoption of Large Language Models (LLMs) for code generation has transformed software development, yet little attention has been given to how security vulnerabilities evolve through iterative LLM feedback. This paper analyzes security degradation in AI-generated code through a controlled experiment with 400 code samples across 40 rounds of "improvements" using four distinct prompting strategies. Our findings show a 37.6% increase in critical vulnerabilities after just five iterations, with distinct vulnerability patterns emerging across different prompting approaches. This evidence challenges the assumption that iterative LLM refinement improves code security and highlights the essential role of human expertise in the loop. We propose practical guidelines for developers to mitigate these risks, emphasizing the need for robust human validation between LLM iterations to prevent the paradoxical introduction of new security issues during supposedly beneficial code "improvements".
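The idea of human validation between LLM iterations can be illustrated with a minimal sketch. This is not the paper's framework: `refine_with_gate`, `GateResult`, and the three toy callables are hypothetical names introduced here. The scanner is passed in as a plain function, so it could wrap a real tool or, as below, a stub. The loop assumes one policy among many: a reviewer sees the critical-finding count before and after each round and a rejected round is discarded.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class GateResult:
    """Outcome of a gated refinement run (hypothetical structure)."""
    code: str
    accepted_rounds: List[int] = field(default_factory=list)
    rejected_rounds: List[int] = field(default_factory=list)


def refine_with_gate(
    code: str,
    refine: Callable[[str], str],                  # stands in for one LLM "improvement" round
    count_critical: Callable[[str], int],          # stands in for a security scanner
    reviewer: Callable[[str, int, int], bool],     # stands in for the human decision
    rounds: int,
) -> GateResult:
    """Run `rounds` refinement steps, keeping a step only if the reviewer approves it."""
    result = GateResult(code=code)
    current_critical = count_critical(code)
    for round_no in range(1, rounds + 1):
        candidate = refine(result.code)
        candidate_critical = count_critical(candidate)
        # The reviewer is consulted every round, not only when findings rise.
        if reviewer(candidate, current_critical, candidate_critical):
            result.code = candidate
            current_critical = candidate_critical
            result.accepted_rounds.append(round_no)
        else:
            # Rejected: the next round starts again from the last accepted code.
            result.rejected_rounds.append(round_no)
    return result


# Toy stand-ins, for illustration only.
def toy_refine(code: str) -> str:
    # Pretend each "improvement" slips in one more risky call.
    return code + "eval(user_input)\n"


def toy_count_critical(code: str) -> int:
    # Pretend scanner: counts occurrences of eval(.
    return code.count("eval(")


def toy_reviewer(candidate: str, before: int, after: int) -> bool:
    # Pretend human: refuses any round that adds critical findings.
    return after <= before


result = refine_with_gate(
    "print('hi')\n", toy_refine, toy_count_critical, toy_reviewer, rounds=3
)
print(result.rejected_rounds, repr(result.code))
```

With these stubs every round adds a finding, so the reviewer rejects all three and the original code is kept. Passing the counts to the reviewer rather than deciding automatically keeps the human decision explicit, which is the point the abstract emphasizes.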