🤖 AI Summary
This study investigates how three prompt engineering paradigms, Zero-Shot, Chain-of-Thought (CoT), and Few-Shot, affect the maintainability, security, and reliability of code generated by ChatGPT. Method: Leveraging 7,583 code samples from the Dev-GPT dataset, we conduct static code quality analysis and apply the Kruskal-Wallis nonparametric test to systematically assess associations between prompt structures and multidimensional code quality metrics, an approach not systematically applied in prior work. Contribution/Results: No statistically significant differences in code quality across the three dimensions were found among the prompt types. Zero-Shot prompting exhibits the broadest adoption and the lowest overall defect rate, while structured prompts (CoT and Few-Shot) yield no empirically supported quality improvement. These findings challenge the prevalent assumption that greater prompt complexity inherently enhances code quality, providing an evidence-based benchmark and methodological guidance for prompt engineering in large language model-based code generation.
📝 Abstract
Large Language Models (LLMs) have rapidly transformed software development, especially code generation. However, their inconsistent performance, prone to hallucinations and quality issues, complicates program comprehension and hinders maintainability. Research indicates that prompt engineering, the practice of designing inputs to direct LLMs toward generating relevant outputs, may help address these challenges. To that end, researchers have introduced prompt patterns: structured templates intended to guide users in formulating their requests. However, the influence of prompt patterns on code quality has yet to be thoroughly investigated. A better understanding of this relationship is essential to advancing our collective knowledge of how to use LLMs effectively for code generation, thereby enhancing their understandability in contemporary software development. This paper empirically investigates the impact of prompt patterns on code quality, specifically maintainability, security, and reliability, using the Dev-GPT dataset. Results show that Zero-Shot prompting is the most common, followed by Zero-Shot with Chain-of-Thought and Few-Shot. Analysis of 7,583 code files across quality metrics revealed minimal issues, with Kruskal-Wallis tests indicating no significant differences among patterns, suggesting that prompt structure may not substantially affect these quality metrics in ChatGPT-assisted code generation.
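The Kruskal-Wallis comparison described above can be sketched in a few lines with `scipy.stats.kruskal`. This is a minimal illustration on synthetic data, not the study's actual Dev-GPT measurements: the group names and the Poisson-distributed issue counts are assumptions made purely to show the shape of the analysis.

```python
import numpy as np
from scipy.stats import kruskal

rng = np.random.default_rng(42)

# Hypothetical per-file issue counts for each prompt pattern.
# Synthetic illustrative data only; NOT the paper's measurements.
zero_shot = rng.poisson(lam=0.8, size=200)
cot = rng.poisson(lam=0.8, size=150)
few_shot = rng.poisson(lam=0.8, size=100)

# Kruskal-Wallis H-test: a nonparametric check of whether the
# samples could come from the same distribution (no normality assumed).
stat, p_value = kruskal(zero_shot, cot, few_shot)
print(f"H = {stat:.3f}, p = {p_value:.3f}")

if p_value >= 0.05:
    print("No statistically significant difference among prompt patterns")
```

A nonparametric test is the natural choice here because static-analysis issue counts are discrete, skewed, and heavy at zero, which violates the normality assumption behind a one-way ANOVA.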