The Impact of Prompt Programming on Function-Level Code Generation

📅 2024-12-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study systematically investigates how prompt engineering techniques affect the quality of function-level code generation across three large language models (GPT-4o, Llama3, and Mistral). Method: the authors introduce CodePromptEval, a controllable benchmark for evaluating prompts in function generation, comprising 7,072 prompts that cover five prompt techniques applied both individually and in combination. Evaluation uses a multi-dimensional metric suite: functional correctness, semantic similarity (CodeBLEU), and code quality. Contribution/Results: few-shot examples and chain-of-thought significantly influence correctness, yet combining multiple techniques yields no synergistic gains; moreover, correctness often trades off against code quality. To support reproducible research on prompt engineering, the authors open-source the benchmark dataset, evaluation framework, and a full replication package.
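The five prompt techniques the paper evaluates (few-shot, persona, chain-of-thought, function signature, list of packages) can be composed into a single prompt. The sketch below is purely illustrative: the function name, template wording, and defaults are assumptions, not taken from CodePromptEval itself.

```python
# Hypothetical composition of the five prompt techniques studied in the
# paper. Template wording is illustrative, not CodePromptEval's actual text.
def build_prompt(task, *, persona=False, few_shot=None, cot=False,
                 signature=None, packages=None):
    parts = []
    if persona:                       # persona technique
        parts.append("You are an expert Python developer.")
    parts.append(f"Write a function that {task}.")
    if signature:                     # function-signature technique
        parts.append(f"Use this signature: {signature}")
    if packages:                      # list-of-packages technique
        parts.append("Allowed packages: " + ", ".join(packages))
    if few_shot:                      # few-shot (input-output examples)
        for inp, out in few_shot:
            parts.append(f"Example input: {inp}\nExample output: {out}")
    if cot:                           # chain-of-thought technique
        parts.append("Think step by step before writing the code.")
    return "\n".join(parts)

prompt = build_prompt(
    "returns the n-th Fibonacci number",
    persona=True,
    signature="def fib(n: int) -> int",
    cot=True,
)
print(prompt)
```

Each keyword argument toggles one technique, so the 7,072-prompt grid in the benchmark corresponds to sweeping such toggles (and their combinations) over a set of tasks.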

📝 Abstract
Large Language Models (LLMs) are increasingly used by software engineers for code generation. However, limitations of LLMs such as irrelevant or incorrect code have highlighted the need for prompt programming (or prompt engineering) where engineers apply specific prompt techniques (e.g., chain-of-thought or input-output examples) to improve the generated code. Despite this, the impact of different prompt techniques -- and their combinations -- on code generation remains underexplored. In this study, we introduce CodePromptEval, a dataset of 7072 prompts designed to evaluate five prompt techniques (few-shot, persona, chain-of-thought, function signature, list of packages) and their effect on the correctness, similarity, and quality of complete functions generated by three LLMs (GPT-4o, Llama3, and Mistral). Our findings show that while certain prompt techniques significantly influence the generated code, combining multiple techniques does not necessarily improve the outcome. Additionally, we observed a trade-off between correctness and quality when using prompt techniques. Our dataset and replication package enable future research on improving LLM-generated code and evaluating new prompt techniques.
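The correctness dimension mentioned in the abstract amounts to a pass/fail check of the generated function against test cases. The following is a minimal, generic sketch of such a harness, not the paper's actual evaluation code; the helper name and test-case format are assumptions.

```python
# Generic pass/fail functional-correctness check for a generated function.
# Illustrative only; CodePromptEval's real harness may differ.
def passes_tests(candidate_src: str, tests: list) -> bool:
    """Execute generated source and run (args, expected) test cases."""
    namespace = {}
    try:
        exec(candidate_src, namespace)  # define the candidate function
        # Pick the first callable the candidate defined.
        func = next(v for v in namespace.values() if callable(v))
        return all(func(*args) == expected for args, expected in tests)
    except Exception:                   # syntax or runtime failure -> fail
        return False

generated = "def add(a, b):\n    return a + b"
print(passes_tests(generated, [((1, 2), 3), ((0, 0), 0)]))  # True
```

Similarity (e.g., CodeBLEU against a reference solution) and quality (e.g., linter findings) are measured separately, which is how the reported correctness-versus-quality trade-off can be observed.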
Problem

Research questions and friction points this paper is trying to address.

Prompt Engineering
Large Language Models
Code Generation Quality
Innovation

Methods, ideas, or system contributions that make the work stand out.

CodePromptEval
Prompting Techniques
Code Quality Assessment