🤖 AI Summary
This study systematically compares the maintainability and reliability of code generated by large language models (LLMs) versus human-written code. We evaluate zero-shot, few-shot, and fine-tuned LLM paradigms on multi-difficulty Python programming tasks, employing SonarQube for static analysis to quantify defect density, remediation cost, severity distribution, and architectural risk. Our key contribution is a unified, quantitative assessment of code quality across diverse prompting strategies, revealing LLM-specific structural defects, such as high-risk anti-patterns absent from human code, particularly in complex or competitive programming tasks. Results indicate that LLM-generated code exhibits lower overall defect density and reduced average remediation cost than its human counterparts; fine-tuning significantly decreases critical-severity defects but may reduce the model's task performance. The study establishes empirically grounded quality benchmarks and actionable optimization guidelines for production-grade LLM-assisted software development.
📝 Abstract
Background: The rise of Large Language Models (LLMs) in software development has opened new possibilities for code generation. Despite the widespread use of this technology, it remains unclear how well LLMs generate code solutions in terms of software quality and how they compare to human-written code. Aims: This study compares the internal quality attributes of LLM-generated and human-written code. Method: Our empirical study integrates datasets of coding tasks, three LLM configurations (zero-shot, few-shot, and fine-tuning), and SonarQube to assess software quality. The dataset comprises Python code solutions across three difficulty levels: introductory, interview, and competition. We analyzed key code quality metrics, including maintainability and reliability, as well as the estimated effort required to resolve code issues. Results: Our analysis shows that LLM-generated code has fewer bugs and requires less effort to fix them overall. Interestingly, fine-tuned models reduced the prevalence of high-severity issues, such as blocker and critical bugs, shifting them to lower-severity categories, but decreased the model's performance. In competition-level problems, LLM-generated solutions sometimes introduced structural issues that are absent from human-written code. Conclusion: Our findings provide valuable insights into the quality of LLM-generated code; however, the introduction of critical issues in more complex scenarios highlights the need for systematic evaluation and validation of LLM solutions. Our work deepens the understanding of the strengths and limitations of LLMs for code generation.
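The quality metrics named in the abstract (defect density, remediation effort, severity distribution) can be illustrated with a short sketch. The snippet below is not the study's actual pipeline; the `Issue` record, its field names, and the per-KLOC normalization are assumptions modeled loosely on the kind of issue data a SonarQube analysis reports.

```python
from dataclasses import dataclass
from collections import Counter

@dataclass
class Issue:
    """A SonarQube-style issue record (field names are illustrative)."""
    severity: str        # e.g. "BLOCKER", "CRITICAL", "MAJOR", "MINOR"
    effort_minutes: int  # estimated remediation effort for this issue

def defect_density(issues: list[Issue], lines_of_code: int) -> float:
    """Issues per 1,000 lines of code (KLOC)."""
    return len(issues) / (lines_of_code / 1000)

def remediation_effort(issues: list[Issue]) -> int:
    """Total estimated minutes needed to resolve all issues."""
    return sum(i.effort_minutes for i in issues)

def severity_distribution(issues: list[Issue]) -> Counter:
    """Number of issues in each severity category."""
    return Counter(i.severity for i in issues)

# Hypothetical 2,000-line solution with three reported issues.
issues = [
    Issue("CRITICAL", 30),
    Issue("MAJOR", 10),
    Issue("MINOR", 5),
]
print(defect_density(issues, 2000))   # 1.5 issues per KLOC
print(remediation_effort(issues))     # 45 minutes
print(severity_distribution(issues))
```

Comparing these three numbers between an LLM-generated and a human-written solution to the same task is, in essence, the shape of the comparison the study performs at scale.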