🤖 AI Summary
This study systematically compares the maintainability and reliability of code generated by large language models (LLMs) versus human-written code. We evaluate zero-shot, few-shot, and fine-tuned LLM paradigms on multi-difficulty Python programming tasks, employing SonarQube for static analysis to quantify defect density, remediation cost, severity distribution, and architectural risk. Our key contribution is a unified, quantitative assessment of code quality across diverse prompting strategies, revealing LLM-specific structural defects, such as high-risk anti-patterns absent from human code, particularly in complex or competitive programming tasks. Results indicate that LLM-generated code exhibits lower overall defect density and reduced average remediation cost than its human counterparts; fine-tuning significantly decreases critical-severity defects but may reduce the model's task performance. The study establishes empirically grounded quality benchmarks and actionable optimization guidelines for production-grade LLM-assisted software development.
📝 Abstract
Background: The rise of Large Language Models (LLMs) in software development has opened new possibilities for code generation. Despite the widespread use of this technology, it remains unclear how well LLMs generate code solutions in terms of software quality and how they compare to human-written code. Aims: This study compares the internal quality attributes of LLM-generated and human-written code. Method: Our empirical study integrates datasets of coding tasks, three LLM configurations (zero-shot, few-shot, and fine-tuning), and SonarQube to assess software quality. The dataset comprises Python code solutions across three difficulty levels: introductory, interview, and competition. We analyzed key code quality metrics, including maintainability and reliability, as well as the estimated effort required to resolve code issues. Results: Our analysis shows that LLM-generated code has fewer bugs and requires less effort to fix them overall. Interestingly, fine-tuned models reduced the prevalence of high-severity issues, such as blocker and critical bugs, shifting them to lower-severity categories, but decreased the model's performance. In competition-level problems, LLM-generated solutions sometimes introduced structural issues that are absent from human-written code. Conclusion: Our findings provide valuable insights into the quality of LLM-generated code; however, the introduction of critical issues in more complex scenarios highlights the need for systematic evaluation and validation of LLM solutions. Our work deepens the understanding of the strengths and limitations of LLMs for code generation.
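The quality metrics named in the abstract (defect density, remediation effort, severity distribution) can be illustrated with a short sketch. The snippet below is not the study's actual pipeline; the `Issue` record, its field names, and the per-KLOC normalization are assumptions modeled loosely on the kind of issue data a SonarQube analysis reports.

```python
from dataclasses import dataclass
from collections import Counter

@dataclass
class Issue:
    """A SonarQube-style issue record (field names are illustrative)."""
    severity: str        # e.g. "BLOCKER", "CRITICAL", "MAJOR", "MINOR"
    effort_minutes: int  # estimated remediation effort for this issue

def defect_density(issues: list[Issue], lines_of_code: int) -> float:
    """Issues per 1,000 lines of code (KLOC)."""
    return len(issues) / (lines_of_code / 1000)

def remediation_effort(issues: list[Issue]) -> int:
    """Total estimated minutes needed to resolve all issues."""
    return sum(i.effort_minutes for i in issues)

def severity_distribution(issues: list[Issue]) -> Counter:
    """Number of issues in each severity category."""
    return Counter(i.severity for i in issues)

# Hypothetical 2,000-line solution with three reported issues.
issues = [
    Issue("CRITICAL", 30),
    Issue("MAJOR", 10),
    Issue("MINOR", 5),
]
print(defect_density(issues, 2000))   # 1.5 issues per KLOC
print(remediation_effort(issues))     # 45 minutes
print(severity_distribution(issues))
```

Comparing these three numbers between an LLM-generated and a human-written solution to the same task is, in essence, the shape of the comparison the study performs at scale.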