🤖 AI Summary
This study addresses the unresolved question of how the quality of code produced by AI code assistants compares to that of human developers, amid the assistants' growing adoption. We conduct the first large-scale empirical comparison, analyzing over 500,000 Python and Java code samples. Our methodology integrates Orthogonal Defect Classification (ODC) and the Common Weakness Enumeration (CWE), augmented by static analysis and software complexity metrics, to systematically assess differences in defect prevalence, security vulnerabilities, and structural characteristics. Results reveal that AI-generated code exhibits lower structural complexity but a significantly higher incidence of critical security weaknesses, particularly hardcoded credentials and insecure deserialization, alongside a distinct defect distribution pattern. In contrast, human-written code shows higher cyclomatic and cognitive complexity, posing greater maintainability challenges. These findings provide an empirical, taxonomy-informed basis for designing differentiated quality assurance mechanisms tailored to human- versus AI-generated code.
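To make the security finding concrete: hardcoded credentials (CWE-798) are exactly the kind of weakness a static analyzer can flag without running the code. The paper does not publish its tooling, so the following is only a minimal, hypothetical sketch of such a check: it walks a Python AST and reports string literals assigned to credential-like variable names (the `SUSPICIOUS_NAMES` set is an illustrative assumption, not taken from the study).

```python
import ast

# Hypothetical credential-like identifiers; real analyzers use far larger,
# pattern-based lists.
SUSPICIOUS_NAMES = {"password", "passwd", "secret", "api_key", "token"}

def find_hardcoded_credentials(source: str) -> list[int]:
    """Return line numbers where a string literal is assigned to a
    credential-like variable name (a rough CWE-798 heuristic)."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Assign):
            # Only flag assignments whose right-hand side is a string constant.
            if isinstance(node.value, ast.Constant) and isinstance(node.value.value, str):
                for target in node.targets:
                    if isinstance(target, ast.Name) and target.id.lower() in SUSPICIOUS_NAMES:
                        findings.append(node.lineno)
    return findings

sample = "host = 'db.example.com'\npassword = 'hunter2'\n"
print(find_hardcoded_credentials(sample))  # → [2]
```

Production linters (e.g., Bandit for Python) combine many such AST rules with entropy checks on the literal itself; this sketch shows only the structural idea.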
📝 Abstract
As AI code assistants become increasingly integrated into software development workflows, understanding how their code compares to human-written programs is critical for ensuring reliability, maintainability, and security. In this paper, we present a large-scale comparison of code authored by human developers and by three state-of-the-art LLMs (ChatGPT, DeepSeek-Coder, and Qwen-Coder) across multiple dimensions of software quality: code defects, security vulnerabilities, and structural complexity. Our evaluation spans over 500,000 code samples in two widely used languages, Python and Java, classifying defects via Orthogonal Defect Classification and security vulnerabilities via the Common Weakness Enumeration. We find that AI-generated code is generally simpler and more repetitive, yet more prone to unused constructs and hardcoded debugging artifacts, while human-written code exhibits greater structural complexity and a higher concentration of maintainability issues. Notably, AI-generated code also contains more high-risk security vulnerabilities. These findings highlight the distinct defect profiles of AI- and human-authored code and underscore the need for specialized quality assurance practices in AI-assisted programming.