How to Compare the Security of Code Written by Humans to LLM-generated Code

📅 2026-05-29

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study addresses the lack of a standardized methodology for empirically comparing the security of code produced by humans, by large language models (LLMs), and through human–AI collaboration. To bridge this gap, the authors propose a “species-fair” experimental paradigm and develop an automated evaluation framework that enables systematic, like-for-like comparisons under unified conditions. The framework combines multi-dimensional static and dynamic analysis to assess code security across the three conditions (human-only, LLM-only, and hybrid), while automatically logging prompts, timing, and experimental settings. The work presents a reproducible approach to comparing the security of human-written and LLM-generated code, validates it through a feasibility study, and provides an open-source implementation and experimental blueprint to support future research in this area.

📝 Abstract

Large language models (LLMs) are rapidly transforming how software is created and maintained. Comparing LLM-generated code against human-written standards is essential to determine whether these new tools uphold or erode the security baselines established by professional developers. Yet we lack a standardized method for empirically comparing the security of code produced through human-LLM collaboration against code produced by LLM-only or traditional human-only methods. To facilitate this, we propose an automated framework for conducting comparative studies across human-only, LLM-only, and hybrid conditions. Our approach automates the logging of prompts, timing, and experimental settings, and measures outcomes through multi-dimensional static and dynamic quality analysis. We provide an open-source implementation of this framework so that future researchers can conduct reproducible, species-fair experiments. Importantly, we validate the framework via a feasibility study, providing an experimental blueprint for “species-fair” comparisons between human and AI subjects. By sharing lessons learned, we establish a foundation for empirical research on the security of human-written and LLM-generated code.
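
To make the logging idea in the abstract concrete, here is a minimal sketch of how one session's prompts, timing, and experimental settings might be recorded under a single schema for all three conditions. It is not the authors' implementation: the names `SessionLog`, `Event`, `record`, and `to_json`, the condition labels, and the settings keys are all hypothetical and chosen only for illustration.

```python
import json
import time
from dataclasses import asdict, dataclass, field
from typing import List

# Hypothetical labels for the three study conditions named in the abstract.
CONDITIONS = ("human-only", "llm-only", "hybrid")


@dataclass
class Event:
    kind: str         # e.g. "prompt" or "submission" (illustrative categories)
    payload: str      # the prompt text or a reference to the submitted code
    elapsed_s: float  # seconds since the session started


@dataclass
class SessionLog:
    participant_id: str
    condition: str
    settings: dict    # experimental settings, e.g. model name (assumed keys)
    events: List[Event] = field(default_factory=list)
    _start: float = field(default_factory=time.monotonic, repr=False)

    def __post_init__(self) -> None:
        # Reject unknown conditions so every log uses the same vocabulary.
        if self.condition not in CONDITIONS:
            raise ValueError(f"unknown condition: {self.condition!r}")

    def record(self, kind: str, payload: str) -> Event:
        # Use a monotonic clock so timing is unaffected by wall-clock changes.
        event = Event(kind, payload, time.monotonic() - self._start)
        self.events.append(event)
        return event

    def to_json(self) -> str:
        # Serialize everything except the internal start marker.
        data = asdict(self)
        data.pop("_start")
        return json.dumps(data, sort_keys=True)


log = SessionLog("p01", "hybrid", {"model": "example-model", "temperature": 0.2})
log.record("prompt", "Write a login handler")
log.record("submission", "login.py")
print(log.to_json())
```

Using the same record shape for human-only, LLM-only, and hybrid sessions is one way to keep later comparisons on equal footing; a real study would also need to store the analysis results alongside each log.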

Problem

Research questions and friction points this paper is trying to address.

code security

large language models

human-AI collaboration

empirical comparison

software security

Innovation

Methods, ideas, or system contributions that make the work stand out.

automated framework

species-fair comparison

LLM-generated code