How to Compare the Security of Code Written by Humans to LLM-generated Code

📅 2026-05-29
📈 Citations: 0
✨ Influential: 0

🤖 AI Summary
This study addresses the current lack of standardized methodologies for empirically comparing the security of code produced by humans, large language models (LLMs), and human-AI collaboration. To bridge this gap, the authors propose a "species-fair" experimental paradigm and develop an automated evaluation framework that enables systematic, apples-to-apples comparisons under unified conditions. The framework integrates multidimensional static analysis and dynamic testing to assess code security across the three categories, while automatically logging prompts, timestamps, and experimental configurations. This work presents the first reproducible and scalable approach to comparative security evaluation between human and AI-generated code, demonstrates its feasibility through empirical validation, and provides an open-source experimental blueprint to support future research in this emerging domain.
📝 Abstract
Large language models (LLMs) are rapidly transforming how software is created and maintained. Comparing LLM-generated code against human-written standards is essential to determine whether these new tools uphold or erode the security baselines established by professional developers. Yet we lack a standardized method for empirically comparing the security of code produced through human-LLM collaboration against code produced by LLM-only or traditional human-only methods. To facilitate this, we propose an automated framework for conducting comparative studies across human-only, LLM-only, and hybrid conditions. Our approach automates the logging of prompts, timing, and experimental settings, measuring outcomes through multi-dimensional static and dynamic quality analysis. We provide an open-source implementation of this framework to ensure that future researchers can conduct reproducible, species-fair experiments. Importantly, we validate the framework via a feasibility study, providing an experimental blueprint for "species-fair" comparisons between human and AI subjects. By sharing lessons learned, we establish a foundation for empirical research on the security of human-written and LLM-generated code.
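To make the logging idea concrete, here is a minimal sketch of what one per-run record could look like when prompts, timing, and experimental settings are captured uniformly across the human-only, LLM-only, and hybrid conditions. It is not taken from the authors' open-source implementation: the names `Condition`, `RunRecord`, `log_prompt`, `finish`, and `to_json`, as well as the field layout, are all hypothetical and chosen only for illustration.

```python
import json
import time
from dataclasses import dataclass, field, asdict
from enum import Enum
from typing import Optional


class Condition(str, Enum):
    """The three experimental conditions named in the abstract."""
    HUMAN_ONLY = "human-only"
    LLM_ONLY = "llm-only"
    HYBRID = "hybrid"


@dataclass
class RunRecord:
    """Hypothetical record for one participant (human or LLM) on one task."""
    participant_id: str
    condition: Condition
    task_id: str
    settings: dict                       # e.g. model name, time limit (assumed keys)
    started_at: float = field(default_factory=time.time)
    finished_at: Optional[float] = None
    prompts: list = field(default_factory=list)

    def log_prompt(self, text: str) -> None:
        # Store each prompt with the wall-clock time at which it was issued.
        self.prompts.append({"at": time.time(), "text": text})

    def finish(self) -> None:
        # Mark the end of the run so task duration can be derived later.
        self.finished_at = time.time()

    def to_json(self) -> str:
        # Serialize to one JSON object; the enum is stored as its string value.
        data = asdict(self)
        data["condition"] = self.condition.value
        return json.dumps(data, sort_keys=True)


if __name__ == "__main__":
    record = RunRecord(
        participant_id="p01",
        condition=Condition.HYBRID,
        task_id="task-1",
        settings={"model": "hypothetical-model", "time_limit_min": 30},
    )
    record.log_prompt("Write a function that validates a file path.")
    record.finish()
    print(record.to_json())
```

Using the same record shape for every condition is one way to keep the comparison even-handed: a human-only run simply has an empty `prompts` list, while the timing and settings fields are filled in identically for all three groups.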
Problem

Research questions and friction points this paper is trying to address.

code security
large language models
human-AI collaboration
empirical comparison
software security
Innovation

Methods, ideas, or system contributions that make the work stand out.

automated framework
species-fair comparison
LLM-generated code
code security
empirical evaluation