LLM vs. Human Unit Tests: Fault Detection on Real Python Bugs

📅 2026-06-07
🤖 AI Summary
This study systematically evaluates the fault-detection capability of unit tests generated by large language models (LLMs) in real-world Python defect scenarios. By integrating Gemini 2.5 Flash with a lightweight lexical retrieval mechanism, the authors generate context-enhanced tests on established real-defect benchmarks such as BugsInPy and conduct a multidimensional comparison against human-written tests. This work presents the first such evaluation on historical real-world defects, revealing that retrieval-augmented LLM-generated tests successfully detect faults in 69% of cases, significantly outperforming human-written tests, which achieve only 17.2% detection (p<0.001). Notably, both test types reach comparable code coverage, which suggests that coverage is an inadequate proxy for fault-detection effectiveness. The results also underscore the critical role of retrieved context in enhancing test quality.
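The summary does not specify how the lexical retrieval mechanism works. As a rough illustration only, the sketch below ranks candidate code snippets by token overlap with a bug description and returns the best matches as context for a test-generation prompt. The function names (`tokenize`, `overlap_score`, `retrieve`), the scoring rule, and the sample snippets are all hypothetical and are not taken from the paper.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split on any non-alphanumeric run (also splits snake_case)."""
    return [t for t in re.split(r"[^A-Za-z0-9]+", text.lower()) if t]


def overlap_score(query_tokens, doc_tokens):
    """Shared-token count, damped by snippet length (an assumed scoring rule)."""
    if not doc_tokens:
        return 0.0
    shared = sum((Counter(query_tokens) & Counter(doc_tokens)).values())
    return shared / (len(doc_tokens) ** 0.5)


def retrieve(query, snippets, k=2):
    """Return the names of the k snippets that best match the query lexically."""
    q = tokenize(query)
    ranked = sorted(
        snippets,
        key=lambda name: overlap_score(q, tokenize(snippets[name])),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical snippets and bug description, for illustration only.
snippets = {
    "slugify": "def slugify(text, separator='-'): return separator.join(text.lower().split())",
    "parse_version": "def parse_version(version_string): return tuple(int(p) for p in version_string.split('.'))",
    "read_config": "def read_config(path): return open(path).read()",
}
bug_report = "slugify drops the separator when text has trailing spaces"
top = retrieve(bug_report, snippets, k=1)
print(top)
```

A real pipeline would likely index whole repositories and use a stronger lexical ranker such as BM25. The point here is only that retrieval of this kind needs no embeddings or model calls.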
๐Ÿ“ Abstract
Large language models (LLMs) have shown considerable promise for automated unit test generation, yet their practical effectiveness relative to human-written tests remains poorly understood. Existing evaluations commonly rely on coverage-oriented benchmarks that do not assess fault-detection capability directly. We present an empirical comparison of LLM-generated and human-written unit tests across three complementary Python benchmarks: 29 real historical bugs from BugsInPy, a function-level benchmark drawn from python-slugify and packaging, and a controlled paired benchmark. Our generation pipeline couples Gemini 2.5 Flash with a lightweight lexical retrieval mechanism that supplies bug-relevant context at generation time. Across eight quality dimensions, LLM-generated tests with retrieval-augmented context detect faults in 69% of cases compared to 17.2% for general-purpose human-written tests (Fisher's exact, $p < 0.001$, Cohen's $h = 1.10$). Critically, line and branch coverage are nearly identical between the two approaches (84.8% vs. 88.5% and 75.2% vs. 82.1%), confirming that coverage is an insufficient proxy for fault-detection capability. We discuss the conditions under which each approach excels, characterize their complementary strengths, and identify the critical role of retrieval context and reproducible benchmark construction in meaningful test-quality evaluation.
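The reported effect size can be checked from the two detection rates alone, since Cohen's h is defined as the difference of arcsine-transformed proportions. The sketch below recomputes h from the stated 69% and 17.2%, and runs a two-sided Fisher's exact test using only the standard library. The cell counts (20 of 29 and 5 of 29) are an assumption: they are back-calculated from the percentages and the 29 BugsInPy bugs, and the abstract does not state them.

```python
import math


def cohens_h(p1, p2):
    """Cohen's h: difference of arcsine-transformed proportions."""
    return 2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2))


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of every table with the same
    margins that is no more likely than the observed one.
    """
    row1, row2, col1 = a + b, c + d, a + c
    total = math.comb(row1 + row2, col1)

    def prob(x):
        return math.comb(row1, x) * math.comb(row2, col1 - x) / total

    p_observed = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Small relative tolerance so ties survive floating-point rounding.
    return sum(prob(x) for x in range(lo, hi + 1)
               if prob(x) <= p_observed * (1 + 1e-9))


# Proportions as reported in the abstract.
h = cohens_h(0.69, 0.172)

# ASSUMED counts (not stated in the abstract): 20/29 detected by LLM tests,
# 5/29 detected by human-written tests.
p_value = fisher_exact_two_sided(20, 9, 5, 24)

print(f"Cohen's h = {h:.2f}")
print(f"Fisher's exact p = {p_value:.5f}")
```

Under these assumed counts the p-value falls below 0.001 and h comes out at roughly 1.10, consistent with the figures in the abstract. An h above 0.8 is conventionally read as a large effect.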
Problem

Research questions and friction points this paper is trying to address.

LLM
unit test
fault detection
Python bugs
test effectiveness
Innovation

Methods, ideas, or system contributions that make the work stand out.

retrieval-augmented test generation
fault detection
LLM-generated unit tests
coverage insufficiency
empirical benchmarking