Benchmarking LLMs for Unit Test Generation from Real-World Functions

📅 2025-08-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing LLM test-generation benchmarks suffer from data contamination and oversimplified function structures, undermining the validity and generalizability of empirical findings. To address these issues, we propose ULT, the first contamination-free, high-cyclomatic-complexity unit-testing benchmark targeting real-world Python functions, and introduce PLT, the first benchmark to enable a decoupled evaluation of LLMs' memorization versus reasoning in test generation. ULT ensures task authenticity and evaluation reliability through multi-stage filtering, cyclomatic complexity constraints, and rigorous leakage detection. Experimental results reveal a substantial performance drop for mainstream LLMs on ULT: average functional correctness is only 41.32%, statement and branch coverage reach just 45.10% and 30.22% respectively, and the mutation score is 40.21%. These findings expose fundamental limitations of current LLMs in realistic software testing scenarios.
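The cyclomatic-complexity constraint mentioned above can be pictured with a small filter. The sketch below is illustrative, not the paper's curation code: it assumes the radon package is available, and the MIN_COMPLEXITY cutoff of 5 is a hypothetical value.

```python
# Hypothetical complexity filter in the spirit of ULT's curation stage.
# Assumes: pip install radon. The threshold is illustrative only.
from radon.complexity import cc_visit

MIN_COMPLEXITY = 5  # assumed cutoff; the paper's value may differ

def passes_complexity_filter(source: str) -> bool:
    """Keep a candidate only if some block meets the complexity floor."""
    try:
        blocks = cc_visit(source)  # per-function cyclomatic complexity
    except SyntaxError:
        return False  # unparseable candidates are dropped outright
    return bool(blocks) and max(b.complexity for b in blocks) >= MIN_COMPLEXITY

trivial = "def inc(x):\n    return x + 1\n"
branchy = (
    "def grade(s):\n"
    "    if s >= 90: return 'A'\n"
    "    elif s >= 80: return 'B'\n"
    "    elif s >= 70: return 'C'\n"
    "    elif s >= 60: return 'D'\n"
    "    else: return 'F'\n"
)
print(passes_complexity_filter(trivial))  # False: complexity 1
print(passes_complexity_filter(branchy))  # True: complexity 5
```

A straight-line function like `inc` has complexity 1 and is rejected; each additional branch raises the score, which is exactly the structural richness the benchmark selects for.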

📝 Abstract
Recently, large language models (LLMs) have shown great promise in automating unit test generation, significantly reducing the manual effort required by developers. To effectively evaluate the capabilities of LLMs in this domain, it is crucial to have a well-designed benchmark that accurately reflects real-world scenarios and mitigates common pitfalls. Existing LLM test-generation benchmarks are limited by two critical drawbacks: data contamination and structurally simple function code. As a result, scientific conclusions drawn from empirical studies on these benchmarks are hard to trust: the evidence may be biased by contamination and may fail to generalize beyond toy programs due to structural simplicity. To address these problems, we introduce ULT (UnLeakedTestbench), a new benchmark specifically designed for function-level unit test generation from real-world Python functions. ULT is constructed through a multi-stage curation process that ensures high cyclomatic complexity and mitigates test case contamination. With 3,909 carefully selected function-level tasks, ULT provides a more realistic and challenging evaluation of LLMs' test generation capabilities. We also provide PLT (PreLeakedTestbench), a paired benchmark to ULT with leaked tests, designed to enable a controlled analysis of memorization versus reasoning in test generation. Our evaluation results demonstrate that ULT is significantly more challenging: test cases generated by LLMs achieve only 41.32% accuracy, 45.10% statement coverage, 30.22% branch coverage, and 40.21% mutation score on average across all evaluated LLMs. These results are substantially lower than the corresponding metrics on TestEval (91.79%, 92.18%, 82.04%, and 49.69%) and PLT (47.07%, 55.13%, 40.07%, and 50.80%).
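The four metrics in the abstract (accuracy, statement coverage, branch coverage, mutation score) can be measured with standard tooling. Below is a minimal sketch assuming pytest and coverage.py; `target_module` and `generated_tests.py` are hypothetical names, and mutation score would additionally require a mutation tool such as mutmut, which is omitted here.

```python
# Sketch of a per-task measurement loop, not the paper's actual harness.
import coverage
import pytest

# branch=True makes coverage.py track branch arcs as well as statements.
cov = coverage.Coverage(branch=True, source=["target_module"])  # assumed module name
cov.start()
# Functional correctness ("accuracy"): do the LLM-generated tests pass
# against the reference implementation? pytest.main returns 0 iff all pass.
exit_code = pytest.main(["generated_tests.py", "-q"])
cov.stop()
cov.save()

# cov.report() returns the combined coverage percentage as a float.
percent = cov.report()
print(f"all tests passed: {exit_code == 0}, coverage: {percent:.2f}%")
```

In practice the statement and branch percentages would be read separately from the coverage data; the combined figure above is only meant to show where such numbers come from.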
Problem

Research questions and friction points this paper addresses.

Evaluating LLMs for real-world unit test generation
Addressing data contamination in test benchmarks (a leakage-detection sketch follows this list)
Overcoming structural simplicity in function code benchmarks
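One common way to operationalize contamination checks, in the spirit of the leakage detection the paper describes, is n-gram overlap against known test corpora. The sketch below is a generic illustration, not the paper's actual procedure; the 13-token window and 0.8 threshold are assumed values.

```python
# Generic n-gram leakage check; window size and threshold are assumptions.
def ngrams(tokens: list[str], n: int = 13) -> set[tuple[str, ...]]:
    """All contiguous n-token windows of a token sequence."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlap_ratio(candidate: str, corpus_doc: str, n: int = 13) -> float:
    """Fraction of the candidate's n-grams that also occur in a corpus doc."""
    cand = ngrams(candidate.split(), n)
    if not cand:
        return 0.0
    return len(cand & ngrams(corpus_doc.split(), n)) / len(cand)

def is_leaked(candidate: str, corpus: list[str], threshold: float = 0.8) -> bool:
    """Flag a candidate whose code heavily overlaps any known document."""
    return any(overlap_ratio(candidate, doc) >= threshold for doc in corpus)
```

A candidate flagged by such a check would be excluded from the benchmark, since high verbatim overlap with public test code suggests a model could recall its tests rather than derive them.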
Innovation

Methods, ideas, or system contributions that make the work stand out.

ULT benchmark for real-world Python functions
Multi-stage curation enforces high cyclomatic complexity
PLT benchmark decouples memorization from reasoning (see the gap computation after this list)
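The decoupling PLT enables shows up directly in the paper's reported averages: the gap between scores with leaked tests (PLT) and without (ULT) can be read as a memorization signal. The numbers below are taken from the abstract; the gap computation itself is just an illustration.

```python
# Paper-reported averages (from the abstract); gap = PLT - ULT per metric.
METRICS = ["accuracy", "statement_cov", "branch_cov", "mutation_score"]
ULT = dict(zip(METRICS, [41.32, 45.10, 30.22, 40.21]))
PLT = dict(zip(METRICS, [47.07, 55.13, 40.07, 50.80]))

for m in METRICS:
    gap = PLT[m] - ULT[m]
    print(f"{m:>14}: ULT {ULT[m]:5.2f}%  PLT {PLT[m]:5.2f}%  gap {gap:+.2f} pts")
```

Across all four metrics the leaked-test variant scores roughly 6 to 11 points higher, consistent with models partially recalling tests rather than reasoning them out.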