LLMs for Automated Unit Test Generation and Assessment in Java: The AgoneTest Framework

📅 2025-11-25

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This paper addresses the lack of automated, standardized methods for evaluating LLM-generated Java unit tests. The authors propose AgoneTest, the first end-to-end framework for evaluating LLM-based test generation in Java. Its core contributions are: (1) Classes2Test, a benchmark dataset built specifically for evaluating test generation; (2) a multidimensional evaluation pipeline that integrates compilation success rate, line and branch coverage, mutation score (MS), and test smells; and (3) empirical evidence that prompting strategies significantly affect test quality. Among the generated tests that compile, the best-performing LLM detects more mutants than manually written tests and reaches 92.3% line coverage, which supports the practical potential of LLMs for industrial-grade Java unit test generation.

📝 Abstract

Unit testing is an essential but resource-intensive step in software development, ensuring individual code units function correctly. This paper introduces AgoneTest, an automated evaluation framework for Java unit tests generated by Large Language Models (LLMs). AgoneTest does not aim to propose a novel test generation algorithm; rather, it supports researchers and developers in comparing different LLMs and prompting strategies through a standardized end-to-end evaluation pipeline under realistic conditions. We introduce the Classes2Test dataset, which maps Java classes under test to their corresponding test classes, and a framework that integrates advanced evaluation metrics, such as mutation score and test smells, for a comprehensive assessment. Experimental results show that, for the subset of tests that compile, LLM-generated tests can match or exceed human-written tests in terms of coverage and defect detection. Our findings also demonstrate that enhanced prompting strategies contribute to test quality. AgoneTest clarifies the potential of LLMs in software testing and offers insights for future improvements in model design, prompt engineering, and testing practices.

Problem

Research questions and friction points this paper is trying to address.

Evaluating LLM-generated Java unit tests through a standardized assessment framework

Comparing different LLMs and prompting strategies for test generation quality

Assessing test effectiveness using mutation scores and coverage metrics
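
To make the two headline metrics concrete, here is a minimal Java sketch using their textbook definitions: mutation score as killed mutants over total mutants, and line coverage as covered lines over total lines. The class and method names are hypothetical and do not come from AgoneTest. Real tools may also refine the denominators, for example by excluding equivalent mutants.

```java
// Hypothetical helper, not part of AgoneTest: textbook metric definitions only.
final class TestMetrics {
    private TestMetrics() {}

    /** Fraction of generated mutants that the test suite kills (detects). */
    static double mutationScore(int killedMutants, int totalMutants) {
        return ratio(killedMutants, totalMutants);
    }

    /** Fraction of executable lines that the test suite executes. */
    static double lineCoverage(int coveredLines, int totalLines) {
        return ratio(coveredLines, totalLines);
    }

    // Shared guard: rejects empty denominators and impossible counts.
    private static double ratio(int part, int whole) {
        if (whole <= 0 || part < 0 || part > whole) {
            throw new IllegalArgumentException(
                "expected 0 <= part <= whole and whole > 0, got " + part + "/" + whole);
        }
        return (double) part / whole;
    }

    public static void main(String[] args) {
        // Illustrative counts only; these are not results from the paper.
        System.out.printf("mutation score = %.2f%n", mutationScore(8, 10));
        System.out.printf("line coverage  = %.2f%n", lineCoverage(45, 60));
    }
}
```

A test suite can reach high line coverage while killing few mutants, which is why the two numbers are reported separately.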

Innovation

Methods, ideas, or system contributions that make the work stand out.

Automated evaluation framework for LLM-generated Java tests

Standardized pipeline comparing LLMs and prompting strategies

Mutation score and test smells integrated into the assessment
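
The comparison across LLMs and prompting strategies can be pictured as a simple aggregation over per-test results. The sketch below is an assumption about how such a comparison might look, not AgoneTest's actual data model: the `Result` record, its fields, and the model and prompt labels are all hypothetical placeholders. It follows the abstract in computing quality metrics only over tests that compile. It needs Java 16 or later for records.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical sketch, not AgoneTest code: groups results by model and prompt.
class StrategyComparison {
    /** One generated test class, evaluated once. All field names are assumptions. */
    record Result(String model, String prompt, boolean compiled, double mutationScore) {}

    /** Share of generated test classes that compile, per "model/prompt" key. */
    static Map<String, Double> compileRate(List<Result> results) {
        return results.stream().collect(Collectors.groupingBy(
            r -> r.model() + "/" + r.prompt(),
            Collectors.averagingDouble(r -> r.compiled() ? 1.0 : 0.0)));
    }

    /** Mean mutation score per key, computed only over tests that compiled. */
    static Map<String, Double> meanMutationScore(List<Result> results) {
        return results.stream()
            .filter(Result::compiled)
            .collect(Collectors.groupingBy(
                r -> r.model() + "/" + r.prompt(),
                Collectors.averagingDouble(Result::mutationScore)));
    }

    public static void main(String[] args) {
        // Placeholder data for illustration; not figures from the paper.
        List<Result> results = List.of(
            new Result("model-a", "zero-shot", true, 0.50),
            new Result("model-a", "zero-shot", false, 0.0),
            new Result("model-a", "few-shot", true, 0.75),
            new Result("model-a", "few-shot", true, 0.25));
        System.out.println(compileRate(results));
        System.out.println(meanMutationScore(results));
    }
}
```

Reporting the compile rate next to the quality metrics matters here, because a configuration whose few compilable tests score well could otherwise look better than it is.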

🔎 Similar Papers

Large-scale, Independent and Comprehensive study of the power of LLMs for test case generation