Large-scale, Independent and Comprehensive study of the power of LLMs for test case generation

📅 2024-06-28
🏛️ arXiv.org
📈 Citations: 8
Influential: 0
🤖 AI Summary
Prior work on LLM-based Java unit test generation lacks execution-driven evaluation, real-world validation, and maintainability analysis. Method: This study conducts a large-scale, independent, multidimensional assessment of LLMs, evaluating four instruction-tuned models (including GPT and Mistral) and five prompting strategies across 690 real-world Java classes, generating 216,300 test cases. The authors propose an evaluation framework integrating correctness, readability, code coverage (via JaCoCo), and test smell detection (via static analysis). Contribution/Results: LLM-generated tests outperform EvoSuite in semantic correctness and readability and achieve higher average line coverage, though branch coverage remains suboptimal. Common test smells, including duplicate assertions and hardcoded values, are prevalent. The work establishes an empirical benchmark for LLM-driven test automation and identifies concrete avenues for improvement.

📝 Abstract
Unit testing, crucial for ensuring the reliability of code modules such as classes and methods, is often overlooked by developers due to time constraints. Automated test generation techniques have emerged to address this, but they frequently lack readability and require significant developer intervention. Large Language Models (LLMs), such as GPT and Mistral, have shown promise in software engineering tasks, including test generation, but their overall effectiveness remains unclear. This study presents an extensive investigation of LLMs, evaluating the effectiveness of four models and five prompt engineering techniques for unit test generation. We analyze 216,300 tests generated by the selected advanced instruction-tuned LLMs for 690 Java classes collected from diverse datasets. Our evaluation considers correctness, understandability, coverage, and test smell detection in the generated tests, comparing them to a widely used automated testing tool, EvoSuite. While LLMs demonstrate potential, improvements in test quality, particularly in reducing common test smells, are necessary. This study highlights the strengths and limitations of LLM-generated tests compared to traditional methods, paving the way for further research on LLMs in test automation.
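To make the two test smells named in the paper concrete, here is a minimal, self-contained Java sketch; the class, method names, and values are illustrative assumptions, not taken from the study's benchmark, and plain runtime checks stand in for a JUnit dependency:

```java
public class TestSmellDemo {
    // Hypothetical unit under test; name and behavior are illustrative only.
    static int add(int a, int b) {
        return a + b;
    }

    // Minimal stand-in for a JUnit assertion, so the example needs no dependency.
    static void check(boolean condition, String message) {
        if (!condition) throw new AssertionError(message);
    }

    public static void main(String[] args) {
        // Smell 1: duplicate assertion -- the second check repeats the first
        // exactly, adding maintenance cost but no new coverage.
        check(add(2, 2) == 4, "2 + 2 should be 4");
        check(add(2, 2) == 4, "2 + 2 should be 4 (duplicate)");

        // Smell 2: hardcoded value -- 86400 is a magic number; a named
        // constant such as SECONDS_PER_DAY would document intent.
        check(add(86399, 1) == 86400, "one more second completes a day");

        System.out.println("all checks passed");
    }
}
```

Static-analysis smell detectors of the kind the study uses typically flag both patterns syntactically: repeated identical assertion expressions within one test, and numeric literals appearing directly in assertions.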
Problem

Research questions and friction points this paper is trying to address.

Evaluates LLMs for generating reliable unit tests at the class level
Compares LLM-generated tests with search-based methods across datasets
Analyzes the impact of prompting techniques on test quality and reliability
Innovation

Methods, ideas, or system contributions that make the work stand out.

Large-scale evaluation of LLM-generated unit tests
Systematic analysis of four state-of-the-art models
Five prompting techniques assessed for test generation