ASTER: Natural and Multi-language Unit Test Generation with LLMs

📅 2024-09-04
📈 Citations: 2
Influential: 0
🤖 AI Summary
To address critical challenges in multilingual (Java/Python) unit test generation—including poor test readability, low coverage, and frequent compilation failures—this paper proposes a static-analysis-driven large language model (LLM) testing paradigm. The method integrates multilingual abstract syntax tree (AST) parsing, environment mocking, coverage-guided prompt engineering, and static-analysis feedback to impose structured constraints on LLM outputs and enable iterative refinement. Evaluated on large-scale industrial Java/Python codebases and standard benchmarks, the approach achieves branch coverage competitive with or surpassing state-of-the-art (SOTA) tools; notably, it is among the first to demonstrate effectiveness in Python. A user study (N=161) confirms that the generated tests are significantly more natural, readable, and stylistically consistent with human-written tests.

📝 Abstract
Implementing automated unit tests is an important but time-consuming activity in software development. To assist developers in this task, many techniques for automating unit test generation have been developed. However, despite this effort, usable tools exist for very few programming languages. Moreover, studies have found that automatically generated tests suffer poor readability and do not resemble developer-written tests. In this work, we present a rigorous investigation of how large language models (LLMs) can help bridge the gap. We describe a generic pipeline that incorporates static analysis to guide LLMs in generating compilable and high-coverage test cases. We illustrate how the pipeline can be applied to different programming languages, specifically Java and Python, and to complex software requiring environment mocking. We conducted an empirical study to assess the quality of the generated tests in terms of code coverage and test naturalness -- evaluating them on standard as well as enterprise Java applications and a large Python benchmark. Our results demonstrate that LLM-based test generation, when guided by static analysis, can be competitive with, and even outperform, state-of-the-art test-generation techniques in coverage achieved while also producing considerably more natural test cases that developers find easy to understand. We also present the results of a user study, conducted with 161 professional developers, that highlights the naturalness characteristics of the tests generated by our approach.
Problem

Research questions and friction points this paper is trying to address.

Automated Unit Testing
Multi-language Environment
Low Coverage
Innovation

Methods, ideas, or system contributions that make the work stand out.

ASTER
Large Language Models
Static Analysis for Code Coverage
Rangeet Pan
Staff Research Scientist, IBM Research, Yorktown Heights
Software Engineering · Programming Language · Large Language Models
Myeongsoo Kim
Georgia Tech, Atlanta, GA, 30332, USA
Rahul Krishna
IBM Research, Yorktown Heights, NY, 10598, USA
Raju Pavuluri
IBM Research, Yorktown Heights, NY, 10598, USA
Saurabh Sinha
IBM Research, Yorktown Heights, NY, 10598, USA