UTFix: Change Aware Unit Test Repairing using LLM

📅 2025-03-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
In software evolution, unit tests frequently break due to changes in the tested methods—manifesting as assertion failures or reduced coverage—thereby compromising system reliability. This paper presents the first systematic study of automatic unit test repair for Python projects under evolutionary changes. We propose a context-aware, large language model (LLM)-based repair method that integrates static and dynamic code slicing with failure-specific assertion information to construct precise, minimal repair contexts. Additionally, we introduce Tool-Bench, the first synthetic benchmark specifically designed for evaluating test evolution repair. Experimental results show that our approach achieves an 89.2% assertion failure repair rate on Tool-Bench, with full coverage restoration in 96 out of 369 tests; on real-world project benchmarks, it repairs 60% of assertion failures and fully restores coverage in 19 out of 30 tests. Key contributions include: (1) the first end-to-end framework for test evolution repair, (2) the first dedicated synthetic benchmark (Tool-Bench), and (3) a context-enhanced, LLM-driven repair paradigm.
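The summary describes building a minimal repair context from the changed focal method, the failing test, and the assertion failure message, then handing it to an LLM. A rough sketch of that assembly step is below; the function name and prompt format are illustrative assumptions, not UTFix's actual implementation:

```python
def build_repair_context(focal_method_src, test_src, failure_message):
    # Assemble a minimal repair prompt from the three ingredients the
    # paper's approach feeds to the LLM. (Hypothetical format -- not
    # the prompt UTFix itself uses.)
    return (
        "The focal method was changed:\n"
        f"{focal_method_src}\n\n"
        "The failing unit test:\n"
        f"{test_src}\n\n"
        "Failure message:\n"
        f"{failure_message}\n\n"
        "Repair the test so its assertions match the new behavior."
    )

# Toy example of a focal-method change that broke a test.
focal = "def total(items, tax=0.1):\n    return sum(items) * (1 + tax)"
test = "def test_total():\n    assert total([10, 10]) == 20"
failure = "AssertionError: assert 22.0 == 20"
prompt = build_repair_context(focal, test, failure)
```

In the real system the focal-method source would be narrowed by static and dynamic slicing rather than passed whole, keeping the context precise and small.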

📝 Abstract
Software updates, including bug repair and feature additions, are frequent in modern applications, but they often leave test suites outdated, resulting in undetected bugs and increased chances of system failures. A recent study by Meta revealed that 14%-22% of software failures stem from outdated tests that fail to reflect changes in the codebase. This highlights the need to keep tests in sync with code changes to ensure software reliability. In this paper, we present UTFix, a novel approach for repairing unit tests when their corresponding focal methods undergo changes. UTFix addresses two critical issues: assertion failure and reduced code coverage caused by changes in the focal method. Our approach leverages language models to repair unit tests by providing contextual information such as static code slices, dynamic code slices, and failure messages. We evaluate UTFix on our generated synthetic benchmark (Tool-Bench) and real-world benchmarks. Tool-Bench includes diverse changes from popular open-source Python GitHub projects, where UTFix successfully repaired 89.2% of assertion failures and achieved 100% code coverage for 96 tests out of 369 tests. On the real-world benchmarks, UTFix repairs 60% of assertion failures while achieving 100% code coverage for 19 out of 30 unit tests. To the best of our knowledge, this is the first comprehensive study focused on unit test repair in evolving Python projects. Our contributions include the development of UTFix, the creation of Tool-Bench and real-world benchmarks, and the demonstration of the effectiveness of LLM-based methods in addressing unit test failures due to software evolution.
Problem

Research questions and friction points this paper is trying to address.

Repair outdated unit tests after code changes
Address assertion failures and code coverage issues
Leverage language models for unit test repair
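The core friction point is concrete: a focal method evolves, and a test whose assertion encoded the old behavior starts failing even though the new code is correct. A minimal self-contained illustration (the `total` functions are hypothetical, not from the paper):

```python
# Before the change: total() summed items with no tax.
def total_v1(items):
    return sum(items)

# After the change: a tax parameter was added (hypothetical evolution).
def total_v2(items, tax=0.1):
    return round(sum(items) * (1 + tax), 2)

# The original assertion matches v1 but breaks against v2 ...
assert total_v1([10, 10]) == 20
outdated = False
try:
    assert total_v2([10, 10]) == 20
except AssertionError:
    outdated = True

# ... so the repaired test must encode the new behavior instead.
assert total_v2([10, 10]) == 22.0
```

This is the "assertion failure" case; the companion problem is reduced coverage, where new branches in the focal method are never exercised by the old test.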
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverages language models for unit test repair
Uses static and dynamic code slices for context
Achieves high repair rates and code coverage
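One ingredient listed above, the dynamic code slice, can be approximated by recording which lines of the focal method actually execute for a given test input. The tracer below is a crude stand-in for that idea, using Python's standard `sys.settrace` hook; it is a sketch, not UTFix's slicer:

```python
import sys

def traced_lines(fn, *args, **kwargs):
    # Record which source lines of fn execute, as offsets from its
    # `def` line -- a minimal stand-in for dynamic slicing.
    executed = set()
    code = fn.__code__

    def tracer(frame, event, arg):
        if frame.f_code is code and event == "line":
            executed.add(frame.f_lineno - code.co_firstlineno)
        return tracer

    sys.settrace(tracer)
    try:
        fn(*args, **kwargs)
    finally:
        sys.settrace(None)
    return executed

def classify(x):          # offset 0
    if x > 0:             # offset 1
        return "pos"      # offset 2
    return "nonpos"       # offset 3

# Only the branch actually taken shows up in the dynamic slice.
lines = traced_lines(classify, 5)
```

Lines of the focal method missing from such a trace are exactly the coverage gaps that a repaired or augmented test would need to close.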