Bogus Bugs, Duplicates, and Revealing Comments: Data Quality Issues in NPR

📅 2025-03-11
🤖 AI Summary
This paper identifies pervasive data quality issues in mainstream Automatic Program Repair (APR) benchmarks, including bogus bugs, sample duplication, and leaky annotations, that lead to severe overestimation of model performance. The authors systematically characterize and name canonical data contamination patterns in APR for the first time, and propose a data-centric evaluation and purification framework that covers data provenance analysis, redundancy detection, semantic consistency verification, manual annotation validation, and benchmark re-cleaning. Experimental results reveal that up to 12% of reported defects across multiple public datasets are spurious, while widespread duplication inflates repair accuracy by over 23%. The work releases an open-source, reproducible toolkit for diagnosing APR dataset quality and advocates a shift in APR research from model-centricity to data trustworthiness as the primary concern.

📝 Abstract
The performance of a machine learning system is determined not only by the model but also, to a substantial degree, by the data it is trained on. With the increasing use of machine learning, data quality has also become a concern in automated program repair (APR) research. In this position paper, we report some of the data-related issues we have come across when working with several large APR datasets and benchmarks, including, for instance, duplicates and "bogus bugs". We briefly discuss the potential impact of these problems on repair performance and propose possible remedies. We believe that more data-focused approaches could improve the performance and robustness of current and future APR systems.

Problem

Research questions and friction points this paper is trying to address.

Identifies data quality issues in automated program repair (APR) datasets.
Highlights problems like duplicates and bogus bugs affecting repair performance.
Proposes data-focused solutions to enhance APR system robustness.

Innovation

Methods, ideas, or system contributions that make the work stand out.

Identifies data quality issues in APR datasets.
Proposes remedies for duplicates and bogus bugs.
Advocates data-focused approaches for APR systems.
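
To make the duplicate problem concrete, here is a minimal sketch of what a dataset-level check might look like. It is not the authors' tooling. It assumes each sample is a `(buggy, fixed)` pair of source strings, and the names `normalize`, `find_duplicates`, and `is_no_op_fix` are hypothetical. Treating a pair whose two sides are identical after whitespace normalization as suspicious is an assumption made for this example, not the paper's definition of a "bogus bug".

```python
import hashlib
import re
from collections import defaultdict


def normalize(code: str) -> str:
    """Collapse runs of whitespace so formatting-only differences are ignored."""
    return re.sub(r"\s+", " ", code).strip()


def find_duplicates(samples):
    """Return groups of sample indices whose normalized (buggy, fixed) pairs match.

    `samples` is assumed to be a sequence of (buggy, fixed) string pairs.
    """
    groups = defaultdict(list)
    for idx, (buggy, fixed) in enumerate(samples):
        # A NUL separator keeps ("ab", "c") distinct from ("a", "bc").
        payload = normalize(buggy) + "\x00" + normalize(fixed)
        key = hashlib.sha256(payload.encode("utf-8")).hexdigest()
        groups[key].append(idx)
    return [ids for ids in groups.values() if len(ids) > 1]


def is_no_op_fix(buggy: str, fixed: str) -> bool:
    """Flag pairs where the 'fix' changes nothing but whitespace (assumed heuristic)."""
    return normalize(buggy) == normalize(fixed)


samples = [
    ("int a = 1;", "int a = 2;"),
    ("int  a = 1;", "int a = 2;"),  # same pair as index 0, different spacing
    ("x = 1", "x  = 1"),            # whitespace-only "fix"
]
duplicate_groups = find_duplicates(samples)
no_op_indices = [i for i, (b, f) in enumerate(samples) if is_no_op_fix(b, f)]
```

Exact matching after whitespace normalization only catches the simplest duplicates. Near-duplicates that differ in identifiers or comments would need token-level or similarity-based comparison, which this sketch does not attempt.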