AI Summary
This study addresses widespread concerns about methodological rigor and reproducibility in software defect prediction research, where inadequate experimental design and insufficient reporting undermine the credibility of findings. In the first large-scale systematic audit of 101 papers published between 2019 and 2023, we employed bibliometric analysis, a structured experimental design evaluation framework, and the reproducibility assessment instrument of González-Barahona and Robles to evaluate compliance with best practices in statistical methods, machine learning implementation, and result reporting. Our analysis reveals a median of four methodological issues per paper, with only one study free of identified issues. Nearly half of the examined works omit details necessary for replication, and we also found preliminary evidence of possible paper mill activity. These findings provide empirical grounding and actionable directions for improving research rigor in the field.
Abstract
Background: Machine learning algorithms are widely used to predict defect-prone software components. In this literature, computational experiments are the main means of evaluation, and the credibility of results depends on experimental design and reporting.

Objective: This paper audits recent software defect prediction (SDP) studies by assessing their experimental design, analysis, and reporting practices against accepted norms from statistics, machine learning, and empirical software engineering. The aim is to characterise current practice and assess the reproducibility of published results.

Method: We audited SDP studies indexed in SCOPUS between 2019 and 2023, focusing on design and analysis choices such as outcome measures, out-of-sample validation strategies, and the use of statistical inference. Nine types of study issue were evaluated. Reproducibility was assessed using the instrument proposed by González-Barahona and Robles.

Results: The search identified approximately 1,585 SDP experiments published during the period. From these, we randomly sampled 101 papers, comprising 61 journal and 40 conference publications, with almost 50 percent behind paywalls. We observed substantial variation in research practice. The number of datasets ranged from 1 to 365, learners or learner variants from 1 to 34, and performance measures from 1 to 9. About 45 percent of studies applied formal statistical inference. Across the sample, we identified 427 issues, with a median of four per paper, and only one paper without issues. Reproducibility ranged from near complete to severely limited. We also identified two cases of tortured phrases and possible paper mill activity.

Conclusions: Experimental design and reporting practices vary widely, and almost half of the studies provide insufficient detail to support reproduction. The audit indicates substantial scope for improvement.