🤖 AI Summary
This study addresses the prediction of residual (post-release) defects in Python systems, whose dynamic nature complicates testing and lets some faults evade detection. The authors build a balanced dataset of over 4,000 labeled faults described by 83 product, process, statistical, and Python-specific metrics plus normalized code representations, and run cross-project experiments on it. LLMs and unsupervised models fail to separate residual from non-residual faults, whereas supervised metric-based models (RandomForest, XGBoost, CatBoost) reach a recall of 0.85–0.9 and cut false negatives by an order of magnitude. The most predictive features are process metrics (age, churn, and developer activity) together with class and file size. A principal component analysis further indicates that metrics and code embeddings occupy distinct regions of the representation space, suggesting that they carry complementary rather than redundant information.
📝 Abstract
Python's dynamic nature complicates testing and increases the chance that some defects evade detection, which makes effective fault prediction essential. We examine whether post-release (residual) faults can be predicted using modern machine learning (ML) and deep learning (DL). Using a balanced dataset of over 4,000 labeled faults with 83 product, process, statistical, and Python-specific metrics plus normalized code representations, we conduct cross-project experiments. LLMs and unsupervised models fail to distinguish residual from non-residual faults, while supervised metric-based models (RandomForest, XGBoost, CatBoost) perform far better, yielding a recall of 0.85-0.9 and cutting false negatives by an order of magnitude. Process metrics, especially age, churn, and developer activity, alongside class and file size, consistently prove most predictive. Notably, Principal Component Analysis (PCA) shows that metrics and code embeddings occupy distinct regions of the representation space, suggesting that they capture complementary rather than redundant information.
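To make the evaluation terms concrete, the following is a minimal sketch of how a cross-project split and the recall figure could be computed. It is not the authors' pipeline: the abstract does not specify the splitting protocol, so the leave-one-project-out scheme, the `(project, features, label)` tuple layout, and both function names are assumptions made for illustration only.

```python
from collections import defaultdict


def leave_one_project_out(samples):
    """Yield (held_out_project, train, test) splits.

    `samples` is a list of (project, features, label) tuples; this layout
    is a hypothetical one chosen for the sketch, not the paper's format.
    """
    by_project = defaultdict(list)
    for sample in samples:
        by_project[sample[0]].append(sample)
    for held_out in sorted(by_project):
        test = by_project[held_out]
        # Train on every project except the held-out one.
        train = [s for project, group in by_project.items()
                 if project != held_out for s in group]
        yield held_out, train, test


def recall_and_false_negatives(y_true, y_pred, positive=1):
    """Return (recall, false_negative_count) for the positive class.

    Here the positive class stands for "residual fault" (an assumption).
    """
    tp = sum(1 for t, p in zip(y_true, y_pred)
             if t == positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred)
             if t == positive and p != positive)
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return recall, fn
```

Because recall is TP / (TP + FN), the two reported results are linked: for a fixed number of residual faults, raising recall lowers the false-negative count by the same amount. Any classifier could be trained on `train` and scored on `test` with the second function; holding out one whole project at a time keeps the score from being inflated by project-specific patterns seen during training.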