🤖 AI Summary
This study addresses a limitation of the OpenSSF Scorecard’s Maintained metric: it reflects only the most recent 90 days of activity and thus offers no predictive signal for deprecation risk in open-source dependencies. To enable proactive risk assessment, the work formulates Maintained-score prediction as a multivariate time series forecasting task, using three years of historical data reconstructed for 3,220 GitHub repositories associated with the most central PyPI libraries. The authors propose four target representations and systematically evaluate VARMA, Random Forest, and LSTM models across training windows of 3–12 months and forecast horizons of 1–6 months. The results show that simple models can match or exceed deep learning approaches, reaching classification accuracies above 0.95 for maintenance status levels and above 0.80 for trend types, thereby offering a practical foundation for anticipating maintenance risks in open-source projects.
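The four target representations mentioned above can be illustrated with a short sketch. The bucket thresholds (3 and 7 on the 0–10 Maintained scale) and the slope tolerance of 0.1 are illustrative assumptions, not values taken from the paper:

```python
import numpy as np

def target_representations(scores, bucket_edges=(3, 7), tol=0.1):
    """Derive four target representations from a monthly Maintained-score
    series (values in 0..10). Thresholds are assumed for illustration."""
    scores = np.asarray(scores, dtype=float)
    # 1) raw scores: the series itself
    raw = scores
    # 2) bucketed maintenance levels (0 = low, 1 = medium, 2 = high)
    levels = np.digitize(scores, bucket_edges)
    # 3) numerical trend slope via a least-squares linear fit
    slope = np.polyfit(np.arange(len(scores)), scores, 1)[0]
    # 4) categorical trend type from the slope sign (assumed tolerance)
    if slope > tol:
        trend = "increasing"
    elif slope < -tol:
        trend = "decreasing"
    else:
        trend = "stable"
    return raw, levels, slope, trend

raw, levels, slope, trend = target_representations([10, 8, 6, 4, 2, 0])
print(trend, list(levels))  # a steadily declining repository
```

Aggregating raw scores into levels and trend types in this way is what makes the classification targets coarser, and plausibly easier to predict, than the raw series.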
📝 Abstract
The OpenSSF Scorecard is widely used to assess the security posture of open-source software repositories, with the Maintained metric indicating recent development activity and helping identify potentially abandoned dependencies. However, this metric is inherently retrospective, reflecting only the past 90 days of activity and providing no insight into future maintenance, which limits its usefulness for proactive risk assessment. In this paper, we study to what extent future maintenance activity, as captured by the OpenSSF Maintained score, can be forecasted. We analyze 3,220 GitHub repositories associated with the top 1% most central PyPI libraries by PageRank and reconstruct historical Maintained scores over a three-year period. We formulate the task as multivariate time series forecasting and consider four target representations: raw scores, bucketed maintenance levels, numerical trend slopes, and categorical trend types. We compare a statistical model (VARMA), a machine learning model (Random Forest), and a deep learning model (LSTM) across training windows of 3-12 months and forecasting horizons of 1-6 months. Our results show that future maintenance activity can be predicted with meaningful accuracy, particularly for aggregated representations such as bucketed scores and trend types, achieving accuracies above 0.95 and 0.80, respectively. Simpler statistical and machine learning models perform on par with deep learning approaches, indicating that complex architectures are not required. These findings suggest that predictive modeling can effectively complement existing Scorecard metrics, enabling more proactive assessment of open-source maintenance risks.