🤖 AI Summary
This study investigates whether pre-release internal software metrics can predict post-deployment popularity—measured by user ratings and annual download counts—of mobile applications.
Method: Leveraging 446 open-source Android apps, we extract multi-granular static features—including system-, class-, and method-level code metrics, code smells, permission usage, and metadata—and employ both regression and binary classification modeling. Regression yields limited predictive performance; thus, we recast the task as a high/low popularity binary classification. We refine the feature set via ensemble-based (voting) feature selection and apply a multilayer perceptron (MLP) classifier.
Contribution/Results: Our approach achieves an F1-score of 0.72, providing evidence that pre-release static code characteristics carry systematic signal about app popularity. This challenges the conventional assumption that internal metrics poorly reflect user-perceived quality, and points toward early-stage quality assessment and market-performance forecasting.
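The high/low reframing described above can be sketched as a simple threshold split. The split point below is a hypothetical median cutoff for illustration; the study's exact thresholding rule and label definitions are not specified here.

```python
# Hypothetical sketch: recasting a skewed popularity measure
# (e.g., DownloadsPerYear) as a binary Popular / Unpopular label
# via a median split. The threshold choice is illustrative only.
def median_split(values):
    """Return 1 (Popular) for values above the median, else 0 (Unpopular)."""
    s = sorted(values)
    n = len(s)
    median = s[n // 2] if n % 2 else (s[n // 2 - 1] + s[n // 2]) / 2
    return [1 if v > median else 0 for v in values]

# Toy yearly download counts for six hypothetical apps:
downloads_per_year = [120, 5000, 80, 45000, 300, 9000]
labels = median_split(downloads_per_year)
# → [0, 1, 0, 1, 0, 1]
```

A median split keeps the two classes balanced, which sidesteps the heavy skew that undermined the regression models.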
📝 Abstract
Predicting mobile app popularity before release can provide developers with a strategic advantage in a competitive marketplace, yet it remains a challenging problem. This study explores whether internal software metrics, measurable from source code before deployment, can predict an app's popularity, defined by user ratings (calculated from user reviews) and DownloadsPerYear (yearly downloads). Using a dataset of 446 open-source Android apps from F-Droid, we extract a wide array of features, including system-, class-, and method-level code metrics, code smells, and app metadata. Additional information, such as user reviews, download counts, and uses-permission, was collected from the Google Play Store. We evaluate regression and classification models across three feature sets: a minimal Size-only baseline, a domain-informed Handpicked set, and a Voting set derived via feature selection algorithms. Regression models perform poorly due to skewed data, with low $R^2$ scores. However, when reframed as binary classification (Popular vs. Unpopular), results improve significantly. The best model, a Multilayer Perceptron using the Voting set, achieves an F1-score of 0.72. These results suggest that internal code metrics, although limited in their explanatory power, can serve as useful indicators of app popularity. This challenges earlier findings that dismissed internal metrics as predictors of software quality.
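The Voting feature set described above can be sketched as a majority vote across several feature-selection algorithms: each selector proposes a subset, and a feature survives only if enough selectors pick it. The selector outputs and metric names (LOC, WMC, CBO, etc.) below are hypothetical placeholders, not the paper's actual configuration.

```python
# Hypothetical sketch of ensemble ("Voting") feature selection:
# a feature is kept only if a strict majority of selectors vote for it.
def vote_features(selections, min_votes=None):
    """selections: list of feature-name sets, one per selection algorithm."""
    if min_votes is None:
        min_votes = len(selections) // 2 + 1  # strict majority by default
    votes = {}
    for chosen in selections:
        for feature in chosen:
            votes[feature] = votes.get(feature, 0) + 1
    return sorted(f for f, v in votes.items() if v >= min_votes)

# Top picks from three hypothetical selection algorithms:
sel_a = {"LOC", "WMC", "NumPermissions"}
sel_b = {"LOC", "CBO", "NumPermissions"}
sel_c = {"LOC", "WMC", "CodeSmells"}
voted = vote_features([sel_a, sel_b, sel_c])
# → ['LOC', 'NumPermissions', 'WMC']
```

The resulting consensus subset would then feed the MLP classifier; voting across selectors reduces the chance that any single algorithm's bias dominates the chosen features.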