🤖 AI Summary
To address the performance degradation that class imbalance causes in diabetes prediction, this paper proposes CopulaSMOTE, an oversampling method that uses the A2 copula to explicitly model nonlinear dependencies among features, which standard SMOTE does not capture. The authors describe this as the first known use of the A2 copula for data augmentation. Within the SMOTE framework, CopulaSMOTE generates synthetic minority-class samples using the A2 copula so that the dependency structure of the original data is preserved, and it is evaluated with four classifiers: logistic regression, random forest, gradient boosting (GBDT), and XGBoost. On the Pima Indians Diabetes Dataset, XGBoost combined with CopulaSMOTE performs best, improving on standard SMOTE by 4.6% in accuracy, 15.6% in precision, 20.4% in recall, 18.2% in F1-score, and 25.5% in AUC. McNemar’s test is used to confirm the statistical significance of these gains. The work positions copula-based oversampling as an alternative to SMOTE that pairs a grounding in copula theory with empirical gains.
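The idea of dependence-preserving oversampling can be illustrated with a minimal sketch. This is not the paper's implementation: the summary does not give the A2 copula's form, so the sketch swaps in a bivariate Gaussian copula, handles only two features, and uses hypothetical function names and toy values throughout. It estimates the dependence from rank-based normal scores, samples correlated uniforms from the copula, and maps them back through each feature's empirical quantiles.

```python
import random
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal, used for cdf / inverse cdf


def normal_scores(values):
    """Map values to standard-normal scores via their ranks (ties broken by order)."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    scores = [0.0] * n
    for rank, idx in enumerate(order, start=1):
        scores[idx] = _STD_NORMAL.inv_cdf(rank / (n + 1))
    return scores


def pearson(a, b):
    """Plain Pearson correlation of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    var_a = sum((p - mean_a) ** 2 for p in a)
    var_b = sum((q - mean_b) ** 2 for q in b)
    return cov / (var_a * var_b) ** 0.5


def empirical_quantile(sorted_values, u):
    """Linearly interpolated empirical quantile for u in [0, 1]."""
    pos = u * (len(sorted_values) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    frac = pos - lo
    return sorted_values[lo] * (1 - frac) + sorted_values[hi] * frac


def copula_oversample(x, y, n_new, seed=0):
    """Draw n_new synthetic (x, y) pairs from a Gaussian copula fitted to the inputs.

    ASSUMPTION: a Gaussian copula stands in for the A2 copula used in the paper.
    """
    rng = random.Random(seed)
    rho = pearson(normal_scores(x), normal_scores(y))
    rho = max(-1.0, min(1.0, rho))  # guard against floating-point overshoot
    xs, ys = sorted(x), sorted(y)
    synthetic = []
    for _ in range(n_new):
        z1 = rng.gauss(0.0, 1.0)
        z2 = rho * z1 + (1.0 - rho * rho) ** 0.5 * rng.gauss(0.0, 1.0)
        u1, u2 = _STD_NORMAL.cdf(z1), _STD_NORMAL.cdf(z2)
        synthetic.append((empirical_quantile(xs, u1), empirical_quantile(ys, u2)))
    return synthetic


# Toy minority-class values for two hypothetical features (not real data).
feature_a = [float(i) for i in range(1, 21)]
feature_b = [2.0 * i + 3 * ((i * 7) % 5) for i in range(1, 21)]
new_rows = copula_oversample(feature_a, feature_b, n_new=200, seed=1)
```

Because the marginals come from empirical quantiles, synthetic values stay within the observed range of each feature, while the copula parameter carries the dependence between them. A different copula family would change only the sampling step inside `copula_oversample`.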
📝 Abstract
Diabetes mellitus poses a major health risk, affecting nearly 1 in 9 people. Early detection can substantially lower this risk. Despite significant advances in machine learning (ML) for identifying diabetic cases, results can still be distorted by class imbalance in the data. To address this challenge, our study uses copula-based data augmentation, which preserves the dependency structure among features when generating data for the minority class, and integrates it with ML techniques. We selected the Pima Indian dataset, generated minority-class data using the A2 copula, and then applied four ML algorithms: logistic regression, random forest, gradient boosting, and extreme gradient boosting (XGBoost). Our findings indicate that XGBoost combined with A2 copula oversampling achieved the best performance, improving accuracy by 4.6%, precision by 15.6%, recall by 20.4%, F1-score by 18.2%, and AUC by 25.5% compared to the standard SMOTE method. Furthermore, we statistically validated our results using the McNemar test. This research represents the first known use of A2 copulas for data augmentation and serves as an alternative to the SMOTE technique, highlighting the efficacy of copulas as a statistical method in machine learning applications.
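For readers unfamiliar with the McNemar test, the following minimal sketch shows how two classifiers evaluated on the same test set can be compared. The abstract does not say which variant of the test was used; the exact binomial form below is an assumption, and the function name and toy labels are hypothetical. The test looks only at the discordant cases, where exactly one of the two models is correct.

```python
from math import comb


def mcnemar_exact(y_true, pred_a, pred_b):
    """Exact two-sided McNemar test for two classifiers on the same samples.

    Returns (b, c, p_value), where
      b = samples model A classifies correctly and model B does not,
      c = samples model B classifies correctly and model A does not.
    ASSUMPTION: exact binomial variant; the source does not specify one.
    """
    b = sum(1 for t, a, p in zip(y_true, pred_a, pred_b) if a == t and p != t)
    c = sum(1 for t, a, p in zip(y_true, pred_a, pred_b) if a != t and p == t)
    n = b + c
    if n == 0:
        return b, c, 1.0  # no discordant pairs, so no evidence of a difference
    k = min(b, c)
    # Under the null hypothesis, each discordant pair favors A or B with probability 0.5.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)


# Toy labels and predictions (not results from the paper).
y_true = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
model_a = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
model_b = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
b, c, p_value = mcnemar_exact(y_true, model_a, model_b)
```

A small p-value indicates that the two models disagree asymmetrically, that is, one of them is correct on the discordant cases more often than chance would explain.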