CopulaSMOTE: A Copula-Based Oversampling Approach for Imbalanced Classification in Diabetes Prediction

📅 2025-06-18
🤖 AI Summary
To address performance degradation in diabetes prediction caused by class imbalance, this paper proposes CopulaSMOTE, an oversampling method that uses the A2 copula to model the dependency structure among features when generating minority-class samples, something standard SMOTE does not explicitly preserve. The authors describe this as the first known use of the A2 copula for data augmentation. Synthetic minority-class samples are generated with the A2 copula, and performance is evaluated with logistic regression, random forest, gradient boosting (GBDT), and XGBoost. On the Pima Indians Diabetes Dataset, XGBoost combined with A2 copula oversampling performs best, improving on standard SMOTE in accuracy by 4.6%, precision by 15.6%, recall by 20.4%, F1-score by 18.2%, and AUC by 25.5%. The results are statistically validated with McNemar's test. The work positions copula-based augmentation as an alternative to SMOTE for imbalanced learning.

📝 Abstract
Diabetes mellitus poses a significant health risk, as nearly 1 in 9 people are affected by it. Early detection can substantially lower this risk. Despite major advancements in machine learning for identifying diabetic cases, results can still be influenced by the imbalanced nature of the data. To address this challenge, our study uses copula-based data augmentation, which preserves the dependency structure when generating data for the minority class, and integrates it with machine learning (ML) techniques. We selected the Pima Indian dataset and generated data using the A2 copula, then applied four machine learning algorithms: logistic regression, random forest, gradient boosting, and extreme gradient boosting (XGBoost). Our findings indicate that XGBoost combined with A2 copula oversampling achieved the best performance, improving accuracy by 4.6%, precision by 15.6%, recall by 20.4%, F1-score by 18.2%, and AUC by 25.5% compared to the standard SMOTE method. Furthermore, we statistically validated our results using the McNemar test. This research represents the first known use of A2 copulas for data augmentation and serves as an alternative to the SMOTE technique, highlighting the efficacy of copulas as a statistical method in machine learning applications.
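To make the idea of dependency-preserving oversampling concrete, the following is a minimal sketch, not the authors' method. The form of the A2 copula is not given on this page, so the sketch swaps in a Gaussian copula as a stand-in. The function names and the toy two-feature data are hypothetical, and only the Python standard library is used.

```python
import math
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()

def normal_scores(column):
    """Rank-transform a column: u = rank / (n + 1), then z = Phi^{-1}(u)."""
    n = len(column)
    order = sorted(range(n), key=lambda i: column[i])
    scores = [0.0] * n
    for rank, i in enumerate(order, start=1):
        scores[i] = STD_NORMAL.inv_cdf(rank / (n + 1))
    return scores

def correlation_matrix(cols):
    """Pearson correlation matrix of equal-length columns."""
    def corr(a, b):
        ma, mb = sum(a) / len(a), sum(b) / len(b)
        cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
        va = sum((x - ma) ** 2 for x in a)
        vb = sum((y - mb) ** 2 for y in b)
        return cov / math.sqrt(va * vb)
    return [[corr(a, b) for b in cols] for a in cols]

def cholesky(m):
    """Lower-triangular L with L L^T = m (m assumed positive definite)."""
    d = len(m)
    L = [[0.0] * d for _ in range(d)]
    for i in range(d):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(m[i][i] - s) if i == j else (m[i][j] - s) / L[j][j]
    return L

def empirical_quantile(sorted_col, u):
    """Linearly interpolated empirical quantile at level u in [0, 1]."""
    pos = u * (len(sorted_col) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_col) - 1)
    return sorted_col[lo] + (pos - lo) * (sorted_col[hi] - sorted_col[lo])

def gaussian_copula_oversample(rows, n_new, seed=0):
    """Draw n_new synthetic rows whose rank dependence mimics `rows`."""
    rng = random.Random(seed)
    cols = [list(c) for c in zip(*rows)]
    # Dependence is estimated on normal scores, separately from the marginals
    L = cholesky(correlation_matrix([normal_scores(c) for c in cols]))
    sorted_cols = [sorted(c) for c in cols]
    samples = []
    for _ in range(n_new):
        e = [rng.gauss(0.0, 1.0) for _ in cols]
        # Correlated normals z = L e, mapped to uniforms, then to data scale
        z = [sum(L[i][k] * e[k] for k in range(i + 1)) for i in range(len(cols))]
        u = [STD_NORMAL.cdf(v) for v in z]
        samples.append([empirical_quantile(sc, ui) for sc, ui in zip(sorted_cols, u)])
    return samples

# Toy minority-class rows with two features (made-up values, not from the paper)
minority_rows = [[150, 34.0], [180, 24.0], [135, 43.0], [195, 30.5],
                 [125, 31.0], [170, 38.0], [190, 29.0], [165, 26.0]]
synthetic_rows = gaussian_copula_oversample(minority_rows, n_new=20)
```

The design point is the separation of concerns: each feature's marginal distribution comes from its empirical quantiles, while the joint dependence comes from the copula. Replacing the Gaussian copula with another family changes only the sampling of the uniform vector `u`.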
Problem

Research questions and friction points this paper is trying to address.

Addressing imbalanced data in diabetes prediction using CopulaSMOTE.
Preserving dependency structure in minority class data augmentation.
Improving ML performance via copula-based oversampling vs. SMOTE.
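For contrast with the copula approach, the baseline SMOTE step can be sketched as below. This is a simplified illustration of the interpolation idea only, using the standard library. The helper name `smote_sample` and the toy data are hypothetical, and production use would typically rely on an established library implementation.

```python
import math
import random

def smote_sample(minority, k=3, n_new=5, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    point and one of its k nearest minority neighbours (core SMOTE step)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of `base` among the other minority points
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to nb
        synthetic.append([b + gap * (n - b) for b, n in zip(base, nb)])
    return synthetic

# Toy minority class with two features (made-up values, not from the paper)
minority = [[1.0, 2.0], [1.5, 2.5], [2.0, 2.0], [3.0, 4.0], [2.5, 3.5]]
new_points = smote_sample(minority, k=2, n_new=4)
```

Every synthetic point lies on a line segment between two existing minority points, so the joint structure of the features is only captured locally. That is the gap a copula-based sampler aims to fill.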
Innovation

Methods, ideas, or system contributions that make the work stand out.

Copula-based oversampling for imbalanced data
A2 copula integration with XGBoost
Statistical validation via McNemar test
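McNemar's test compares two classifiers on the same test cases by looking only at the cases where they disagree. The sketch below shows the exact (binomial) form of the test using the standard library. Whether the paper used the exact or the chi-squared variant is not stated here, and the function name and toy predictions are hypothetical.

```python
from math import comb

def mcnemar_exact(y_true, pred_a, pred_b):
    """Exact two-sided McNemar test on paired predictions.

    b counts cases model A got right and model B got wrong;
    c counts the reverse. Concordant cases are ignored.
    """
    b = sum(pa == t and pb != t for t, pa, pb in zip(y_true, pred_a, pred_b))
    c = sum(pa != t and pb == t for t, pa, pb in zip(y_true, pred_a, pred_b))
    n = b + c
    if n == 0:
        return b, c, 1.0  # no disagreements, no evidence of a difference
    # Under H0 the discordant pairs split 50/50: b ~ Binomial(n, 0.5)
    tail = sum(comb(n, k) for k in range(min(b, c) + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)

# Toy labels and predictions (made-up values, not from the paper)
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
pred_a = [1, 1, 1, 1, 0, 0, 0, 1, 1, 0]  # e.g. a model trained on copula-augmented data
pred_b = [0, 0, 1, 0, 0, 1, 0, 1, 0, 0]  # e.g. a model trained on SMOTE-augmented data
b, c, p_value = mcnemar_exact(y_true, pred_a, pred_b)
```

A small p-value indicates that the two models' error patterns differ by more than chance would explain. With few discordant pairs, as in this toy example, the exact form is generally preferred over the chi-squared approximation.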
Agnideep Aich
Department of Mathematics, University of Louisiana at Lafayette, Lafayette, Louisiana, USA.
Md Monzur Murshed
Assistant Professor, Minnesota State University, Mankato
Statistical Learning, High Dimensional Data Analysis, Predictive Modeling, Meta Analysis
Sameera Hewage
West Liberty University
Nonparametric Statistics, Statistical Machine Learning, Dependence Modelling, Mathematical Biology
Amanda Mayeaux
Department of Kinesiology, University of Louisiana at Lafayette, Lafayette, Louisiana, USA.