Can Copulas Be Used for Feature Selection? A Machine Learning Study on Diabetes Risk Prediction

📅 2025-05-28
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Traditional feature selection methods (e.g., mutual information, genetic algorithms) overlook the extreme dependence structures prevalent among high-risk diabetic subpopulations. To address this, we propose the first supervised feature selection framework grounded in the upper-tail dependence coefficient λ_U derived from the A2 cosine-type copula. Integrating extreme-value theory with machine learning, our approach pioneers the use of copula-based upper-tail dependence modeling for supervised feature selection, enabling precise identification of clinical variables exhibiting strong joint extreme-risk behavior within high-risk subgroups. Evaluated on a large-scale CDC health dataset (n = 253,680), the method selects five features with high λ_U—BMI and self-rated general health being the primary drivers—achieving up to 86.5% accuracy (XGBoost) and an AUC of up to 0.806 (Gradient Boosting), matching the performance of full-feature models. This yields substantial gains in interpretability and robustness for high-risk subgroup prediction.

📝 Abstract
Accurate diabetes risk prediction relies on identifying key features from complex health datasets, but conventional methods like mutual information (MI) filters and genetic algorithms (GAs) often overlook extreme dependencies critical for high-risk subpopulations. In this study, we introduce a feature-selection framework using the upper-tail dependence coefficient (λ_U) of the novel A2 copula, which quantifies how often extreme higher values of a predictor co-occur with diabetes diagnoses (the target variable). Applied to the CDC Diabetes Health Indicators dataset (n = 253,680), our method prioritizes five predictors (self-reported general health, high blood pressure, body mass index, mobility limitations, and high cholesterol) based on their upper-tail dependence. These features match or outperform MI- and GA-selected subsets across four classifiers (Random Forest, XGBoost, Logistic Regression, Gradient Boosting), achieving accuracy up to 86.5% (XGBoost) and AUC up to 0.806 (Gradient Boosting), rivaling the full 21-feature model. Permutation importance confirms their clinical relevance, with BMI and general health driving accuracy. To our knowledge, this is the first work to apply a copula's upper-tail dependence to supervised feature selection, bridging extreme-value theory and machine learning to deliver a practical toolkit for diabetes prevention.
Problem

Research questions and friction points this paper is trying to address.

Identifying key features for diabetes risk prediction
Using a copula's upper-tail dependence for feature selection
Improving accuracy with selected features in classifiers
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses the A2 copula's upper-tail dependence coefficient
Prioritizes features based on extreme dependencies
Achieves high accuracy with fewer features
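The selection criterion above — scoring each predictor by how strongly its extreme high values co-occur with the diagnosis, then keeping the top scorers — can be sketched nonparametrically. The paper derives λ_U in closed form from the fitted A2 copula; the sketch below instead uses a rank-based empirical estimate of λ_U as an illustrative stand-in, and helper names such as `select_top_k` are hypothetical, not from the paper:

```python
import numpy as np

def pseudo_obs(a):
    """Rank-transform a sample to (0, 1) pseudo-observations."""
    n = len(a)
    return (np.argsort(np.argsort(a)) + 1) / (n + 1)

def empirical_lambda_u(x, y, u=0.95):
    """Empirical upper-tail dependence: P(F_Y(Y) > u | F_X(X) > u).

    Note: the paper computes lambda_U analytically from the A2 copula;
    this rank-based estimator is a simplified stand-in used only to
    illustrate the selection criterion.
    """
    ux, uy = pseudo_obs(x), pseudo_obs(y)
    tail = ux > u            # observations where the predictor is extreme
    if tail.sum() == 0:
        return 0.0
    return float(np.mean(uy[tail] > u))

def select_top_k(X, y, k=5, u=0.95):
    """Score each column of X by empirical lambda_U against y; keep top k."""
    scores = np.array([empirical_lambda_u(X[:, j], y, u)
                       for j in range(X.shape[1])])
    top = np.argsort(scores)[::-1][:k]   # indices of the k highest scores
    return top, scores
```

With the CDC dataset, `X` would hold the 21 health-indicator columns and `y` the diabetes label; note that a binary target induces heavy ties in the ranks, which the paper's parametric copula treatment handles more gracefully than this empirical sketch.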
Agnideep Aich
Department of Mathematics, University of Louisiana at Lafayette, Lafayette, Louisiana, USA
Monzur Murshed
Department of Mathematics and Statistics, Minnesota State University, Mankato, Minnesota, USA
Amanda Mayeaux
Department of Kinesiology, University of Louisiana at Lafayette, Lafayette, Louisiana, USA
Sameera Hewage
West Liberty University
Nonparametric Statistics · Statistical Machine Learning · Dependence Modelling · Mathematical Biology