Comparative Evaluation of Explainable Machine Learning Versus Linear Regression for Predicting County-Level Lung Cancer Mortality Rate in the United States

📅 2025-11-01
🏛️ JCO Clinical Cancer Informatics
🤖 AI Summary
Predicting county-level lung cancer mortality in the U.S., and interpreting what drives it, is difficult because risk factors may act nonlinearly and vary across regions. Method: The study compares two explainable machine learning models, random forest (RF) and gradient boosting regression (GBR), with linear regression (LR) for predicting age-adjusted lung cancer mortality rates across U.S. counties. SHAP values quantify each variable's importance and the direction of its effect, while Getis-Ord Gi* hotspot analysis identifies spatial clusters of elevated mortality. Contribution/Results: RF outperforms both GBR and LR, reaching R² = 41.9% and RMSE = 12.8. Smoking rate is the most important predictor, followed by median home value and the percentage of the Hispanic population. Spatial analysis reveals significant clusters of elevated mortality in mid-eastern counties. By pairing explainable ML with spatial statistics at the county level, the study avoids the linearity assumption of conventional regression and offers actionable insights for targeted public health interventions.
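
The two metrics quoted in the summary, R² and RMSE, can be computed as in the sketch below. This is only an illustration of the standard definitions: the function names and the five observed and predicted values are made up for the example and are not data from the paper.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the target."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Hypothetical mortality rates for five made-up counties (not study data).
observed = [40.0, 50.0, 60.0, 70.0, 80.0]
predicted = [45.0, 50.0, 55.0, 75.0, 75.0]

print(f"RMSE = {rmse(observed, predicted):.2f}")      # RMSE = 4.47
print(f"R^2  = {r_squared(observed, predicted):.1%}")  # R^2  = 90.0%
```

RMSE is expressed in the units of the mortality rate, while R² is the share of variance in the observed rates that the predictions account for, which is why the summary reports it as a percentage.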

📝 Abstract
PURPOSE Lung cancer (LC) is a leading cause of cancer-related mortality in the United States. Accurate prediction of LC mortality rates is crucial for guiding targeted interventions and addressing health disparities. Although traditional regression-based models have been commonly used, explainable machine learning models may offer enhanced predictive accuracy and deeper insights into the factors influencing LC mortality.

METHODS This study applied three models, namely random forest (RF), gradient boosting regression (GBR), and linear regression (LR), to predict county-level LC mortality rates across the United States. Model performance was evaluated using R-squared and root mean squared error (RMSE). Shapley Additive Explanations (SHAP) values were used to determine variable importance and their directional impact. Geographic disparities in LC mortality were analyzed through Getis-Ord (Gi*) hotspot analysis.

RESULTS The RF model outperformed both GBR and LR, achieving an R² value of 41.9% and an RMSE of 12.8. SHAP analysis identified smoking rate as the most important predictor, followed by median home value and the percentage of the Hispanic ethnic population. Spatial analysis revealed significant clusters of elevated LC mortality in the mid-eastern counties of the United States.

CONCLUSION The RF model demonstrated superior predictive performance for LC mortality rates, emphasizing the critical roles of smoking prevalence, housing values, and the percentage of Hispanic ethnic population. These findings offer valuable actionable insights for designing targeted interventions, promoting screening, and addressing health disparities in regions most affected by LC in the United States.
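
The idea behind the SHAP values mentioned in the abstract can be illustrated with an exact Shapley computation on a tiny model. The sketch below is not the authors' pipeline: it assumes a single baseline point in place of a background dataset, enumerates every feature coalition (feasible only for a handful of features), and uses a made-up `toy_model` with three generic inputs. Practical SHAP tooling relies on faster, model-specific algorithms.

```python
from itertools import combinations
from math import factorial

def shapley_values(predict, x, baseline):
    """Exact Shapley values for one instance x relative to a baseline point.

    Features outside a coalition are set to their baseline values. The cost
    is exponential in the number of features, so this suits toy cases only.
    """
    n = len(x)

    def value(coalition):
        z = [x[i] if i in coalition else baseline[i] for i in range(n)]
        return predict(z)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Shapley weight for coalitions of this size that exclude feature i.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi

def toy_model(z):
    """Hypothetical model: two linear terms plus one interaction (made up)."""
    return 10.0 + 2.0 * z[0] - 1.0 * z[1] + z[0] * z[2]

instance = [3.0, 4.0, 2.0]   # made-up feature values
baseline = [0.0, 0.0, 0.0]   # made-up reference point

phi = shapley_values(toy_model, instance, baseline)
print([round(p, 6) for p in phi])  # [9.0, -4.0, 3.0]
```

The sign of each value gives the direction of a feature's contribution for this instance, and the values sum to the gap between the prediction and the baseline prediction (18 minus 10 here). The interaction term is split evenly between the two features involved, which is how a Shapley attribution handles effects that a purely linear coefficient cannot express.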
Problem

Research questions and friction points this paper is trying to address.

Predict county-level lung cancer mortality rates in the US
Compare explainable machine learning with linear regression models
Identify key factors and geographic disparities in mortality
Innovation

Methods, ideas, or system contributions that make the work stand out.

Used random forest and gradient boosting regression models
Applied SHAP values for variable importance and impact
Employed Getis-Ord hotspot analysis for geographic disparities
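
The Getis-Ord Gi* hotspot statistic listed above can be sketched as follows. This is a minimal illustration of the standard formula, not the authors' implementation: the six rates are invented, and the weights assume a hypothetical row of counties in which each county neighbors only the ones directly beside it (and itself, as the Gi* variant requires).

```python
import math

def getis_ord_gi_star(values, weights):
    """Gi* z-scores; weights[i][j] is the spatial weight, with weights[i][i] > 0."""
    n = len(values)
    mean = sum(values) / n
    s = math.sqrt(sum(v * v for v in values) / n - mean * mean)
    scores = []
    for i in range(n):
        w = weights[i]
        sum_w = sum(w)
        sum_w2 = sum(wij * wij for wij in w)
        # Local weighted sum compared with what the global mean would predict.
        numerator = sum(wij * v for wij, v in zip(w, values)) - mean * sum_w
        denominator = s * math.sqrt((n * sum_w2 - sum_w ** 2) / (n - 1))
        scores.append(numerator / denominator)
    return scores

# Made-up mortality rates for six hypothetical counties arranged in a line.
rates = [50.0, 48.0, 20.0, 22.0, 21.0, 19.0]
n = len(rates)
# Binary contiguity weights: a county neighbors itself and adjacent counties.
weights = [[1.0 if abs(i - j) <= 1 else 0.0 for j in range(n)] for i in range(n)]

for county, z in enumerate(getis_ord_gi_star(rates, weights)):
    print(f"county {county}: Gi* z-score = {z:+.2f}")
```

Each score is read as a z-score: large positive values mark a county surrounded by high rates (a hotspot) and large negative values mark a cold spot. In this toy layout the first county scores positive and the last scores negative. A real analysis would derive the weights from county geometry and typically correct for multiple testing.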