Conformal Risk Prediction for Non-Alcoholic Fatty Liver Disease Using Gradient Boosting with Distribution-Free Coverages

📅 2026-05-31
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses the lack of reliable individual risk calibration and coverage guarantees in existing screening tools for non-alcoholic fatty liver disease (NAFLD). The authors propose a NAFLD risk prediction model that combines gradient-boosted decision trees with conformal prediction, giving distribution-free coverage guarantees on individual risk estimates. To improve interpretability, they use mutual-information-based stability selection to identify a robust set of clinical features. The model achieves AUROCs of 0.912 and 0.891 in the internal and external validation cohorts, respectively. At a nominal 90% coverage level, it attains an empirical coverage of 91.3%. Individuals classified as high-risk show a 12-month disease progression rate 4.7 times that of the low-risk group, which supports an interpretable three-tier risk stratification.
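For reference, the guarantee usually meant by "distribution-free coverage" in split conformal prediction can be written as below. This is the standard textbook statement, assuming the calibration points and the new test point are exchangeable; the summary does not say which conformal variant the authors use, and the symbols (n calibration points, miscoverage level \alpha, prediction set \hat{C}_\alpha) are notation introduced here for illustration.

$$
\Pr\big( Y_{n+1} \in \hat{C}_\alpha(X_{n+1}) \big) \;\ge\; 1 - \alpha
$$

With \alpha = 0.10 this is the nominal 90% level, and the reported 91.3% empirical coverage is consistent with it. The guarantee is marginal: it holds on average over patients, not separately for each individual or subgroup.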
📝 Abstract
Non-alcoholic fatty liver disease (NAFLD) affects roughly 25% of adults worldwide and poses substantial hepatic and cardiovascular risks, yet population-level screening tools remain inadequate. We present Method, a machine-learning framework for NAFLD risk prediction that couples gradient-boosted decision trees with conformal prediction to yield calibrated individual risk estimates with distribution-free coverage guarantees. A mutual-information-based stability selection procedure, run over bootstrap resamples, identifies a compact, clinically interpretable feature subset. Conformal prediction then constructs prediction sets whose marginal coverage is provably at least a user-specified confidence level. We evaluated Method on a multicenter cohort from Guangzhou, China (primary n=2,187; external validation n=412) using 78 candidate features spanning demographics, metabolic biomarkers, and lifestyle factors. Method achieves an AUROC of 0.912 internally and 0.891 externally, outperforming deep neural networks, TabNet, support vector machines, and logistic regression. Conformal prediction sets achieve 91.3% empirical coverage at the 90% nominal level. A three-tier risk stratification derived from these scores separates the population into distinct groups, with the high-risk tier showing a 12-month progression rate 4.7 times that of the low-risk tier. The selected features (notably waist circumference, ALT, GGT, triglycerides, fasting glucose, and BMI) align with established metabolic risk factors, which supports the biological plausibility of the model.
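To make the coupling of gradient boosting and conformal prediction concrete, here is a minimal sketch of split conformal classification in Python. It is not the authors' implementation: the abstract does not specify the conformal variant, the nonconformity score, or the boosting library, so this sketch assumes scikit-learn's `GradientBoostingClassifier`, the common score "one minus the predicted probability of the true class", and synthetic data standing in for the cohort. The helper names `conformal_threshold` and `prediction_sets` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split


def conformal_threshold(cal_probs, cal_labels, alpha):
    """Split-conformal score threshold for miscoverage level alpha.

    Assumes labels are integers 0..K-1 matching the probability columns.
    """
    n = len(cal_labels)
    # Nonconformity score: one minus the probability given to the true class.
    scores = np.sort(1.0 - cal_probs[np.arange(n), cal_labels])
    # Finite-sample corrected rank: the ceil((n + 1)(1 - alpha))-th smallest score.
    k = int(np.ceil((n + 1) * (1.0 - alpha)))
    # If the rank exceeds n, only the trivial (all-classes) set is valid.
    return np.inf if k > n else scores[k - 1]


def prediction_sets(probs, threshold):
    """Boolean matrix: entry (i, k) is True if class k is in the set for sample i."""
    return (1.0 - probs) <= threshold


# Synthetic stand-in data (an assumption; NOT the paper's Guangzhou cohort).
X, y = make_classification(n_samples=3000, n_features=20, n_informative=6,
                           random_state=0)
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.5, random_state=0)
X_cal, X_test, y_cal, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=0)

# Fit the base model on the training split only.
model = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)

# Calibrate on held-out data, then build prediction sets for the test split.
alpha = 0.10  # nominal 90% coverage
threshold = conformal_threshold(model.predict_proba(X_cal), y_cal, alpha)
sets = prediction_sets(model.predict_proba(X_test), threshold)

coverage = sets[np.arange(len(y_test)), y_test].mean()
mean_size = sets.sum(axis=1).mean()
print(f"empirical coverage: {coverage:.3f}, mean set size: {mean_size:.2f}")
```

The calibration split must be disjoint from the data used to fit the model; otherwise the scores are optimistically small and the coverage guarantee no longer applies. Empirical coverage on a finite test set fluctuates around the nominal level, which is in line with the abstract reporting 91.3% at a 90% target.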
Problem

Research questions and friction points this paper is trying to address.

Non-alcoholic Fatty Liver Disease
Risk Prediction
Population Screening
Conformal Prediction
Machine Learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Conformal Prediction
Gradient Boosting
Distribution-Free Coverage
Stability Selection
Risk Stratification