Predicting Hospitalization from a Whole-Person Health Score with Incomplete Electronic Health Records Data: A Case Study

📅 2026-06-08

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study addresses the pervasive issue of missing health indicators in electronic health records (EHRs) by evaluating hospitalization risk prediction approaches that account for missing data, using EHR data from 1,000 patients at a large academic health system. It compares several strategies for handling missing components of the allostatic load index (ALI), a whole-person health score built from ten components across three body systems, and models hospitalization with logistic regression and random forest, adjusting for age and sex. Logistic regression outperformed random forest. Summary measures of the ALI performed similarly to one another, with the complete-case proportion performing best (AUC 0.64). When the ALI components were used separately, pattern-specific submodels reached an in-sample AUC of 0.73 but a cross-validated AUC of only 0.63, so the in-sample advantage over summary measures did not hold up under cross-validation. The work offers an empirical starting point for embedding a whole-person health score into EHR systems.

📝 Abstract

Embedding a standardized whole-person health measure in electronic health records (EHR) could be instrumental to preventative care. The allostatic load index (ALI), calculated from ten component stressors across three body systems, offers a promising snapshot of holistic health. The ALI can be calculated from EHR data, but many components are missing, since not all patients undergo all tests. Using statistical modeling and machine learning, EHR data for $1000$ patients from a large academic health system were used to predict in-patient hospitalization (as a count or binary) from ALI, controlling for age and sex. Various methods were evaluated to fill in information gaps for patients' missing ALI components, including summary measures combining components or using them separately. Performance was measured using receiver operating characteristic (ROC) curves and corresponding areas under the ROC curve (AUC). Count modeling of hospitalization did not improve upon binary, and logistic regression beat random forest. Overall, summary measures performed similarly, with the complete-case proportion (i.e., the proportion of non-missing components that were "unhealthy") performing best (AUC $= 0.64$) but by $\leq 0.01$. When using components separately, the pattern submodel approach most accurately predicted hospitalization (AUC $= 0.73$) in sample, but did not cross-validate as well (AUC $= 0.63$). All summary measures performed similarly. However, when including the ALI components separately, tailoring models to subsets of patients with the same missing data pattern performed best. Next steps include EHR implementation to enable prediction and support clinician decision-making at scale.
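The complete-case proportion described in the abstract can be illustrated with a minimal sketch. It assumes each ALI component has already been turned into an "unhealthy" flag, with `None` marking a component that was not measured. The function name, the flag encoding, and the example patient are hypothetical, and the clinical thresholds that produce the flags are not given here.

```python
from typing import Optional, Sequence


def complete_case_proportion(flags: Sequence[Optional[bool]]) -> Optional[float]:
    """Proportion of observed (non-missing) components flagged unhealthy.

    `flags` holds one entry per ALI component: True (unhealthy),
    False (healthy), or None (component not measured).
    Returns None when no component was observed at all.
    """
    observed = [f for f in flags if f is not None]
    if not observed:
        return None
    # True counts as 1 and False as 0, so sum() counts unhealthy components.
    return sum(observed) / len(observed)


# Hypothetical patient: ten components, four of them missing.
patient = [True, False, None, True, None, False, False, None, True, None]
score = complete_case_proportion(patient)
print(score)  # 3 unhealthy out of 6 observed
```

Because the denominator is the number of observed components rather than ten, patients with different amounts of missing data are placed on the same 0 to 1 scale.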

Problem

Research questions and friction points this paper is trying to address.

hospitalization prediction

incomplete EHR data

allostatic load index

whole-person health score

missing data

Innovation

Methods, ideas, or system contributions that make the work stand out.

allostatic load index

missing data pattern

pattern submodel
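
As a rough illustration of the pattern submodel idea named above, the sketch below shows only the first step: partitioning patients by which ALI components are observed. Under this approach, a separate model (for example, a logistic regression on the observed components plus age and sex) would then be fit within each group. The component names, record layout, and helper functions are hypothetical and not taken from the paper.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Sequence, Tuple

Record = Dict[str, Optional[float]]


def missingness_pattern(record: Record, components: Sequence[str]) -> Tuple[bool, ...]:
    """Return a tuple marking which components are observed (True) or missing (False)."""
    return tuple(record.get(name) is not None for name in components)


def group_by_pattern(
    records: Sequence[Record], components: Sequence[str]
) -> Dict[Tuple[bool, ...], List[Record]]:
    """Partition records so each group shares the same missing data pattern."""
    groups: Dict[Tuple[bool, ...], List[Record]] = defaultdict(list)
    for record in records:
        groups[missingness_pattern(record, components)].append(record)
    return dict(groups)


# Hypothetical component names and values, for illustration only.
components = ["comp_a", "comp_b", "comp_c"]
records = [
    {"comp_a": 1.2, "comp_b": None, "comp_c": 0.4},
    {"comp_a": 0.9, "comp_b": None, "comp_c": 0.7},
    {"comp_a": 1.1, "comp_b": 2.0, "comp_c": None},
]

groups = group_by_pattern(records, components)
for pattern, members in groups.items():
    print(pattern, len(members))
```

Rare patterns yield small groups, which is one plausible reason an in-sample AUC can exceed its cross-validated counterpart.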