Imputation of Unknown Missingness in Sparse Electronic Health Records

📅 2026-02-24

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study addresses missing data in sparse electronic health records (EHRs), where it is often ambiguous whether a medical event was truly absent or merely unrecorded. Conventional imputation methods mostly target known missing values, such as unavailable lab test results, and do not explicitly handle this “unknown unknowns” case. The work proposes a general-purpose, Transformer-based denoising neural network whose output is thresholded adaptively to recover unrecorded values in sparse binary EHR data. On a real EHR dataset, the method denoises medical codes more accurately than existing imputation approaches, and the denoised data improve performance on downstream tasks. In particular, it achieves a statistically significant improvement over all existing baselines when predicting hospital readmission.

📝 Abstract

Machine learning holds great promise for advancing the field of medicine, with electronic health records (EHRs) serving as a primary data source. However, EHRs are often sparse and contain missing data due to various challenges and limitations in data collection and sharing between healthcare providers. Existing techniques for imputing missing values predominantly focus on known unknowns, such as missing or unavailable values of lab test results; most do not explicitly address situations where it is difficult to distinguish what is missing. For instance, a missing diagnosis code in an EHR could signify either that the patient has not been diagnosed with the condition or that a diagnosis was made but not shared by a provider. Such situations fall into the paradigm of unknown unknowns. To address this challenge, we develop a general-purpose algorithm for denoising data to recover unknown missing values in binary EHRs. We design a transformer-based denoising neural network where the output is thresholded adaptively to recover values in cases where we predict data are missing. Our results demonstrate improved accuracy in denoising medical codes within a real EHR dataset compared to existing imputation approaches, and the denoised data lead to increased performance on downstream tasks. In particular, when applied to a real-world application, predicting hospital readmission from EHRs, our method achieves a statistically significant improvement over all existing baselines.
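
The thresholding step described in the abstract can be illustrated with a minimal sketch. The abstract does not specify how the threshold is adapted, so the rule used here (choosing, from a small candidate grid, the threshold that maximizes F1 on held-out data) is an assumption, as are the names `pick_threshold` and `recover`. The `scores` values stand in for the per-code probabilities a denoising network might output; no transformer is implemented here.

```python
from typing import List, Sequence


def f1_score(truth: Sequence[int], pred: Sequence[int]) -> float:
    """F1 for binary labels; returns 0.0 when there are no positives at all."""
    tp = sum(1 for t, p in zip(truth, pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(truth, pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(truth, pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def pick_threshold(scores: Sequence[float], truth: Sequence[int],
                   candidates: Sequence[float]) -> float:
    """Hypothetical selection rule: keep the candidate with the best held-out F1."""
    best_t, best_f1 = candidates[0], -1.0
    for t in candidates:
        pred = [1 if s >= t else 0 for s in scores]
        score = f1_score(truth, pred)
        if score > best_f1:  # ties keep the earlier (lower) candidate
            best_t, best_f1 = t, score
    return best_t


def recover(observed: Sequence[int], scores: Sequence[float],
            threshold: float) -> List[int]:
    """Keep every recorded 1; flip a 0 to 1 only if its score clears the threshold."""
    return [1 if x == 1 or s >= threshold else 0
            for x, s in zip(observed, scores)]


# Toy held-out data (made up for illustration, not from the paper).
val_scores = [0.9, 0.2, 0.6, 0.1, 0.4]
val_truth = [1, 0, 1, 0, 0]
threshold = pick_threshold(val_scores, val_truth, [0.3, 0.5, 0.65, 0.8])

# One patient's recorded codes and the assumed model scores for each code.
recovered = recover([1, 0, 0, 0], [0.1, 0.7, 0.2, 0.5], threshold)
print(threshold, recovered)
```

Recorded codes are never removed in this sketch: only zeros, which may be either true absences or unrecorded events, are candidates for recovery. Whether the original method treats recorded ones the same way is not stated in the abstract.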

Problem

Research questions and friction points this paper is trying to address.

missing data

unknown unknowns

electronic health records

data sparsity

imputation

Innovation

Methods, ideas, or system contributions that make the work stand out.

unknown missingness

EHR imputation

transformer-based denoising