🤖 AI Summary
Current evaluation of medical time-series imputation models relies heavily on the random missingness assumption, overlooking clinically prevalent non-random, structured missingness patterns—leading to assessments that poorly reflect real-world clinical utility. Method: Leveraging the PhysioNet Challenge 2012 dataset, we systematically benchmark 11 state-of-the-art imputation methods—including RNN-, GAN-, and Transformer-based approaches—and introduce a clinically informed masking strategy to jointly evaluate imputation accuracy and downstream mortality prediction performance. Contributions/Results: (1) We provide the first empirical evidence that imputation accuracy does not necessarily correlate with clinical prediction AUC; several high-accuracy models fail to improve—or even degrade—mortality prediction; (2) RNN-based models demonstrate superior robustness under structured missingness; (3) Optimized clinical masking improves mortality prediction AUC by up to 3.2%. This work shifts imputation evaluation from a purely technical paradigm toward one grounded in clinical utility.
📝 Abstract
This study investigates the impact of masking strategies on time series imputation models in healthcare settings. While current approaches predominantly rely on random masking for model evaluation, this practice fails to capture the structured nature of missing patterns in clinical data. Using the PhysioNet Challenge 2012 dataset, we analyse how different masking implementations affect both imputation accuracy and downstream clinical predictions across eleven imputation methods. Our results demonstrate that masking choices significantly influence model performance, while recurrent architectures show more consistent performance across strategies. Analysis of downstream mortality prediction reveals that imputation accuracy doesn't necessarily translate to optimal clinical prediction capabilities. Our findings emphasise the need for clinically-informed masking strategies that better reflect real-world missing patterns in healthcare data, suggesting current evaluation frameworks may need reconsideration for reliable clinical deployment.