🤖 AI Summary
Medical AI models deployed clinically often exhibit substantial performance disparities across patient subgroups (e.g., by race, sex, or socioeconomic status) because real-world training data are noisy, imbalanced, and incomplete, which risks exacerbating existing health inequities. To address this, we propose a subgroup-sensitive paradigm for medical AI development that integrates fairness throughout the modeling lifecycle, tightly couples transparency with accountability, and shifts evaluation from aggregate accuracy to subgroup-aware decision frameworks. Using a multitask ICU prediction and diagnosis benchmark, we conduct subgroup decomposition analysis, bias attribution visualization, and clinical feasibility assessment across multiple real-world datasets. Our analysis reveals significant subgroup performance gaps (AUC differences exceeding 0.25). We introduce actionable risk-alert metrics and develop an operational framework for pre-deployment fairness review, comprising standardized assessment protocols, interpretable diagnostics, and clinical validation criteria, to support equitable, deployable AI in healthcare.
📝 Abstract
Machine learning (ML) models are increasingly used to support clinical decision-making. However, real-world medical datasets are often noisy, incomplete, and imbalanced, leading to performance disparities across patient subgroups. These disparities raise fairness concerns, particularly when they reinforce existing disadvantages for marginalized groups. In this work, we analyze several medical prediction tasks and demonstrate how model performance varies with patient characteristics. Even when ML models achieve good overall performance, we argue that subgroup-level evaluation is essential before integrating them into clinical workflows. A performance analysis at the subgroup level makes such differences clearly identifiable, allowing, on the one hand, performance disparities to be taken into account in clinical practice and, on the other, these insights to inform the responsible development of more effective models. In this way, our work contributes to a practical discussion of the subgroup-sensitive development and deployment of medical ML models and of the interconnectedness of fairness and transparency.
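To make the subgroup-level evaluation described above concrete, here is a minimal sketch that computes a per-subgroup AUC and flags large gaps before deployment. It is an illustration under assumptions, not the paper's actual pipeline: the data, column names (`y_true`, `y_score`, `sex`), and the 0.05 tolerance are hypothetical placeholders.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score

def subgroup_auc(df, group_col, label_col="y_true", score_col="y_score"):
    """Compute AUC separately for each patient subgroup, skipping groups
    where AUC is undefined (only one outcome class present)."""
    aucs = {}
    for name, group in df.groupby(group_col):
        if group[label_col].nunique() == 2:
            aucs[name] = roc_auc_score(group[label_col], group[score_col])
    return pd.Series(aucs, name="auc")

# Illustrative synthetic predictions; in practice these would come from
# a trained clinical model evaluated on a held-out test set.
rng = np.random.default_rng(0)
n = 1000
df = pd.DataFrame({
    "y_true": rng.integers(0, 2, n),
    "y_score": rng.random(n),
    "sex": rng.choice(["female", "male"], n),
})

aucs = subgroup_auc(df, group_col="sex")
gap = aucs.max() - aucs.min()
print(aucs)
if gap > 0.05:  # tolerance is an illustrative choice, not from the paper
    print(f"Warning: subgroup AUC gap of {gap:.3f} warrants review")
```

The same decomposition extends to other metrics (sensitivity, calibration) and to intersectional groups by passing a combined grouping column; the key design point is that the gap, not only the aggregate score, becomes a gating criterion before clinical integration.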