AI Summary
This study systematically identifies four critical limitations of public chest X-ray datasets (e.g., MIMIC-CXR, CheXpert) in AI applications: (1) high label noise due to automated labeling's inability to capture negation and uncertainty, leading to substantial disagreement with expert annotations; (2) poor cross-dataset generalization stemming from strong domain shift; (3) inadequate population representativeness, which degrades performance for underrepresented age and sex subgroups; and (4) weak clinical relevance of standard evaluation practices. To address these, we propose the first cross-dataset domain-shift quantification framework, leveraging source classifiers to detect dataset-level bias, and introduce double-blind radiologist annotation to rigorously validate label quality. Experiments show a mean 21.3% drop in AUPRC on external cohorts; source-classifier accuracy reaches 98.7%; expert label corrections exceed 35%; and subgroup analysis confirms significant fairness deficits. Our work advances a clinically trustworthy, fair, and generalizable evaluation paradigm for chest X-ray AI.
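As a concrete illustration of the source-classifier probe described above, here is a minimal PyTorch sketch: a classifier is trained to predict which dataset an image came from, and accuracy far above chance indicates strong dataset-level signatures (scanner, preprocessing, population). The backbone choice, the four-source setup, and the loader interface are illustrative assumptions, not the study's actual implementation.

```python
# Minimal sketch (assumptions noted inline): probe for dataset-level bias by
# training a classifier to predict an image's source dataset. High accuracy
# means the datasets carry detectable domain signatures.
import torch
import torch.nn as nn
from torchvision import models

NUM_SOURCES = 4  # assumed: e.g. MIMIC-CXR, CheXpert, ChestX-ray14, PadChest

def build_source_classifier() -> nn.Module:
    """ResNet-18 (illustrative backbone) repurposed to predict the source dataset."""
    net = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
    net.fc = nn.Linear(net.fc.in_features, NUM_SOURCES)
    return net

@torch.no_grad()
def source_accuracy(model: nn.Module, loader) -> float:
    """Fraction of images whose source dataset is predicted correctly.
    Accuracy far above chance (25% for four sources) signals dataset bias."""
    model.eval()
    correct, total = 0, 0
    for images, source_labels in loader:  # assumed: loader yields (image, dataset-id)
        preds = model(images).argmax(dim=1)
        correct += (preds == source_labels).sum().item()
        total += source_labels.numel()
    return correct / total
```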
Abstract
Artificial intelligence has shown significant promise in chest radiography, where deep learning models can approach radiologist-level diagnostic performance. Progress has been accelerated by large public datasets such as MIMIC-CXR, ChestX-ray14, PadChest, and CheXpert, which provide hundreds of thousands of images with pathology labels. However, these datasets also present important limitations. Automated label extraction from radiology reports introduces errors, particularly in handling uncertainty and negation, and radiologist review frequently disagrees with the assigned labels. In addition, domain shift and population bias restrict model generalisability, while evaluation practices often overlook clinically meaningful measures. We conduct a systematic analysis of these challenges, focusing on label quality, dataset bias, and domain shift. Our cross-dataset domain-shift evaluation across multiple model architectures reveals substantial external performance degradation, with pronounced reductions in AUPRC and F1 scores relative to internal testing. To assess dataset bias, we train a source-classification model that distinguishes datasets with near-perfect accuracy, and we perform subgroup analyses showing reduced performance for minority age and sex groups. Finally, expert review by two board-certified radiologists identifies significant disagreement with public dataset labels. Our findings highlight important clinical weaknesses of current benchmarks and emphasise the need for clinician-validated datasets and fairer evaluation frameworks.
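To make the evaluation protocol concrete, the following minimal Python sketch computes the internal-versus-external AUPRC/F1 comparison and per-subgroup AUROC using scikit-learn. The function names, the decision threshold, and the array inputs are assumptions for illustration, not the paper's actual code.

```python
# Minimal sketch, assuming per-cohort arrays of binary labels (y_true) and
# model scores (y_score). Compares internal vs. external test performance
# and breaks AUROC down by subgroup to surface fairness gaps.
import numpy as np
from sklearn.metrics import average_precision_score, f1_score, roc_auc_score

def evaluate(y_true: np.ndarray, y_score: np.ndarray, threshold: float = 0.5) -> dict:
    """AUPRC and F1 for one cohort (threshold of 0.5 is an assumption)."""
    return {
        "auprc": average_precision_score(y_true, y_score),
        "f1": f1_score(y_true, y_score >= threshold),
    }

def external_drop(internal: dict, external: dict) -> float:
    """Relative AUPRC degradation when moving from internal to external testing."""
    return (internal["auprc"] - external["auprc"]) / internal["auprc"]

def subgroup_auroc(y_true: np.ndarray, y_score: np.ndarray, groups: np.ndarray) -> dict:
    """Per-subgroup AUROC (e.g. age bands or sex) for fairness analysis."""
    out = {}
    for g in np.unique(groups):
        mask = groups == g
        if y_true[mask].min() != y_true[mask].max():  # AUROC needs both classes
            out[g] = roc_auc_score(y_true[mask], y_score[mask])
    return out
```

In this framing, the reported mean external AUPRC drop corresponds to averaging `external_drop` over pathology tasks and external cohorts.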