🤖 AI Summary
To address the lack of predictive uncertainty quantification in deep learning models intended for clinical deployment, this paper proposes a high-diversity, 9-member deep ensemble framework, adopted after Monte Carlo Dropout proved poorly calibrated, that enables fine-grained decomposition of predictive uncertainty into its aleatoric and epistemic components on the NIH ChestX-ray14 multi-label thoracic disease diagnosis task. The method substantially improves model calibration and interpretability: Expected Calibration Error (ECE) drops to 0.0728, Negative Log-Likelihood (NLL) to 0.1916, and mean epistemic uncertainty to 0.0240. It simultaneously achieves an average AUROC of 0.8559 and F1-score of 0.3857, attaining state-of-the-art performance and reliability. This work represents the first systematic realization of high-accuracy uncertainty modeling for multi-label chest X-ray diagnosis, delivering a clinically deployable technical pathway toward trustworthy decision support systems.
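The ECE figure reported above measures how far predicted confidences deviate from empirical accuracy. As a minimal illustration (not the paper's implementation), a standard binned ECE for binary per-label predictions can be sketched as follows; the function name and binning choice are assumptions:

```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins=15):
    """Binned ECE for binary predictions.

    Groups predictions by confidence, then averages the gap between
    mean confidence and empirical accuracy, weighted by bin size.
    """
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels, dtype=float)
    # Confidence in the predicted class and whether that prediction is correct
    conf = np.where(probs >= 0.5, probs, 1.0 - probs)
    correct = (probs >= 0.5) == (labels == 1)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (conf > lo) & (conf <= hi)
        if mask.any():
            ece += mask.mean() * abs(correct[mask].mean() - conf[mask].mean())
    return ece
```

For a multi-label task such as ChestX-ray14, this would typically be computed per disease label and averaged, matching the "Mean ECE" phrasing in the abstract.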
📝 Abstract
The utility of deep learning models such as CheXNet in high-stakes clinical settings is fundamentally constrained by their purely deterministic nature, which fails to provide reliable measures of predictive confidence. This project addresses this critical gap by integrating robust Uncertainty Quantification (UQ) into a high-performance diagnostic platform for 14 common thoracic diseases on the NIH ChestX-ray14 dataset. Initial architectural development using Monte Carlo Dropout (MCD) failed to stabilize performance and calibration, yielding an unacceptable Expected Calibration Error (ECE) of 0.7588. This technical failure necessitated a rigorous architectural pivot to a high-diversity, 9-member Deep Ensemble (DE). The resulting DE stabilized performance and delivered superior reliability, achieving a state-of-the-art (SOTA) average Area Under the Receiver Operating Characteristic Curve (AUROC) of 0.8559 and an average F1 score of 0.3857. Crucially, the DE demonstrated superior calibration (mean ECE of 0.0728 and Negative Log-Likelihood (NLL) of 0.1916) and enabled the reliable decomposition of total uncertainty into its aleatoric (irreducible data noise) and epistemic (reducible model knowledge) components, with a mean epistemic uncertainty (EU) of 0.0240. These results establish the Deep Ensemble as a trustworthy and explainable platform, transforming the model from a probabilistic tool into a reliable clinical decision support system.
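The decomposition described above has a standard information-theoretic form for ensembles: the entropy of the averaged prediction is the total uncertainty, the average of the member entropies is the aleatoric part, and their difference (the mutual information between prediction and model) is the epistemic part. A minimal sketch for per-label sigmoid outputs follows; the array shapes and function names are illustrative assumptions, not the paper's code:

```python
import numpy as np

def binary_entropy(p, eps=1e-12):
    """Entropy of a Bernoulli distribution with success probability p."""
    p = np.clip(p, eps, 1.0 - eps)
    return -(p * np.log(p) + (1.0 - p) * np.log(1.0 - p))

def decompose_uncertainty(member_probs):
    """Split total predictive uncertainty into aleatoric and epistemic parts.

    member_probs: array of shape (M, N, L) holding per-member sigmoid
    outputs for M ensemble members, N images, and L disease labels.
    Total = H[mean_m p_m]; aleatoric = mean_m H[p_m];
    epistemic = total - aleatoric (the mutual information).
    """
    mean_p = member_probs.mean(axis=0)
    total = binary_entropy(mean_p)                         # entropy of the mean
    aleatoric = binary_entropy(member_probs).mean(axis=0)  # mean of entropies
    epistemic = total - aleatoric                          # member disagreement
    return total, aleatoric, epistemic
```

When all members agree, the epistemic term vanishes and only data noise remains; when members disagree, the epistemic term grows, which is the signal a clinical system can use to flag cases for human review.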