Enhancing Multi-Label Thoracic Disease Diagnosis with Deep Ensemble-Based Uncertainty Quantification

📅 2025-11-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the lack of predictive uncertainty quantification in deep learning models intended for clinical deployment, this paper proposes a high-diversity 9-member deep ensemble, adopted after an initial Monte Carlo Dropout approach failed to calibrate (ECE of 0.7588). The ensemble decomposes predictive uncertainty into aleatoric and epistemic components on the NIH ChestX-ray14 multi-label thoracic disease diagnosis task. It is well calibrated, with a mean Expected Calibration Error (ECE) of 0.0728, a Negative Log-Likelihood (NLL) of 0.1916, and a mean epistemic uncertainty of 0.0240. It achieves an average AUROC of 0.8559 and an average F1 score of 0.3857, which the authors report as state of the art. The authors present the ensemble as a trustworthy basis for clinical decision support.

📝 Abstract
The utility of deep learning models, such as CheXNet, in high-stakes clinical settings is fundamentally constrained by their purely deterministic nature, which provides no reliable measure of predictive confidence. This project addresses this gap by integrating robust Uncertainty Quantification (UQ) into a high-performance diagnostic platform for 14 common thoracic diseases on the NIH ChestX-ray14 dataset. Initial architectural development using Monte Carlo Dropout (MCD) failed to stabilize performance and calibration, yielding an unacceptable Expected Calibration Error (ECE) of 0.7588. This failure necessitated an architectural pivot to a high-diversity, 9-member Deep Ensemble (DE). The resulting DE stabilized performance and delivered superior reliability, achieving a State-of-the-Art (SOTA) average Area Under the Receiver Operating Characteristic Curve (AUROC) of 0.8559 and an average F1 Score of 0.3857. Crucially, the DE demonstrated superior calibration (mean ECE of 0.0728 and Negative Log-Likelihood (NLL) of 0.1916) and enabled the reliable decomposition of total uncertainty into its Aleatoric (irreducible data noise) and Epistemic (reducible model knowledge) components, with a mean Epistemic Uncertainty (EU) of 0.0240. These results establish the Deep Ensemble as a trustworthy and explainable platform, transforming the model from a probabilistic tool into a reliable clinical decision support system.
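The abstract reports a mean ECE but does not say how it was computed. The sketch below shows one common convention, assumed here rather than taken from the paper: for each disease label, predictions are grouped into equal-width probability bins, the gap between the observed positive rate and the mean predicted probability is weighted by bin size, and the per-label values are averaged. The function names and the choice of 10 bins are illustrative.

```python
def expected_calibration_error(probs, labels, n_bins=10):
    """Binary ECE for one label: size-weighted mean of
    |observed positive rate - mean predicted probability| over bins."""
    if not probs or len(probs) != len(labels):
        raise ValueError("probs and labels must be non-empty and equal length")
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        # Equal-width bins on [0, 1]; p == 1.0 falls into the last bin.
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((p, y))
    n = len(probs)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        mean_p = sum(p for p, _ in b) / len(b)
        pos_rate = sum(y for _, y in b) / len(b)
        ece += (len(b) / n) * abs(pos_rate - mean_p)
    return ece


def mean_ece(prob_rows, label_rows, n_bins=10):
    """Average the per-label ECE; rows are images, columns are labels."""
    n_labels = len(prob_rows[0])
    per_label = [
        expected_calibration_error(
            [row[j] for row in prob_rows],
            [row[j] for row in label_rows],
            n_bins,
        )
        for j in range(n_labels)
    ]
    return sum(per_label) / n_labels
```

Under this convention a model that predicts 0.9 for every positive and 0.1 for every negative has an ECE of about 0.1, since each occupied bin is off by 0.1.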
Problem

Research questions and friction points this paper is trying to address.

Addressing unreliable predictive confidence in deterministic deep learning models
Stabilizing performance and calibration for thoracic disease diagnosis
Quantifying uncertainty components to enable trustworthy clinical decision support
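As a rough illustration of how an ensemble's confidence can be scored, the sketch below averages the per-member probabilities for one label and computes the negative log-likelihood (binary cross-entropy) of the averaged prediction. Simple probability averaging and the per-label binary NLL are assumptions made for illustration; the listing does not state how the paper combines members or aggregates NLL across the 14 labels.

```python
import math


def ensemble_mean(member_probs):
    """Average predictions across members.
    member_probs: list of M lists, each holding N probabilities."""
    m = len(member_probs)
    return [sum(column) / m for column in zip(*member_probs)]


def binary_nll(probs, labels, eps=1e-12):
    """Mean negative log-likelihood of binary labels under predicted probabilities."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(probs)


# Hypothetical predictions from two members on two images.
averaged = ensemble_mean([[0.2, 0.8], [0.4, 0.6]])
score = binary_nll(averaged, [0, 1])
```

Lower NLL means the predicted probabilities put more mass on the true labels, so it penalizes both wrong predictions and overconfident ones.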
Innovation

Methods, ideas, or system contributions that make the work stand out.

Deep Ensemble method for uncertainty quantification
Nine-member ensemble improves calibration and performance
Decomposes uncertainty into aleatoric and epistemic components
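The aleatoric/epistemic split mentioned above can be sketched with a widely used entropy-based decomposition for ensembles. This is an assumption for illustration, since the listing does not give the paper's exact formula: total uncertainty is the entropy of the averaged prediction, aleatoric uncertainty is the average entropy of the individual members, and epistemic uncertainty is the difference, which grows when members disagree.

```python
import math


def binary_entropy(p, eps=1e-12):
    """Entropy in nats of a Bernoulli(p) prediction."""
    p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
    return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))


def decompose_uncertainty(member_probs):
    """Split uncertainty for one image and one label.
    member_probs: probabilities from each ensemble member.
    Returns (total, aleatoric, epistemic)."""
    m = len(member_probs)
    mean_p = sum(member_probs) / m
    total = binary_entropy(mean_p)                               # entropy of the mean
    aleatoric = sum(binary_entropy(p) for p in member_probs) / m # mean of entropies
    epistemic = max(total - aleatoric, 0.0)                      # guard rounding error
    return total, aleatoric, epistemic


# Members that agree on 0.5 are uncertain, but not because of disagreement.
agree = decompose_uncertainty([0.5] * 9)
# Members that split between 0.1 and 0.9 disagree, so epistemic uncertainty is high.
split = decompose_uncertainty([0.1, 0.9])
```

In the first case all of the uncertainty is aleatoric; in the second, a large share is epistemic because the averaged prediction is far less confident than any single member.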
Yasiru Laksara
Department of Computer Science and Engineering, University of Moratuwa, Katubedda 10400, Sri Lanka
Uthayasanker Thayasivam
Senior Lecturer, Department of Computer Science and Engineering, University of Moratuwa
NLP, ML, data science