🤖 AI Summary
To address the lack of predictive uncertainty quantification in deep learning models intended for clinical deployment, this paper proposes a high-diversity, 9-member deep ensemble framework, adopted after Monte Carlo Dropout proved poorly calibrated, that enables fine-grained decomposition of predictive uncertainty into its aleatoric and epistemic components on the NIH ChestX-ray14 multi-label thoracic disease diagnosis task. The method substantially improves model calibration and interpretability: Expected Calibration Error (ECE) drops to 0.0728, Negative Log-Likelihood (NLL) to 0.1916, and mean epistemic uncertainty to 0.0240. It simultaneously achieves an average AUROC of 0.8559 and F1-score of 0.3857, attaining state-of-the-art performance and reliability. This work represents the first systematic realization of high-accuracy uncertainty modeling for multi-label chest X-ray diagnosis, delivering a clinically deployable technical pathway toward trustworthy decision support systems.
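The ECE figure reported above measures how far predicted confidences deviate from empirical accuracy. As a minimal illustration (not the paper's implementation), a standard binned ECE for binary per-label predictions can be sketched as follows; the function name and binning choice are assumptions:

```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins=15):
    """Binned ECE for binary predictions.

    Groups predictions by confidence, then averages the gap between
    mean confidence and empirical accuracy, weighted by bin size.
    """
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels, dtype=float)
    # Confidence in the predicted class and whether that prediction is correct
    conf = np.where(probs >= 0.5, probs, 1.0 - probs)
    correct = (probs >= 0.5) == (labels == 1)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (conf > lo) & (conf <= hi)
        if mask.any():
            ece += mask.mean() * abs(correct[mask].mean() - conf[mask].mean())
    return ece
```

For a multi-label task such as ChestX-ray14, this would typically be computed per disease label and averaged, matching the "Mean ECE" phrasing in the abstract.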
📝 Abstract
The utility of deep learning models such as CheXNet in high-stakes clinical settings is fundamentally constrained by their purely deterministic nature, which fails to provide reliable measures of predictive confidence. This project addresses this critical gap by integrating robust Uncertainty Quantification (UQ) into a high-performance diagnostic platform for 14 common thoracic diseases on the NIH ChestX-ray14 dataset. Initial architectural development using Monte Carlo Dropout (MCD) failed to stabilize performance and calibration, yielding an unacceptable Expected Calibration Error (ECE) of 0.7588. This technical failure necessitated a rigorous architectural pivot to a high-diversity, 9-member Deep Ensemble (DE). The resulting DE stabilized performance and delivered superior reliability, achieving a state-of-the-art (SOTA) average Area Under the Receiver Operating Characteristic Curve (AUROC) of 0.8559 and an average F1 score of 0.3857. Crucially, the DE demonstrated superior calibration (mean ECE of 0.0728 and Negative Log-Likelihood (NLL) of 0.1916) and enabled the reliable decomposition of total uncertainty into its aleatoric (irreducible data noise) and epistemic (reducible model knowledge) components, with a mean epistemic uncertainty (EU) of 0.0240. These results establish the Deep Ensemble as a trustworthy and explainable platform, transforming the model from a probabilistic tool into a reliable clinical decision support system.
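The decomposition described above has a standard information-theoretic form for ensembles: the entropy of the averaged prediction is the total uncertainty, the average of the member entropies is the aleatoric part, and their difference (the mutual information between prediction and model) is the epistemic part. A minimal sketch for per-label sigmoid outputs follows; the array shapes and function names are illustrative assumptions, not the paper's code:

```python
import numpy as np

def binary_entropy(p, eps=1e-12):
    """Entropy of a Bernoulli distribution with success probability p."""
    p = np.clip(p, eps, 1.0 - eps)
    return -(p * np.log(p) + (1.0 - p) * np.log(1.0 - p))

def decompose_uncertainty(member_probs):
    """Split total predictive uncertainty into aleatoric and epistemic parts.

    member_probs: array of shape (M, N, L) holding per-member sigmoid
    outputs for M ensemble members, N images, and L disease labels.
    Total = H[mean_m p_m]; aleatoric = mean_m H[p_m];
    epistemic = total - aleatoric (the mutual information).
    """
    mean_p = member_probs.mean(axis=0)
    total = binary_entropy(mean_p)                         # entropy of the mean
    aleatoric = binary_entropy(member_probs).mean(axis=0)  # mean of entropies
    epistemic = total - aleatoric                          # member disagreement
    return total, aleatoric, epistemic
```

When all members agree, the epistemic term vanishes and only data noise remains; when members disagree, the epistemic term grows, which is the signal a clinical system can use to flag cases for human review.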