Calibration and Uncertainty for multiRater Volume Assessment in multiorgan Segmentation (CURVAS) challenge results

📅 2025-05-13
🤖 AI Summary
Problem: Clinical deployment of medical image segmentation is hindered by annotator variability, which leads to subjective ground truths, miscalibrated models, and unreliable uncertainty estimates. Method: The CURVAS challenge evaluates submitted deep learning models against a multi-annotator ground truth that captures both consensus and dissensus among raters, jointly assessing segmentation quality with the Dice Similarity Coefficient (DSC) and reliability with Expected Calibration Error (ECE) and Continuous Ranked Probability Score (CRPS). Contribution/Results: Across the seven participating teams, better calibration is strongly correlated with the quality of the results, and models trained on diverse datasets and enriched with pre-trained knowledge are more robust, particularly on cases that deviate from standard anatomy. The best-performing models combine high DSC with well-calibrated uncertainty estimates, supporting the case for multi-annotator ground truth when building clinically trustworthy segmentation models.

📝 Abstract
Deep learning (DL) has become the dominant approach for medical image segmentation, yet ensuring the reliability and clinical applicability of these models requires addressing key challenges such as annotation variability, calibration, and uncertainty estimation. This is why we created the Calibration and Uncertainty for multiRater Volume Assessment in multiorgan Segmentation (CURVAS) challenge, which highlights the critical role of multiple annotators in establishing a more comprehensive ground truth, emphasizing that segmentation is inherently subjective and that leveraging inter-annotator variability is essential for robust model evaluation. Seven teams participated in the challenge, submitting a variety of DL models evaluated using metrics such as Dice Similarity Coefficient (DSC), Expected Calibration Error (ECE), and Continuous Ranked Probability Score (CRPS). By incorporating consensus and dissensus ground truth, we assess how DL models handle uncertainty and whether their confidence estimates align with true segmentation performance. Our findings reinforce the importance of well-calibrated models, as better calibration is strongly correlated with the quality of the results. Furthermore, we demonstrate that segmentation models trained on diverse datasets and enriched with pre-trained knowledge exhibit greater robustness, particularly in cases deviating from standard anatomical structures. Notably, the best-performing models achieved high DSC and well-calibrated uncertainty estimates. This work underscores the need for multi-annotator ground truth, thorough calibration assessments, and uncertainty-aware evaluations to develop trustworthy and clinically reliable DL-based medical image segmentation models.
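To make two of the metrics named in the abstract concrete, the following is a minimal sketch of DSC and a binned ECE for a single binary structure. It is not the challenge's evaluation code: the function names, the 0.5 threshold, the 10 equal-width bins, and the "two empty masks score 1.0" convention are all assumptions chosen for illustration, and masks are passed as flattened lists of voxels.

```python
from typing import Sequence


def dice_coefficient(pred: Sequence[int], target: Sequence[int]) -> float:
    """Dice Similarity Coefficient between two flattened binary masks."""
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


def expected_calibration_error(
    probs: Sequence[float], labels: Sequence[int], n_bins: int = 10
) -> float:
    """Binned ECE for binary foreground probabilities.

    Confidence is the probability of the predicted class, max(p, 1 - p);
    accuracy is whether the thresholded prediction matches the label.
    """
    bin_conf = [0.0] * n_bins
    bin_acc = [0.0] * n_bins
    bin_count = [0] * n_bins
    for p, y in zip(probs, labels):
        pred = 1 if p >= 0.5 else 0  # assumed decision threshold
        conf = max(p, 1.0 - p)
        # Clamp so that conf == 1.0 falls into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bin_conf[idx] += conf
        bin_acc[idx] += 1.0 if pred == y else 0.0
        bin_count[idx] += 1
    n = len(probs)
    ece = 0.0
    for c, a, k in zip(bin_conf, bin_acc, bin_count):
        if k:
            # Weight each bin's |accuracy - confidence| gap by its share of voxels.
            ece += (k / n) * abs(a / k - c / k)
    return ece


if __name__ == "__main__":
    probs = [0.9, 0.8, 0.3, 0.6]
    labels = [1, 1, 0, 1]
    pred_mask = [1 if p >= 0.5 else 0 for p in probs]
    print("DSC:", dice_coefficient(pred_mask, labels))
    print("ECE:", expected_calibration_error(probs, labels))
```

In this toy case every voxel is classified correctly, so DSC is perfect while ECE is still non-zero: the model is right but under-confident. This is the kind of gap between accuracy and confidence that a calibration metric is meant to expose.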
Problem

Research questions and friction points this paper is trying to address.

Addressing annotation variability in medical image segmentation
Ensuring model calibration and uncertainty estimation reliability
Leveraging multi-annotator data for robust segmentation evaluation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leveraging multi-annotator variability for robust evaluation
Using diverse datasets and pre-trained knowledge for robustness
Incorporating calibration and uncertainty metrics for reliability
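As a rough illustration of how several annotations can be turned into consensus and dissensus regions, and of how CRPS scores a distribution of scalar predictions such as organ volumes, here is a minimal sketch. The unanimity rule for consensus, the function names, and the sample-based CRPS estimator are assumptions made for this example; the page does not specify how the challenge defines or computes them.

```python
from typing import List, Sequence, Tuple


def consensus_dissensus(
    rater_masks: Sequence[Sequence[int]],
) -> Tuple[List[int], List[int], List[int]]:
    """Split voxels by rater agreement.

    Returns three flattened binary masks: voxels every rater marked as
    foreground, voxels every rater marked as background, and voxels where
    the raters disagree (dissensus).
    """
    n_raters = len(rater_masks)
    # Number of raters who marked each voxel as foreground.
    votes = [sum(column) for column in zip(*rater_masks)]
    foreground = [1 if v == n_raters else 0 for v in votes]
    background = [1 if v == 0 else 0 for v in votes]
    dissensus = [1 if 0 < v < n_raters else 0 for v in votes]
    return foreground, background, dissensus


def crps_from_samples(samples: Sequence[float], observation: float) -> float:
    """Sample-based CRPS estimate: E|X - y| - 0.5 * E|X - X'|."""
    n = len(samples)
    term_obs = sum(abs(x - observation) for x in samples) / n
    term_spread = sum(abs(a - b) for a in samples for b in samples) / (n * n)
    return term_obs - 0.5 * term_spread


if __name__ == "__main__":
    # Three hypothetical raters annotating the same four voxels.
    raters = [
        [1, 1, 0, 0],
        [1, 0, 0, 0],
        [1, 1, 1, 0],
    ]
    fg, bg, dis = consensus_dissensus(raters)
    print("consensus foreground:", fg)
    print("consensus background:", bg)
    print("dissensus:", dis)
    # Hypothetical predicted volumes scored against one reference volume.
    print("CRPS:", crps_from_samples([1.0, 3.0], 2.0))
```

With a single sample, this CRPS estimate reduces to the absolute error, which is why it is often described as a probabilistic generalization of that error.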
Meritxell Riera-Marin
BCN Medtech, Universitat Pompeu Fabra, Barcelona, Spain
Sikha O K
Senior Researcher, BCN MedTech | Uncertainty in DL | Explainable AI | Computer Vision | Cancer Evolution
Uncertainty in DL, Explainable AI, Computer Vision, image processing, DL/ML
Julia Rodriguez-Comas
Sycai Technologies SL, Scientific and Technical Department, Barcelona, Spain
Matthias Stefan May
University Hospital Erlangen, Imaging Science Institute, Erlangen, Germany
Zhaohong Pan
Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China
Xiang Zhou
Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China
Xiaokun Liang
Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China
Franciskus Xaverius Erick
FAU Erlangen-Nürnberg
Andrea Prenner
Friedrich-Alexander-Universitaet Erlangen-Nuernberg (FAU), Erlangen, Germany
Cedric Hemon
Hospital de Sant Pau i la Santa Creu, Diagnostic Imaging Department, Barcelona, Spain
Valentin Boussot
PhD Candidate, LTSI
Deep Learning, Registration, Segmentation, SMMVATS
Jean-Louis Dillenseger
Hospital de Sant Pau i la Santa Creu, Diagnostic Imaging Department, Barcelona, Spain
Jean-Claude Nunes
Institut de Recerca Sant Pau - Centre CERCA, Advanced Medical Imaging, Artificial Intelligence, and Imaging-Guided Therapy Research Group, Barcelona, Spain
Abdul Qayyum
Imperial College London, UK
Machine and Deep Learning, Biomedical Signals and Imaging, Cardiac Digital Twin, quantum ML
Moona Mazher
University College London, UK
Medical Image Analysis, Deep Learning, EEG signal processing, Machine Learning, Brain signal
Steven A Niederer
National Heart and Lung Institute, Faculty of Medicine, Imperial College London, London, United Kingdom
Kaisar Kushibar
University of Barcelona
Medical Image Analysis, Deep Learning, Computer Vision
Carlos Martin-Isla
Barcelona Artificial Intelligence in Medicine Lab (BCN-AIM), Facultat de Matematiques i Informatica, Universitat de Barcelona, Barcelona, Spain
Petia Radeva
IBA, Facultat de Matematiques i Informatica, and Institute of Neuroscience, Universitat de Barcelona, Barcelona, Spain
Karim Lekadir
ICREA Research Professor, Universitat de Barcelona
Biomedical data science, healthcare AI, trustworthy AI, medical image analysis
Theodore Barfoot
King’s College London (KCL), London, United Kingdom
Luis C. Garcia Peraza Herrera
King’s College London (KCL), London, United Kingdom
Ben Glocker
Imperial College London
Medical Image Analysis, Computer Vision, Machine Learning
Tom Vercauteren
Professor of Interventional Image Computing, King's College London
Medical Image Computing, Image Registration, Computer-assisted Interventions, Endomicroscopy, Image-guided Interventions
Lucas Gago
Universitat de Barcelona (UB), Barcelona, Spain
Justin Englemann
Joy-Marie Kleiss
Universitaetsklinikum Erlangen, Department of Radiology, Uniklinikum Erlangen, Erlangen, Germany
Anton Aubanell
Hospital de Sant Pau i la Santa Creu, Diagnostic Imaging Department, Barcelona, Spain
Andreu Antolin
Hospital Universitari Vall d’Hebron, Department of Radiology, Institut de Diagnostic per la Imatge (IDI), Barcelona, Spain
Javier Garcia-Lopez
Sycai Technologies SL, Scientific and Technical Department, Barcelona, Spain
Miguel A. Gonzalez Ballester
BCN Medtech, Universitat Pompeu Fabra, Barcelona, Spain
Adrian Galdran
Ramon y Cajal / Ikerbasque Research Fellow @ Tecnalia
Medical Computer Vision, Deep Learning for Biomedical Image Analysis