Dataset Diversity Metrics and Impact on Classification Models

📅 2026-03-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses the lack of a unified definition and quantification of training dataset diversity, which hinders accurate assessment of its impact on model robustness. Focusing on medical imaging, the work presents the first systematic comparison between reference-free and semantic diversity metrics. Leveraging the MorphoMNIST and PadChest datasets, the authors conduct a multidimensional analysis—integrating Fréchet Inception Distance (FID), AUC, semantic diversity measures, controlled perturbations, and clinical expert evaluations—to examine how diversity correlates with expert intuition, downstream performance, and training dynamics. The findings reveal that FID and semantic diversity more effectively predict model performance, whereas merely increasing the number of imaging device sources can inadvertently encourage models to rely on non-robust shortcut features, thereby exposing a critical pitfall in data diversity design.

📝 Abstract
The diversity of training datasets is usually perceived as an important aspect of obtaining a robust model. However, diversity is often left undefined or defined differently across papers, and while some metrics exist, the quantification of diversity is often overlooked when developing new algorithms. In this work, we study the behaviour of multiple dataset diversity metrics for images, text, and metadata using MorphoMNIST, a toy dataset with controlled perturbations, and PadChest, a publicly available chest X-ray dataset. We evaluate whether these metrics correlate with each other and with the intuition of a clinical expert. We also assess whether they correlate with downstream-task performance and how they impact the training dynamics of the models. We find limited correlations between the AUC and image or metadata reference-free diversity metrics, but higher correlations with the FID and the semantic diversity metrics. Finally, the clinical expert indicates that scanners are the main source of diversity in practice. However, we find that adding another scanner to the training set leads to shortcut learning. The code used in this study is available at https://github.com/TheoSourget/dataset_diversity_evaluation
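The FID cited in the abstract is the Fréchet distance between Gaussian fits of two feature distributions (in the standard FID these are Inception embeddings of the two image sets). As a rough illustration of the quantity involved, here is a minimal sketch of the Fréchet distance over arbitrary feature vectors; this is not the authors' implementation, and the function name and use of `scipy.linalg.sqrtm` are this sketch's own choices:

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(feats_a, feats_b):
    """Fréchet distance between Gaussian fits of two feature sets.

    feats_a, feats_b: arrays of shape (n_samples, dim). In FID these
    would be Inception embeddings of real vs. generated images; any
    feature vectors work for illustration.
    """
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    # Matrix square root of the covariance product; numerical noise
    # can introduce a tiny imaginary component, which we discard.
    covmean = sqrtm(cov_a @ cov_b)
    if np.iscomplexobj(covmean):
        covmean = covmean.real
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a + cov_b - 2.0 * covmean))
```

Identical feature sets yield a distance of (numerically) zero, and the value grows as the means or covariances of the two distributions diverge, which is why the paper can use it as a proxy for how far a training set's distribution sits from a reference.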
Problem

Research questions and friction points this paper is trying to address.

dataset diversity
diversity metrics
classification models
training dynamics
shortcut learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

dataset diversity metrics
reference-free evaluation
shortcut learning
FID
semantic diversity
Théo Sourget
PhD Student, PURRlab, IT University of Copenhagen
Deep Learning, Medical Image Analysis, Fairness, Open Science, Meta-research
Niclas Claßen
IT University of Copenhagen, Denmark
Jack Junchi Xu
Copenhagen University Hospital, Herlev and Gentofte, Denmark
Rob van der Goot
IT University of Copenhagen, Denmark
Veronika Cheplygina
IT University of Copenhagen, Denmark
meta-research, pattern recognition, machine learning, medical imaging, open science