🤖 AI Summary
This study addresses a limitation of current dermatology foundation model evaluations, which rely predominantly on simplified binary classification tasks and therefore fail to capture performance in fine-grained differential diagnosis. Using DERM12345, a dataset encompassing 40 skin lesion subclasses, we propose a hierarchical evaluation framework tailored to dermatology and systematically assess ten foundation models across four clinically relevant granularity levels. Using frozen embeddings with lightweight adapters and five-fold cross-validation, our analysis reveals a "granularity gap" between coarse-level screening and fine-grained diagnostic capabilities. The general medical model MedImageInsights achieves a weighted F1 score of 97.52% on binary malignancy detection, but its performance drops to 65.50% on the 40-class task. In contrast, MedSigLip (69.79%) and the dermatology-specific models Derm Foundation and MONET perform better on 40-class discrimination, which indicates where each category of model is best applied.
📝 Abstract
Foundation models have transformed medical image analysis by providing robust feature representations that reduce the need for large-scale task-specific training. However, current benchmarks in dermatology often reduce the complex diagnostic taxonomy to flat, binary classification tasks, such as distinguishing melanoma from benign nevi. This oversimplification obscures a model's ability to perform fine-grained differential diagnoses, which is critical for clinical workflow integration. This study evaluates the utility of embeddings derived from ten foundation models, spanning general computer vision, general medical imaging, and dermatology-specific domains, for hierarchical skin lesion classification. Using the DERM12345 dataset, which comprises 40 lesion subclasses, we computed frozen embeddings and trained lightweight adapter models with five-fold cross-validation. We introduce a hierarchical evaluation framework that assesses performance across four levels of clinical granularity: 40 Subclasses, 15 Main Classes, 2 and 4 Superclasses, and Binary Malignancy. Our results reveal a "granularity gap" in model capabilities: MedImageInsights achieved the strongest overall performance (97.52% weighted F1-Score on Binary Malignancy detection) but declined to 65.50% on fine-grained 40-class subtype classification. Conversely, MedSigLip (69.79%) and dermatology-specific models (Derm Foundation and MONET) excelled at fine-grained 40-class subtype discrimination while achieving lower overall performance than MedImageInsights on broader classification tasks. Our findings suggest that while general medical foundation models are highly effective for high-level screening, specialized modeling strategies are necessary for the granular distinctions required in diagnostic support systems.
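To make the idea of scoring one model at several granularity levels concrete, the following is a minimal sketch of how a weighted F1-Score could be computed at each level by rolling fine-grained labels up a taxonomy. It is an illustration under stated assumptions, not the authors' pipeline: the abstract does not say whether coarser levels are scored by rolling up subclass predictions or by training a separate adapter per level, and the toy taxonomy, label names, and the helpers `weighted_f1` and `evaluate_levels` are all hypothetical.

```python
from collections import Counter

# Hypothetical toy taxonomy (NOT the DERM12345 label set):
# subclass -> main class -> binary malignancy.
SUBCLASS_TO_MAIN = {
    "acral_nevus": "nevus",
    "blue_nevus": "nevus",
    "lentigo_maligna": "melanoma",
    "nodular_melanoma": "melanoma",
}
MAIN_TO_BINARY = {"nevus": "benign", "melanoma": "malignant"}


def weighted_f1(y_true, y_pred):
    """Support-weighted mean of per-class F1 scores."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for cls, n in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
        fn = n - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        score += (n / total) * f1
    return score


def evaluate_levels(y_true, y_pred):
    """Score subclass predictions at each level by mapping labels upward."""
    main_true = [SUBCLASS_TO_MAIN[t] for t in y_true]
    main_pred = [SUBCLASS_TO_MAIN[p] for p in y_pred]
    bin_true = [MAIN_TO_BINARY[m] for m in main_true]
    bin_pred = [MAIN_TO_BINARY[m] for m in main_pred]
    return {
        "subclass": weighted_f1(y_true, y_pred),
        "main": weighted_f1(main_true, main_pred),
        "binary": weighted_f1(bin_true, bin_pred),
    }


# Toy predictions: every error stays inside the correct main class.
y_true = ["acral_nevus", "blue_nevus", "lentigo_maligna",
          "nodular_melanoma", "blue_nevus", "nodular_melanoma"]
y_pred = ["blue_nevus", "blue_nevus", "nodular_melanoma",
          "nodular_melanoma", "acral_nevus", "lentigo_maligna"]

scores = evaluate_levels(y_true, y_pred)
for level, value in scores.items():
    print(f"{level}: {value:.3f}")
```

In this toy example every mistake confuses two subclasses of the same main class, so the score is perfect at the main-class and binary levels but low at the subclass level. This is the kind of pattern that a binary-only benchmark would hide. In a real setup, `y_pred` would come from an adapter trained on frozen embeddings and evaluated on each held-out cross-validation fold.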