🤖 AI Summary
In dermatological image classification, inconsistent data preprocessing, augmentation strategies, and evaluation protocols commonly lead to inflated estimates of model performance. This work systematically identifies high-risk practices, including pre-split augmentation and validation-set misuse, and introduces a standardized evaluation framework tailored to dermatological imaging. We propose a unified benchmark protocol with differentiated testing for typical, atypical, and composite lesion images, built on DINOv2-Large with standardized training and inference. Evaluation across three datasets yields macro-F1 scores of 0.85 (HAM10000), 0.71 (DermNet), and 0.84 (ISIC Atlas), while attention heatmap analysis and cross-dataset error attribution reveal critical model limitations: marked vulnerability to atypical and composite cases. All code and methodological guidelines are publicly released to promote reproducibility and fair comparison.
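The pre-split augmentation pitfall named above can be made concrete with a minimal sketch. Everything here is hypothetical illustration, not the paper's pipeline: `split_then_augment` and the placeholder `augment` function are invented names. The point is simply the ordering, split first, then augment only the training portion:

```python
import random

def split_then_augment(images, labels, augment, val_frac=0.2, seed=0):
    """Split FIRST, then augment only the training portion.

    Augmenting before splitting leaks near-duplicate views of the same
    lesion into the validation set, which inflates reported metrics.
    """
    idx = list(range(len(images)))
    random.Random(seed).shuffle(idx)
    n_val = int(len(idx) * val_frac)
    val_idx, train_idx = idx[:n_val], idx[n_val:]

    # Training set: originals plus one augmented view per original.
    train = [(images[i], labels[i]) for i in train_idx]
    train += [(augment(images[i]), labels[i]) for i in train_idx]
    # Validation set: untouched originals only.
    val = [(images[i], labels[i]) for i in val_idx]
    return train, val
```

Because the split happens before any augmentation, no view (original or augmented) of a validation lesion can appear in the training set.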
📝 Abstract
Deep learning approaches to dermatological image classification have shown promising results, yet the field faces significant methodological challenges that impede proper evaluation. This paper presents a dual contribution: first, a systematic analysis of current methodological practices in skin disease classification research, revealing substantial inconsistencies in data preparation, augmentation strategies, and performance reporting; second, a comprehensive training and evaluation framework demonstrated through experiments with the DINOv2-Large vision transformer across three benchmark datasets (HAM10000, DermNet, ISIC Atlas). The analysis identifies concerning patterns, including pre-split data augmentation and validation-based reporting, practices that can inflate reported metrics, and highlights the lack of unified methodology standards. The experiments demonstrate DINOv2's performance in skin disease classification, achieving macro-averaged F1-scores of 0.85 (HAM10000), 0.71 (DermNet), and 0.84 (ISIC Atlas). Attention map analysis reveals critical patterns in the model's decision-making, showing sophisticated feature recognition on typical presentations but significant vulnerabilities on atypical cases and composite images. Our findings highlight the need for standardized evaluation protocols and careful implementation strategies in clinical settings. We propose comprehensive methodological recommendations for model development, evaluation, and clinical deployment, emphasizing rigorous data preparation, systematic error analysis, and specialized protocols for different image types. To promote reproducibility, we provide our implementation code through GitHub. This work establishes a foundation for rigorous evaluation standards in dermatological image classification and provides insights for responsible AI implementation in clinical dermatology.
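The macro-averaged F1 metric reported above weights every class equally, so rare lesion classes count as much as common ones; this is why it is stricter than accuracy on imbalanced dermatology datasets. A minimal self-contained sketch of the computation (the function name and toy labels are illustrative, not from the paper):

```python
def macro_f1(y_true, y_pred):
    """Macro-averaged F1: compute F1 per class, then average with
    equal weight, so rare classes influence the score as much as
    common ones."""
    classes = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1_scores.append(2 * precision * recall / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)
```

In practice one would use `sklearn.metrics.f1_score(y_true, y_pred, average="macro")`, which implements the same averaging.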