🤖 AI Summary
This study investigates the root causes of concurrent misclassifications by artificial intelligence and human experts on dermoscopic images, asking whether such errors stem from algorithmic bias or from inherent visual ambiguity in the images themselves. Multiple CNN architectures were used to identify systematically misclassified samples, which dermatologists then evaluated independently alongside a control set. Diagnostic agreement was quantified using Cohen's kappa (between experts and ground truth) and Fleiss' kappa (among experts). Results revealed a sharp decline in expert agreement with ground truth on the difficult images (κ = 0.08) compared to controls (κ = 0.61), with inter-expert agreement also markedly reduced (Fleiss' κ = 0.275 vs. 0.456). These findings demonstrate, for the first time, a phenomenon of synchronized failure between AI and humans on specific images, indicating that such errors are primarily attributable to intrinsic image ambiguity and underscoring the critical role of image quality in diagnostic reliability.
📝 Abstract
The integration of artificial intelligence (AI), particularly Convolutional Neural Networks (CNNs), into dermatological diagnosis demonstrates substantial clinical potential. While existing literature predominantly benchmarks algorithmic performance against human experts, our study adopts a novel perspective by investigating the intrinsic complexity of dermatoscopic images. Through rigorous experimentation with multiple CNN architectures, we isolated a subset of images systematically misclassified across all models, a phenomenon statistically shown to exceed random chance. To determine whether these failures stem from algorithmic biases or inherent visual ambiguity, expert dermatologists independently evaluated these challenging cases alongside a control group. The results revealed a collapse in human diagnostic performance on the AI-misclassified images. First, agreement with ground-truth labels plummeted, with Cohen's kappa dropping to a mere 0.08 on the difficult images, compared to 0.61 for the control group. Second, we observed a severe deterioration in expert consensus: inter-rater reliability among physicians fell from moderate concordance (Fleiss' kappa = 0.456) on control images to only fair agreement (Fleiss' kappa = 0.275) on difficult cases. We identified image quality as a primary driver of these dual systematic failures. To promote transparency and reproducibility, all data, code, and trained models have been made publicly available.
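For readers unfamiliar with the two agreement statistics reported above, the following is a minimal sketch of how Cohen's kappa (one rater vs. ground truth) and Fleiss' kappa (many raters) are computed. The labels and rating counts below are illustrative toy data, not values from the study.

```python
import numpy as np

def cohens_kappa(a, b):
    """Cohen's kappa between two label sequences (e.g. expert vs. ground truth)."""
    a, b = np.asarray(a), np.asarray(b)
    cats = np.union1d(a, b)
    p_o = np.mean(a == b)  # observed agreement
    # chance agreement from each rater's marginal label frequencies
    p_e = sum(np.mean(a == c) * np.mean(b == c) for c in cats)
    return (p_o - p_e) / (1 - p_e)

def fleiss_kappa(counts):
    """Fleiss' kappa from an (items x categories) matrix of rating counts,
    assuming the same number of raters for every item."""
    counts = np.asarray(counts, dtype=float)
    n = counts.sum(axis=1)[0]                # raters per item
    p_j = counts.sum(axis=0) / counts.sum()  # overall category proportions
    # per-item agreement: fraction of agreeing rater pairs
    P_i = (np.sum(counts**2, axis=1) - n) / (n * (n - 1))
    P_bar, P_e = P_i.mean(), np.sum(p_j**2)
    return (P_bar - P_e) / (1 - P_e)

# one expert vs. ground truth on ten lesions (0 = benign, 1 = malignant)
truth  = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
expert = [1, 1, 1, 0, 0, 0, 0, 1, 1, 0]
print(round(cohens_kappa(truth, expert), 2))   # -> 0.6

# three raters x four images, counts per category (benign, malignant)
ratings = [[3, 0], [0, 3], [2, 1], [1, 2]]
print(round(fleiss_kappa(ratings), 3))         # -> 0.333
```

Both statistics correct raw agreement for the agreement expected by chance, which is why a κ near 0 (as observed for the difficult images) indicates performance close to guessing even when raw agreement is well above zero.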