🤖 AI Summary
This study systematically evaluates the robustness and potential sex bias of AI models for skin cancer diagnosis. Using the PAD-UFES-20 dataset, we compare logistic regression (LR) trained on handcrafted ABCDE/7-point checklist features against a fine-tuned ResNet-50 (CNN), assessing performance across training sets with varying sex compositions. We adapt a cross-sex robustness evaluation framework, previously applied to Alzheimer's disease, to dermatology AI. Results demonstrate strong overall robustness for both models; however, the CNN exhibits a statistically significant male bias, achieving higher accuracy and AUROC for male patients than for female patients (p < 0.01), whereas LR shows no significant sex disparity. These findings uncover latent sex bias in clinical AI systems and underscore the role of medically informed feature engineering in mitigating algorithmic bias. The work provides a methodological foundation and empirical evidence for developing trustworthy, equitable AI tools for skin cancer diagnosis.
📝 Abstract
Deep learning has been reported to achieve high performance in skin cancer detection, yet many challenges remain regarding the reproducibility of results and potential biases. This study is a replication (different data, same analysis) of a study on Alzheimer's disease [28] that examined the robustness of logistic regression (LR) and convolutional neural networks (CNNs) across patient sexes. We explore sex bias in skin cancer detection using the PAD-UFES-20 dataset, with LR trained on handcrafted features reflecting dermatological guidelines (the ABCDE rule and the 7-point checklist) and a pre-trained ResNet-50 model. We evaluate these models in alignment with [28], across multiple training datasets with varied sex composition, to determine their robustness. Our results show that both the LR and the CNN were robust to the sex distributions, but they also revealed that the CNN had a significantly higher accuracy (ACC) and area under the receiver operating characteristic curve (AUROC) for male patients than for female patients. We hope these findings contribute to the growing body of work investigating potential bias in popular medical machine learning methods. The data and the scripts needed to reproduce our results can be found in our GitHub repository.
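The per-sex comparison described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names, the 0.5 decision threshold, and the tie-free AUROC computation (via the rank-sum / Mann-Whitney U equivalence) are assumptions made for the example.

```python
import numpy as np


def auroc(y_true, y_score):
    """AUROC via the rank-sum (Mann-Whitney U) equivalence.

    Assumes both classes are present and scores have no ties
    (a full implementation would average tied ranks).
    """
    order = np.argsort(y_score)
    ranks = np.empty(len(y_score), dtype=float)
    ranks[order] = np.arange(1, len(y_score) + 1)  # rank 1 = lowest score
    n_pos = int(np.sum(y_true))
    n_neg = len(y_true) - n_pos
    # Sum of positive-class ranks, shifted and normalized to [0, 1].
    return (ranks[y_true == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def per_group_metrics(y_true, y_score, group, threshold=0.5):
    """Accuracy and AUROC per subgroup (e.g. patient sex)."""
    out = {}
    for g in np.unique(group):
        m = group == g
        out[str(g)] = {
            "acc": float(np.mean((y_score[m] >= threshold) == y_true[m])),
            "auroc": auroc(y_true[m], y_score[m]),
        }
    return out


# Toy example: two patients per sex, binary malignancy labels.
y = np.array([0, 1, 0, 1])
scores = np.array([0.1, 0.8, 0.4, 0.35])
sex = np.array(["M", "M", "F", "F"])
print(per_group_metrics(y, scores, sex))
```

A disparity such as the one reported for the CNN would show up here as a systematic gap between the `"M"` and `"F"` entries; the paper additionally tests whether such gaps are statistically significant, which this sketch omits.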