🤖 AI Summary
This work examines a limitation of neural networks trained with the conventional cross-entropy loss: their feature attributions can fall short on faithfulness, complexity, and continuity. The authors study supervised contrastive learning, which replaces the explicit classification layer with an embedding space in which samples of the same class cluster together. Earlier work already associates this training objective with better adversarial robustness and out-of-distribution detection. In an empirical evaluation on image classification, networks trained with supervised contrastive learning yield higher-quality feature attribution explanations than networks trained with cross-entropy across all three criteria. The findings reinforce previous results on contrastive approaches and can guide practitioners who want training objectives that target transparency as well as accuracy.
📝 Abstract
Most neural networks (NNs) for classification are trained using Cross-Entropy (CE) as the loss function. This approach requires the model to have an explicit classification layer. However, there exist alternative approaches, such as Contrastive Learning (CL). Instead of explicitly performing classification, CL has the NN produce an embedding space where projections of similar data are pulled together, while projections of dissimilar data are pushed apart. In the case of Supervised CL (SCL), labels are adopted as the similarity criterion, thus creating an embedding space where the projected data points are well-clustered. SCL provides crucial advantages over CE with regard to adversarial robustness and out-of-distribution detection, thus making it a more natural choice in safety-critical scenarios. In the present paper, we empirically show that NNs for image classification trained with SCL present higher-quality feature attribution explanations than those trained with CE with regard to faithfulness, complexity, and continuity. These results reinforce previous findings about CL-based approaches when targeting more trustworthy and transparent NNs and can guide practitioners in selecting training objectives that target not only accuracy but also the transparency of the models.
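To make the "pulled together, pushed apart" idea concrete, here is a minimal sketch of one common formulation of a supervised contrastive loss, written in plain Python. It is an illustration only: the abstract does not state which exact loss the authors use, and the function names, the temperature value of 0.1, and the toy 2-D embeddings are all assumptions made for this example.

```python
import math


def normalize(vector):
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vector)) or 1.0
    return [x / norm for x in vector]


def supervised_contrastive_loss(embeddings, labels, temperature=0.1):
    """Mean loss over anchors that have at least one same-label positive.

    Hypothetical helper for illustration, not the paper's implementation.
    For each anchor i, every other sample with the same label is a positive;
    all other samples (positives and negatives) appear in the denominator.
    """
    z = [normalize(v) for v in embeddings]
    n = len(z)
    per_anchor = []
    for i in range(n):
        positives = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not positives:
            continue  # an anchor without positives contributes nothing
        # Temperature-scaled cosine similarity to every other sample.
        sims = {
            j: sum(a * b for a, b in zip(z[i], z[j])) / temperature
            for j in range(n)
            if j != i
        }
        # Numerically stable log-sum-exp for the denominator.
        max_sim = max(sims.values())
        log_denom = max_sim + math.log(
            sum(math.exp(s - max_sim) for s in sims.values())
        )
        # Average negative log-probability of the positives.
        loss_i = -sum(sims[p] - log_denom for p in positives) / len(positives)
        per_anchor.append(loss_i)
    return sum(per_anchor) / len(per_anchor) if per_anchor else 0.0


# Toy 2-D embeddings: two tight groups (assumed data, for illustration).
points = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
loss_clustered = supervised_contrastive_loss(points, [0, 0, 1, 1])
loss_mixed = supervised_contrastive_loss(points, [0, 1, 0, 1])
print("labels match the clusters:", loss_clustered)
print("labels cut across clusters:", loss_mixed)
```

When the labels agree with the geometric clusters, the loss is small; when the labels cut across the clusters, it is much larger. Minimizing a loss of this kind is what drives same-class projections together and different-class projections apart.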