🤖 AI Summary
This paper exposes critical trustworthiness deficiencies in intrinsically interpretable deep learning models, such as prototype-based networks and concept bottleneck models, demonstrating that their apparent interpretability fosters visual confirmation bias and leaves them vulnerable to adversarial manipulation. Method: The authors formally argue that intrinsic interpretability implies neither robustness nor reliability, introduce two adversarial analysis strategies (prototype manipulation and backdoor attacks), and propose a framework for evaluating interpretability robustness. Results: Experiments show that small, imperceptible perturbations to prototypes suffice to induce semantically nonsensical predictions (e.g., misclassifying birds as cars), revealing fundamental flaws in the models' reasoning mechanisms. The findings delineate the defensive limits of concept-driven interpretable models, call their viability in safety-critical applications into question, and provide both a methodological foundation and a cautionary lesson for trustworthy explainable AI research.
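To make the prototype manipulation idea concrete, below is a minimal, hypothetical sketch of how such an attack could be mounted with projected gradient descent: the source image behind a prototype is perturbed within a small L-infinity budget until its latent representation matches a prototype from an unrelated class, while the pixel change stays imperceptible. The `encoder`, `target_latent`, and the step and budget parameters are illustrative assumptions, not the paper's exact setup.

```python
# Hypothetical PGD-style prototype manipulation sketch; `encoder` and
# `target_latent` are placeholders, not the authors' actual pipeline.
import torch

def manipulate_prototype(encoder, proto_img, target_latent,
                         eps=4 / 255, alpha=1 / 255, steps=40):
    """Return a visually similar image whose embedding mimics `target_latent`."""
    adv = proto_img.clone().detach()
    for _ in range(steps):
        adv.requires_grad_(True)
        # Pull the perturbed prototype's embedding toward the target latent.
        loss = torch.nn.functional.mse_loss(encoder(adv), target_latent)
        grad, = torch.autograd.grad(loss, adv)
        with torch.no_grad():
            adv = adv - alpha * grad.sign()                       # descent step
            adv = proto_img + (adv - proto_img).clamp(-eps, eps)  # L_inf budget
            adv = adv.clamp(0, 1)                                 # valid image
    return adv.detach()
```

Under this kind of budget, the perturbed prototype looks unchanged to a human inspector, which is exactly what makes the resulting explanation misleading.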
📝 Abstract
A common belief is that intrinsically interpretable deep learning models ensure a correct, intuitive understanding of their behavior and offer greater robustness against accidental errors or intentional manipulation. However, these beliefs have not been comprehensively verified, and growing evidence casts doubt on them. In this paper, we highlight the risks related to overreliance and susceptibility to adversarial manipulation of these so-called "intrinsically (aka inherently) interpretable" models by design. We introduce two strategies for adversarial analysis with prototype manipulation and backdoor attacks against prototype-based networks, and discuss how concept bottleneck models defend against these attacks. Fooling the model's reasoning by exploiting its use of latent prototypes manifests the inherent uninterpretability of deep neural networks, leading to a false sense of security reinforced by a visual confirmation bias. The reported limitations of prototype-based networks put their trustworthiness and applicability into question, motivating further work on the robustness and alignment of (deep) interpretable models.
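For intuition, a backdoor attack of the kind referenced above typically relies on data poisoning at training time. The following minimal sketch (with an assumed white trigger patch, poison rate, and target label, none of which are taken from the paper) shows how a small fraction of a training batch could be stamped and relabeled so that the learned prototypes absorb the trigger.

```python
# Hypothetical data-poisoning step for a backdoor attack on a prototype-based
# classifier; patch size, location, and poison rate are illustrative only.
import torch

def poison_batch(images, labels, target_class, poison_rate=0.1, patch=4):
    """Stamp a white trigger patch and relabel a random subset of the batch."""
    images, labels = images.clone(), labels.clone()
    n_poison = max(1, int(poison_rate * len(images)))
    idx = torch.randperm(len(images))[:n_poison]
    images[idx, :, -patch:, -patch:] = 1.0  # trigger in the bottom-right corner
    labels[idx] = target_class              # attacker-chosen target label
    return images, labels
```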