🤖 AI Summary
To address the limited interpretability of vision transformers (ViTs) in medical image classification, this paper proposes HierViT, an intrinsically interpretable hierarchical vision transformer. HierViT uses a prototype-driven architecture that embeds domain-specific medical priors and representative image prototypes into a multi-scale feature space, so that each prediction comes with its explanation. Its dual-path interpretability mechanism combines prototype matching with attention heatmaps, keeping the explanation semantically aligned with clinical diagnostic reasoning. A prototype memory module and a hierarchical feature disentanglement design further support fine-grained, clinically verifiable attribution analysis. Evaluated on the LIDC-IDRI and derm7pt benchmarks, HierViT achieves superior prediction accuracy on the former and comparable accuracy on the latter, while offering explanations that make its decisions easier to validate, particularly in high-risk scenarios.
📝 Abstract
Explainability is in high demand for applications in high-risk areas such as medicine. So far, Vision Transformers have mainly relied on attention extraction to provide insight into the model's reasoning. Our approach combines the high performance of Vision Transformers with new explainability capabilities. We present HierViT, a Vision Transformer that is inherently interpretable and adapts its reasoning to that of humans. A hierarchical structure is used to process domain-specific features for prediction. It is interpretable by design, as it derives the target output from human-defined features that are visualized by exemplary images (prototypes). By incorporating domain knowledge about these decisive features, the reasoning is semantically similar to human reasoning and therefore intuitive. Moreover, attention heatmaps visualize the regions that are crucial for identifying each feature, giving users of HierViT a versatile tool for validating predictions. Evaluated on two medical benchmark datasets, LIDC-IDRI for lung nodule assessment and derm7pt for skin lesion classification, HierViT achieves superior prediction accuracy on LIDC-IDRI and comparable accuracy on derm7pt, while offering explanations that align with human reasoning.
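The hierarchical reasoning described in the abstract (first score human-defined features, then derive the target from those scores, and show an exemplary image for each feature) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the linear heads, the cosine-similarity prototype lookup, the function names, and the attribute names used in the example are all assumptions made for illustration, and the Vision Transformer backbone that would produce the per-attribute embeddings is omitted.

```python
import numpy as np


def nearest_prototype(embedding, prototype_bank):
    """Index of the stored prototype most similar to `embedding` (cosine similarity).

    Hypothetical helper: the abstract only states that features are
    visualized by exemplary images, not how those images are selected.
    """
    bank = prototype_bank / np.linalg.norm(prototype_bank, axis=1, keepdims=True)
    query = embedding / np.linalg.norm(embedding)
    return int(np.argmax(bank @ query))


def hierarchical_predict(names, attr_embeddings, attr_heads,
                         target_weights, target_bias, prototype_banks):
    """Two-stage prediction: attributes first, then the target from the attributes.

    names           -- ordered list of attribute names (order matches target_weights)
    attr_embeddings -- dict name -> 1-D embedding (assumed to come from a ViT)
    attr_heads      -- dict name -> (weight vector, bias) of an assumed linear head
    prototype_banks -- dict name -> 2-D array of stored prototype embeddings
    """
    attr_scores = {}
    prototypes = {}
    for name in names:
        emb = attr_embeddings[name]
        weight, bias = attr_heads[name]
        # Stage 1: score one human-defined attribute.
        attr_scores[name] = float(emb @ weight + bias)
        # Explanation: pick the closest exemplary image for this attribute.
        prototypes[name] = nearest_prototype(emb, prototype_banks[name])
    # Stage 2: the target depends only on the attribute scores.
    score_vec = np.array([attr_scores[n] for n in names])
    target = float(score_vec @ target_weights + target_bias)
    return target, attr_scores, prototypes
```

A toy call, with made-up two-dimensional embeddings and illustrative attribute names:

```python
names = ["margin", "spiculation"]
embeddings = {"margin": np.array([1.0, 0.0]), "spiculation": np.array([0.0, 1.0])}
heads = {"margin": (np.array([2.0, 0.0]), 0.5), "spiculation": (np.array([0.0, 3.0]), 0.0)}
banks = {"margin": np.array([[0.0, 1.0], [1.0, 0.1]]),
         "spiculation": np.array([[0.1, 1.0], [1.0, 0.0]])}
target, scores, protos = hierarchical_predict(
    names, embeddings, heads, np.array([1.0, 1.0]), 0.0, banks)
```

Because the target is computed only from the attribute scores, each prediction can be traced back to named features and their retrieved prototypes, which is the property the abstract calls interpretable by design.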