Hierarchical Vision Transformer with Prototypes for Interpretable Medical Image Classification

📅 2025-02-13
🤖 AI Summary
To address the limited interpretability of vision transformers (ViTs) in medical image classification, this paper proposes HierViT, an intrinsically interpretable hierarchical vision transformer. HierViT uses a prototype-driven architecture that first predicts human-defined, domain-specific features and then derives the final classification from them, so that prediction and explanation are produced together. Interpretability is provided along two paths: each feature is illustrated by representative prototype images, and attention heatmaps show the image regions that were decisive for it, which keeps the explanation close to clinical diagnostic reasoning. A prototype memory module and a hierarchical feature disentanglement design further support fine-grained attribution analysis that clinicians can check. HierViT achieves state-of-the-art (SOTA) performance on LIDC-IDRI (lung nodule assessment) and accuracy comparable to the SOTA on derm7pt (skin lesion classification), while offering explanations that align with human reasoning, which matters most in high-risk scenarios.

📝 Abstract
Explainability is a highly demanded requirement for applications in high-risk areas such as medicine. Vision Transformers have mainly been limited to attention extraction to provide insight into the model's reasoning. Our approach combines the high performance of Vision Transformers with the introduction of new explainability capabilities. We present HierViT, a Vision Transformer that is inherently interpretable and adapts its reasoning to that of humans. A hierarchical structure is used to process domain-specific features for prediction. It is interpretable by design, as it derives the target output with human-defined features that are visualized by exemplary images (prototypes). By incorporating domain knowledge about these decisive features, the reasoning is semantically similar to human reasoning and therefore intuitive. Moreover, attention heatmaps visualize the crucial regions for identifying each feature, thereby providing HierViT with a versatile tool for validating predictions. Evaluated on two medical benchmark datasets, LIDC-IDRI for lung nodule assessment and derm7pt for skin lesion classification, HierViT achieves superior and comparable prediction accuracy, respectively, while offering explanations that align with human reasoning.
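The reasoning chain described in the abstract (human-defined features first, then the target, with each feature illustrated by a prototype image) can be sketched in a few lines. This is not the authors' implementation: the attribute names, the embedding vectors, the linear weights, and the cosine-similarity lookup are all assumptions chosen for illustration, and in practice a trained encoder would supply the embeddings and attribute scores.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def nearest_prototype(embedding, prototypes):
    """Return (prototype_id, similarity) for the stored vector closest to `embedding`."""
    return max(
        ((proto_id, cosine(embedding, vec)) for proto_id, vec in prototypes.items()),
        key=lambda item: item[1],
    )


def explain_and_predict(embeddings, scores, bank, weights, bias=0.0):
    """Hypothetical two-level prediction: attributes first, then the target.

    embeddings: attribute name -> feature vector (assumed to come from an encoder)
    scores:     attribute name -> predicted attribute score
    bank:       attribute name -> {prototype_id: stored vector}
    weights:    attribute name -> weight in an assumed linear target head
    """
    explanation = {}
    for attr, emb in embeddings.items():
        proto_id, sim = nearest_prototype(emb, bank[attr])
        explanation[attr] = {
            "score": scores[attr],
            "prototype": proto_id,  # the exemplary image shown to the user
            "similarity": sim,
        }
    # The target depends only on the attribute scores, so every input to the
    # final decision is a feature a human can inspect.
    logit = bias + sum(weights[attr] * scores[attr] for attr in scores)
    probability = 1.0 / (1.0 + math.exp(-logit))
    return probability, explanation
```

Because the target head only sees attribute scores, a reader can trace a prediction back to named features and to the prototype image retrieved for each one. A linear head is the simplest choice for a sketch; the paper's actual target head may differ.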
Problem

Research questions and friction points this paper is trying to address.

Enhances medical image classification interpretability
Integrates human-like reasoning in Vision Transformers
Visualizes decisive features using attention heatmaps
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hierarchical Vision Transformer
Prototype-based interpretability
Attention heatmap visualization
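The source does not say how the attention heatmaps are extracted. One common approach, shown here only as an assumption, averages the class-token-to-patch attention weights over heads, rescales them to [0, 1], and reshapes them to the patch grid. The function name and input layout are hypothetical.

```python
def cls_attention_heatmap(attn_per_head, grid_h, grid_w):
    """Average class-token-to-patch attention over heads and reshape to a grid.

    attn_per_head: one list of patch weights per head, each of length
                   grid_h * grid_w (class-token entry already removed).
    Returns a grid_h x grid_w nested list with values rescaled to [0, 1].
    """
    n_patches = grid_h * grid_w
    if not attn_per_head or any(len(head) != n_patches for head in attn_per_head):
        raise ValueError("each head must provide grid_h * grid_w weights")

    n_heads = len(attn_per_head)
    mean = [sum(head[i] for head in attn_per_head) / n_heads for i in range(n_patches)]

    # Min-max rescaling so the map can be overlaid on the image as a heatmap.
    lo, hi = min(mean), max(mean)
    span = hi - lo
    norm = [(v - lo) / span if span else 0.0 for v in mean]

    # Row-major reshape back to the spatial patch layout.
    return [norm[r * grid_w:(r + 1) * grid_w] for r in range(grid_h)]
```

In a hierarchical model of the kind described, one such map could be computed per feature, so that each attribute gets its own highlighted region. Upsampling the grid to image resolution is left out of the sketch.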
Luisa Gallée
Experimental Radiology, Ulm University Medical Center, Germany; XAIRAD - Cooperation for Artificial Intelligence in Experimental Radiology, Germany
C. Lisson
Department of Diagnostic and Interventional Radiology, Ulm University Medical Center, Germany
Meinrad Beer
Department of Diagnostic and Interventional Radiology, Ulm University Medical Center, Germany; i2SouI - Innovative Imaging in Surgical Oncology Ulm, Ulm University Medical Center, Germany; XAIRAD - Cooperation for Artificial Intelligence in Experimental Radiology, Germany; BGZ - Bildgebungszentrum, Ulm University Medical Center, Germany
Michael Götz
Junior Professor, Section Experimental Radiology, University Hospital Ulm
Machine Learning, Personalized Medicine, Radiomics, Transfer Learning, Medical Image Analysis