🤖 AI Summary
This study addresses the limited clinical interpretability of Vision Transformers (ViTs) in medical imaging by systematically evaluating whether their attention maps actually localize critical pathological regions. We propose the first evaluation framework for attention-based explanations tailored to medical imaging, introducing two novel quantitative metrics, *anatomical consistency* and *lesion sensitivity*, to assess attention map fidelity. The methodology combines multi-source validation: Grad-CAM and attention rollout visualizations, expert radiologist annotations, and statistical significance testing. Experiments on CheXpert and MIMIC-CXR show that only 38% of attention heatmaps achieve high spatial alignment with expert annotations, revealing a substantial misalignment between current ViT attention explanations and clinically relevant regions. The work thus quantifies the reliability limits of Transformer-based attention explanations and establishes a reproducible, clinically grounded evaluation paradigm to guide the development of trustworthy, deployable explainable AI in radiology.
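
Of the two visualization methods named above, attention rollout (Abnar & Zuidema, 2020) is the one specific to Transformers: it propagates attention through the layers by multiplying per-layer attention matrices, with an identity term added for the residual connections. The sketch below is a minimal NumPy implementation of that standard recipe, not the paper's exact code; the mean head fusion and the CLS-token heatmap extraction are common defaults assumed here.

```python
import numpy as np

def attention_rollout(attentions):
    """Standard attention rollout: multiply per-layer attention matrices,
    adding identity for residual connections and renormalizing rows.

    attentions: one (num_heads, tokens, tokens) array per Transformer layer.
    Returns a (tokens, tokens) matrix of accumulated attention.
    """
    num_tokens = attentions[0].shape[-1]
    rollout = np.eye(num_tokens)
    for layer_attn in attentions:
        fused = layer_attn.mean(axis=0)                     # fuse heads (mean is a common default)
        fused = fused + np.eye(num_tokens)                  # account for the residual connection
        fused = fused / fused.sum(axis=-1, keepdims=True)   # renormalize rows
        rollout = fused @ rollout                           # propagate through this layer
    return rollout

def cls_heatmap(rollout, grid):
    """Spatial heatmap from the CLS token's accumulated attention to image
    patches (assumes token 0 is CLS and the rest form a grid x grid layout)."""
    return rollout[0, 1:].reshape(grid, grid)
```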
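
The summary names the two metrics but not their formulas. A plausible reading, stated here purely as an assumption, is that *lesion sensitivity* measures how much of an expert-annotated lesion falls inside the most-attended region, and *anatomical consistency* measures overlap (IoU) between that region and the annotated anatomical structure. The sketch below implements that reading; the top-k thresholding and both definitions are hypothetical, not taken from the paper.

```python
import numpy as np

def topk_mask(heatmap, k=0.1):
    """Binarize a heatmap by keeping its top-k fraction of pixels."""
    return heatmap >= np.quantile(heatmap, 1.0 - k)

def lesion_sensitivity(heatmap, lesion_mask, k=0.1):
    """Hypothetical definition: fraction of expert-annotated lesion pixels
    (lesion_mask is boolean) covered by the top-k attention region."""
    attended = topk_mask(heatmap, k)
    return (attended & lesion_mask).sum() / max(lesion_mask.sum(), 1)

def anatomical_consistency(heatmap, region_mask, k=0.1):
    """Hypothetical definition: IoU between the top-k attention region and
    the expert-annotated anatomical region (region_mask is boolean)."""
    attended = topk_mask(heatmap, k)
    intersection = (attended & region_mask).sum()
    union = (attended | region_mask).sum()
    return intersection / max(union, 1)
```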
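
For the statistical significance testing, one standard design, again an assumption rather than the paper's stated protocol, is to compare each heatmap's alignment score against a chance-level score from a spatially permuted copy of the same heatmap, using a paired one-sided Wilcoxon signed-rank test.

```python
import numpy as np
from scipy.stats import wilcoxon

def coverage(heatmap, mask, k=0.1):
    """Fraction of mask pixels (mask is boolean) inside the top-k region."""
    attended = heatmap >= np.quantile(heatmap, 1.0 - k)
    return (attended & mask).sum() / max(mask.sum(), 1)

def alignment_significance(heatmaps, masks, k=0.1, seed=0):
    """Paired one-sided Wilcoxon signed-rank test: are real heatmaps better
    aligned with expert masks than spatially permuted copies of themselves?"""
    rng = np.random.default_rng(seed)
    real, null = [], []
    for h, m in zip(heatmaps, masks):
        real.append(coverage(h, m, k))
        shuffled = rng.permutation(h.ravel()).reshape(h.shape)  # chance baseline
        null.append(coverage(shuffled, m, k))
    return wilcoxon(real, null, alternative="greater")
```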