🤖 AI Summary
This work addresses the limited robustness of vision models in low-data medical imaging, particularly against image corruptions and adversarial perturbations. It presents the first robustness-focused evaluation of ZACH-ViT, a compact, permutation-invariant Transformer architecture without positional encodings. The study assesses ZACH-ViT on seven MedMNIST subsets (50 samples per class) under FGSM and PGD adversarial attacks and under common image corruptions, using fixed hyperparameters and five random seeds. Compared with three scratch-trained baselines (ABMIL, Minimal-ViT, and TransMIL), ZACH-ViT achieves the best mean rank on clean data (1.57) and under common corruptions (1.57), ranks first under FGSM (2.00), and ranks second under PGD (2.29), where ABMIL performs best. All models deteriorate substantially under adversarial attack, so adversarial robustness remains an open challenge.
📝 Abstract
The recently introduced ZACH-ViT (Zero-token Adaptive Compact Hierarchical Vision Transformer) formalized a compact permutation-invariant Vision Transformer for medical imaging and argued that architectural alignment with spatial structure can matter more than universal benchmark dominance. Its design was motivated by the observation that positional embeddings and a dedicated class token encode fixed spatial assumptions that may be suboptimal when spatial organization is weakly informative, locally distributed, or variable across biomedical images. The foundational study established a regime-dependent clean performance profile across MedMNIST, but did not examine robustness in detail. In this work, we present the first robustness-focused extension of ZACH-ViT by evaluating its behavior under common image corruptions and adversarial perturbations in the same low-data setting. We compare ZACH-ViT with three scratch-trained compact baselines, ABMIL, Minimal-ViT, and TransMIL, on seven MedMNIST datasets using 50 samples per class, fixed hyperparameters, and five random seeds. Across the benchmark, ZACH-ViT achieves the best overall mean rank on clean data (1.57) and under common corruptions (1.57), indicating a favorable balance between baseline predictive performance and robustness to realistic image degradation. Under adversarial stress, all models deteriorate substantially; nevertheless, ZACH-ViT remains competitive, ranking first under FGSM (2.00) and second under PGD (2.29), where ABMIL performs best overall. These results extend the original ZACH-ViT narrative: the advantages of compact permutation-invariant transformers are not limited to clean evaluation, but can persist under realistic perturbation stress in low-data medical imaging, while adversarial robustness remains an open challenge for all evaluated models.
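The mean-rank figures quoted above summarize per-dataset orderings of the four models. The abstract does not spell out the ranking procedure, so the following is only a minimal sketch of one common way to compute such a statistic: rank models within each dataset (1 = best, ties share the average rank), then average each model's rank across datasets. The function names, model names, and scores are hypothetical placeholders, not the paper's code or results.

```python
from statistics import mean


def rank_models(scores):
    """Rank models on one dataset: 1 = highest score; tied models share the average rank."""
    ordered = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    ranks = {}
    i = 0
    while i < len(ordered):
        # Extend j to cover every model tied with the one at position i.
        j = i
        while j + 1 < len(ordered) and ordered[j + 1][1] == ordered[i][1]:
            j += 1
        avg_rank = ((i + 1) + (j + 1)) / 2  # average of the 1-based positions i+1 .. j+1
        for name, _ in ordered[i:j + 1]:
            ranks[name] = avg_rank
        i = j + 1
    return ranks


def mean_ranks(results):
    """Average each model's per-dataset rank over all datasets."""
    per_dataset = [rank_models(scores) for scores in results.values()]
    models = per_dataset[0].keys()
    return {m: mean(r[m] for r in per_dataset) for m in models}


# Hypothetical toy scores (higher is better); NOT values from the paper.
toy = {
    "dataset_a": {"model_a": 0.80, "model_b": 0.75, "model_c": 0.70, "model_d": 0.65},
    "dataset_b": {"model_a": 0.60, "model_b": 0.62, "model_c": 0.60, "model_d": 0.50},
}
print(mean_ranks(toy))
```

Under this assumed scheme with seven datasets and no ties, a mean rank of 1.57 would correspond to a rank sum of 11 (11/7 ≈ 1.57), and 2.29 to a rank sum of 16; whether the study ranks on seed-averaged scores or handles ties this way is not stated in the abstract.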