🤖 AI Summary
This work addresses the challenge of improving out-of-distribution (OoD) generalization for Vision Transformers (ViTs). To this end, we introduce OoD-ViT-NAS, the first neural architecture search (NAS) benchmark specifically designed for OoD generalization, systematically evaluating 3,000 ViT architectures across eight common OoD datasets. Our analysis reveals that in-distribution (ID) accuracy is an unreliable predictor of OoD performance, and that conventional training-free NAS methods consistently fail to predict OoD accuracy, whereas lightweight proxies such as parameter count and FLOPs exhibit stronger predictive power. Notably, increasing the embedding dimension generally enhances OoD robustness. Extensive ablation and attribution analyses confirm that architectural choices alone account for up to an 11.85% spread in OoD accuracy. This work establishes the first reproducible benchmark of its kind and provides empirically grounded design principles for developing OoD-robust ViTs.
📝 Abstract
While ViTs have achieved remarkable success across machine learning tasks, deploying them in real-world scenarios faces a critical challenge: generalizing under OoD shifts. A crucial research gap exists in understanding how to design ViT architectures, both manually and automatically, for better OoD generalization. To this end, we introduce OoD-ViT-NAS, the first systematic benchmark for ViT NAS focused on OoD generalization. This benchmark includes 3,000 ViT architectures of varying computational budgets evaluated on 8 common OoD datasets. Using this benchmark, we analyze the factors contributing to OoD generalization and arrive at several key insights. First, ViT architecture design significantly affects OoD generalization. Second, ID accuracy is often a poor indicator of OoD accuracy, highlighting the risk of optimizing ViT architectures solely for ID performance. Third, we perform the first study of NAS for ViT OoD robustness, analyzing 9 training-free NAS methods. We find that existing training-free NAS methods are largely ineffective at predicting OoD accuracy despite excelling at predicting ID accuracy. Simple proxies such as parameter count (Param) or FLOPs surprisingly outperform complex training-free NAS methods in predicting OoD accuracy. Finally, we study how ViT architectural attributes impact OoD generalization and discover that increasing embedding dimensions generally enhances performance. Our benchmark shows that ViT architectures exhibit a wide range of OoD accuracy, with up to 11.85% improvement for some OoD shifts, underscoring the importance of studying ViT architecture design for OoD. We believe OoD-ViT-NAS can catalyze further research into how ViT designs influence OoD generalization.
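To make the proxy-evaluation idea concrete, the sketch below shows how a simple parameter-count proxy can be scored against OoD accuracy via rank correlation, the standard way training-free NAS proxies are compared. The architecture list and accuracy numbers are purely illustrative placeholders, not values from the benchmark, and `kendall_tau` is a minimal hand-rolled implementation rather than any API from the paper.

```python
# Minimal sketch: scoring a parameter-count proxy against OoD accuracy
# with Kendall rank correlation. All numbers below are hypothetical.

def kendall_tau(xs, ys):
    """Kendall rank correlation between two equal-length score lists."""
    n = len(xs)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            prod = (xs[i] - xs[j]) * (ys[i] - ys[j])
            if prod > 0:
                concordant += 1      # pair ranked the same way by both lists
            elif prod < 0:
                discordant += 1      # pair ranked oppositely
    return (concordant - discordant) / (n * (n - 1) / 2)

# (parameter count in millions, hypothetical OoD accuracy in %)
archs = [(5.2, 21.3), (11.8, 24.9), (22.0, 27.1), (54.0, 30.4), (86.6, 31.0)]
params = [p for p, _ in archs]
ood_acc = [a for _, a in archs]

tau = kendall_tau(params, ood_acc)
print(f"Kendall tau(Param, OoD acc) = {tau:.2f}")
```

A proxy with tau near 1 ranks architectures almost exactly as their OoD accuracy would; the benchmark's finding is that Param and FLOPs achieve higher tau on OoD accuracy than more elaborate training-free scores.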