🤖 AI Summary
Vision Transformers (ViTs) can reach high accuracy while relying on spurious cues, which makes it necessary to understand their inner workings before safe deployment. This work proposes ViSAE, a neuroscience-inspired mechanistic interpretability toolbox built on sparse autoencoders (SAEs) that explains ViT behavior through concept circuits. It contributes a probing suite of 64K images with a 16K visually grounded concept vocabulary, together with top-down concept reading and bottom-up circuit tracing algorithms that automatically recover these circuits. The probing suite improves concept coverage efficiency by 20x over ImageNet and interpretation accuracy by 28.7% over existing concept sets. Through concept editing, ViSAE improves worst-group accuracy on WaterBirds by 48.2%, outperforming existing methods by 23.8%.
📝 Abstract
Despite high accuracy, Vision Transformer (ViT) predictions can be driven by spurious cues, raising the need to understand their inner workings before safe deployment. Sparse autoencoders (SAEs) provide a promising lens for decomposing model representations into human-interpretable concepts, yet adapting SAE-based interpretation to ViTs remains challenging due to limited control over concept coverage and subjective, non-scalable feature interpretation. To fill these gaps, we propose ViSAE, a neuroscience-inspired mechanistic interpretability toolbox for understanding ViT inner workings through concept circuits. ViSAE consists of three components: (1) a probing suite with 64K images and a 16K visually grounded concept vocabulary, improving concept coverage efficiency by 20x over ImageNet and interpretation accuracy by 28.7% over existing concept sets; (2) top-down concept reading and bottom-up circuit tracing algorithms that automatically recover ViT inner workings via concept circuits; and (3) applications for auditing and steering ViT behavior. Through concept editing, ViSAE improves the worst-group accuracy on WaterBirds by 48.2%, outperforming existing methods by 23.8%. Our data and code are available at https://github.com/deep-real/ViSAE.
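The worst-group accuracy reported above is a standard robustness metric: test samples are partitioned into groups (in WaterBirds, by bird type and background), and the lowest per-group accuracy is reported. The following is a minimal sketch of that metric on toy data. It is not taken from the ViSAE codebase, and the function name `worst_group_accuracy` and all values are hypothetical, chosen only to show how a classifier that follows the background can look accurate on average while failing entirely on the minority groups.

```python
from collections import defaultdict


def worst_group_accuracy(preds, labels, groups):
    """Return (lowest per-group accuracy, dict of per-group accuracies).

    Hypothetical helper for illustration; not part of ViSAE.
    """
    if not (len(preds) == len(labels) == len(groups)):
        raise ValueError("preds, labels, and groups must have equal length")
    if len(preds) == 0:
        raise ValueError("at least one sample is required")

    correct = defaultdict(int)
    total = defaultdict(int)
    for pred, label, group in zip(preds, labels, groups):
        total[group] += 1
        correct[group] += int(pred == label)

    per_group = {g: correct[g] / total[g] for g in total}
    return min(per_group.values()), per_group


# Toy data (assumed): label 0 = landbird, 1 = waterbird;
# background 0 = land, 1 = water. Most samples have matching backgrounds.
labels      = [0, 0, 0, 0, 1, 1, 1, 1]
backgrounds = [0, 0, 0, 1, 0, 1, 1, 1]

# A classifier that relies purely on the spurious cue predicts the background.
preds = list(backgrounds)

# Each group is a (label, background) pair, giving four groups in total.
groups = list(zip(labels, backgrounds))

worst, per_group = worst_group_accuracy(preds, labels, groups)
average = sum(p == y for p, y in zip(preds, labels)) / len(labels)

print(f"average accuracy:     {average:.2f}")
print(f"worst-group accuracy: {worst:.2f}")
for group, acc in sorted(per_group.items()):
    print(f"  group {group}: {acc:.2f}")
```

On this toy data the average accuracy looks reasonable while the two minority groups (landbird on water, waterbird on land) are always misclassified, which is the failure mode that worst-group accuracy is designed to expose.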