🤖 AI Summary
To address computational inefficiency and limited geometric robustness in Vision Transformers (ViTs), this work introduces octic equivariance—equivariance under reflections and 90° rotations—as a systematic inductive bias in the ViT architecture, yielding the octic ViT. The authors design octic-equivariant self-attention and feed-forward layers that explicitly encode these image symmetries, improving parameter efficiency and robustness to transformations. The method is instantiated within both the DeiT-III (supervised) and DINOv2 (self-supervised) frameworks and evaluated on ImageNet-1K under both training regimes. Experiments demonstrate that the octic ViT-H variant reduces FLOPs by approximately 40% while consistently improving performance on both image classification and semantic segmentation. These results support high-order equivariance as a powerful and generalizable inductive bias for vision transformers.
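The octic group here is the dihedral group of order eight: the four 90° rotations of a square image combined with an optional reflection. The paper's equivariant layers are built around this group action; as a minimal sketch (not the paper's implementation, and the helper names below are hypothetical), the eight transforms of a patch and an orbit-averaged invariant feature can be written as:

```python
import numpy as np

def octic_orbit(patch):
    """Return the 8 transforms of a 2D patch under the octic (dihedral D4)
    group: 4 rotations of 90 degrees, each with an optional horizontal flip."""
    orbit = []
    for base in (patch, np.fliplr(patch)):
        for k in range(4):
            orbit.append(np.rot90(base, k))
    return orbit

def corner_feature(t):
    # A deliberately non-invariant feature: sum of the top-left quadrant.
    h, w = t.shape
    return t[: h // 2, : w // 2].sum()

def octic_invariant_pool(patch):
    # Averaging any feature over the group orbit yields an octic-invariant
    # value, since transforming the input only permutes the orbit.
    return float(np.mean([corner_feature(t) for t in octic_orbit(patch)]))
```

For a generic patch the orbit contains eight distinct elements, and `octic_invariant_pool` returns the same value for a patch and any of its rotated or reflected copies; equivariant layers generalize this idea by acting on features that transform predictably under the group rather than collapsing to an invariant scalar.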
📝 Abstract
Recent efforts at scaling computer vision models have established Vision Transformers (ViTs) as the leading architecture. ViTs incorporate weight sharing over image patches as an important inductive bias. In this work, we show that ViTs benefit from incorporating equivariance under the octic group, i.e., reflections and 90-degree rotations, as a further inductive bias. We develop new architectures, octic ViTs, that use octic-equivariant layers and put them to the test on both supervised and self-supervised learning. Through extensive experiments on DeiT-III and DINOv2 training on ImageNet-1K, we show that octic ViTs yield more computationally efficient networks while also improving performance. In particular, we achieve approximately 40% reduction in FLOPs for ViT-H while simultaneously improving both classification and segmentation results.