🤖 AI Summary
To address computational inefficiency and limited geometric robustness in Vision Transformers (ViTs), this work introduces octic equivariance—equivariance under reflections and 90° rotations—as a systematic inductive bias in the ViT architecture, yielding the octic ViT. The authors design octic-equivariant self-attention and feed-forward layers that explicitly encode these image symmetries, improving parameter efficiency and robustness to transformations. The method is instantiated within both the DeiT-III (supervised) and DINOv2 (self-supervised) frameworks and evaluated on ImageNet-1K under both training regimes. Experiments demonstrate that the octic ViT-H variant reduces FLOPs by approximately 40% while consistently improving performance on both image classification and semantic segmentation. These results support high-order equivariance as a powerful and generalizable inductive bias for vision transformers.
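The octic group here is the dihedral group of order eight: the four 90° rotations of a square image combined with an optional reflection. The paper's equivariant layers are built around this group action; as a minimal sketch (not the paper's implementation, and the helper names below are hypothetical), the eight transforms of a patch and an orbit-averaged invariant feature can be written as:

```python
import numpy as np

def octic_orbit(patch):
    """Return the 8 transforms of a 2D patch under the octic (dihedral D4)
    group: 4 rotations of 90 degrees, each with an optional horizontal flip."""
    orbit = []
    for base in (patch, np.fliplr(patch)):
        for k in range(4):
            orbit.append(np.rot90(base, k))
    return orbit

def corner_feature(t):
    # A deliberately non-invariant feature: sum of the top-left quadrant.
    h, w = t.shape
    return t[: h // 2, : w // 2].sum()

def octic_invariant_pool(patch):
    # Averaging any feature over the group orbit yields an octic-invariant
    # value, since transforming the input only permutes the orbit.
    return float(np.mean([corner_feature(t) for t in octic_orbit(patch)]))
```

For a generic patch the orbit contains eight distinct elements, and `octic_invariant_pool` returns the same value for a patch and any of its rotated or reflected copies; equivariant layers generalize this idea by acting on features that transform predictably under the group rather than collapsing to an invariant scalar.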
📝 Abstract
Recent efforts at scaling computer vision models have established Vision Transformers (ViTs) as the leading architecture. ViTs incorporate weight sharing over image patches as an important inductive bias. In this work, we show that ViTs benefit from incorporating equivariance under the octic group, i.e., reflections and 90-degree rotations, as a further inductive bias. We develop new architectures, octic ViTs, that use octic-equivariant layers and put them to the test on both supervised and self-supervised learning. Through extensive experiments on DeiT-III and DINOv2 training on ImageNet-1K, we show that octic ViTs yield more computationally efficient networks while also improving performance. In particular, we achieve approximately 40% reduction in FLOPs for ViT-H while simultaneously improving both classification and segmentation results.