ViT Registers and Fractal ViT

📅 2026-01-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limitation that permutation invariance imposes on the modeling capacity of Vision Transformers (ViTs). The authors propose Fractal ViT, which draws on two recent observations: that language models can perform surprisingly well without positional encoding (NoPE), and that registers (additional throwaway tokens not tied to the input) can improve large ViTs. The method introduces structured attention masks between regular tokens and "summary tokens" akin to registers, and systematically evaluates this design in isolation and in combination with various positional encoding schemes, offering a novel way to break the permutation symmetry inherent in standard ViTs. Empirical results indicate that, under the reported experimental settings, Fractal ViT does not surpass a standard ViT equipped with registers, suggesting that the efficacy of such mechanisms may depend on model scale, domain, or task.

📝 Abstract
Drawing inspiration from recent findings, including the surprisingly decent performance of transformers without positional encoding (NoPE) in the domain of language models and how registers (additional throwaway tokens not tied to input) may improve the performance of large vision transformers (ViTs), we invent and test a variant of ViT called Fractal ViT that breaks permutation invariance among the tokens by applying an attention mask between the regular tokens and "summary tokens" similar to registers, in isolation or in combination with various positional encodings. These models do not improve upon ViT with registers, highlighting the fact that these findings may be scale-, domain-, or application-specific.
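The abstract's core mechanism is a structured attention mask between regular (patch) tokens and register-like summary tokens. The paper's actual mask structure is not given here, so the following is only a minimal illustrative sketch of one such structured mask: summary tokens attend to all tokens, while regular tokens attend only to themselves and to the summary tokens. The function name and the specific mask pattern are assumptions for illustration, not the paper's method.

```python
import numpy as np

def summary_token_attention_mask(num_patches: int, num_summary: int) -> np.ndarray:
    """Boolean attention mask (True = may attend) over [summary | patch] tokens.

    Hypothetical illustration: summary tokens see every token, patch tokens
    see only the summary tokens and themselves. Fractal ViT's real mask may
    differ (e.g. a hierarchical/fractal grouping of patches under summaries).
    """
    n = num_summary + num_patches
    mask = np.zeros((n, n), dtype=bool)
    mask[:num_summary, :] = True                 # summary tokens attend to all
    mask[:, :num_summary] = True                 # all tokens attend to summaries
    mask[np.arange(n), np.arange(n)] = True      # every token attends to itself
    return mask

m = summary_token_attention_mask(num_patches=4, num_summary=2)
```

A mask like this could be passed as `attn_mask` to a standard attention implementation; masked (False) positions are set to negative infinity before the softmax so they receive zero attention weight.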
Problem

Research questions and friction points this paper is trying to address.

Vision Transformer
registers
permutation invariance
positional encoding
summary tokens
Innovation

Methods, ideas, or system contributions that make the work stand out.

Fractal ViT
Registers
Permutation Invariance
Positional Encoding
Attention Mask
Jason Chuan-Chih Chou
Cohere Labs Community, Toronto, ON, Canada
Abhinav Kumar
Mathematics and Computing, Indian Institute of Technology Roorkee, Roorkee, Uttarakhand, India
Shivank Garg
UG Student, Artificial Intelligence and Data Science, IIT Roorkee
Deep Learning · Generative AI · AI Security