🤖 AI Summary
This work addresses the limitation that permutation invariance imposes on the modeling capacity of Vision Transformers (ViTs). Motivated by two recent findings, namely the surprisingly strong performance of transformers without positional encoding (NoPE) in language modeling and the benefit of register tokens in large ViTs, the authors propose Fractal ViT, a ViT variant that breaks the permutation symmetry of standard ViTs by applying structured attention masks between regular tokens and register-like "summary tokens". The method is evaluated both in isolation and in combination with various positional encoding schemes, offering a systematic look at cross-modal architectural transfer and at mechanisms for breaking the permutation symmetry inherent in standard ViTs. Empirical results indicate that, under the experimental settings considered, Fractal ViT does not surpass a standard ViT equipped with registers, suggesting that the efficacy of such mechanisms may depend on model scale, domain, or task.
📝 Abstract
Drawing inspiration from recent findings, including the surprisingly decent performance of transformers without positional encoding (NoPE) in the domain of language models and reports that registers (additional throwaway tokens not tied to the input) may improve the performance of large vision transformers (ViTs), we introduce and test a variant of ViT, called fractal ViT, that breaks permutation invariance among the tokens by applying an attention mask between the regular tokens and register-like ``summary tokens'', in isolation or in combination with various positional encodings. These models do not improve upon ViT with registers, suggesting that these findings may be scale-, domain-, or application-specific.
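To make the masking idea concrete, below is a minimal sketch of one way such a structured attention mask could be built. This is an illustration only, not the paper's exact design: it assumes regular tokens are partitioned into contiguous groups, each group is paired with one summary token, a regular token may attend within its group and to its group's summary token, and summary tokens attend globally. Because group membership depends on token position, permuting the regular tokens changes the connectivity, which breaks permutation invariance. The function name and grouping scheme are assumptions.

```python
def grouped_summary_mask(n_regular: int, group_size: int) -> list[list[bool]]:
    """Boolean attention mask over [regular tokens | summary tokens].

    Hypothetical illustration (not the paper's exact mask): regular tokens
    are split into contiguous groups of `group_size`, each group gets one
    summary token. mask[i][j] == True means token i may attend to token j.
    """
    assert n_regular % group_size == 0, "groups must tile the regular tokens"
    n_groups = n_regular // group_size
    n = n_regular + n_groups
    mask = [[False] * n for _ in range(n)]
    for g in range(n_groups):
        lo, hi = g * group_size, (g + 1) * group_size
        s = n_regular + g                  # index of group g's summary token
        for i in range(lo, hi):
            for j in range(lo, hi):
                mask[i][j] = True          # within-group attention
            mask[i][s] = True              # regular token -> its summary token
    for i in range(n_regular, n):
        mask[i] = [True] * n               # summary tokens attend globally
    return mask
```

In a transformer implementation, a mask like this would be converted to additive form (0 where allowed, a large negative value where blocked) and added to the attention logits before the softmax.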