🤖 AI Summary
Multi-head self-attention (MHSA) in Vision Transformers (ViTs) incurs quadratic computational complexity over tokens, spending much of that computation on visually weak or redundant correlations. To address this, we propose Visual-Contrast Attention (VCA), which replaces MHSA with a visual contrastive learning paradigm. VCA introduces contrastive tokens generated via spatial pooling and establishes a dual-branch interaction mechanism—comprising positive and negative streams—alongside dual positional embeddings, enabling fine-grained discriminative modeling at linear complexity O(NnC). The module is lightweight, requires no extra FLOPs, and is fully plug-and-play. On ImageNet-1K, VCA boosts DeiT-Tiny’s top-1 accuracy from 72.2% to 75.6%. It consistently enhances diverse ViT architectures and improves generative model performance, reducing the Fréchet Inception Distance (FID) by up to 5.2 points—demonstrating its general effectiveness across both discriminative and generative vision tasks.
📝 Abstract
Vision Transformers (ViTs) have become a universal backbone for both image recognition and image generation. Yet their Multi-Head Self-Attention (MHSA) layer still performs a quadratic query-key interaction for every token pair, spending the bulk of computation on visually weak or redundant correlations. We introduce Visual-Contrast Attention (VCA), a drop-in replacement for MHSA that injects an explicit notion of discrimination while reducing the theoretical complexity from O(N²C) to O(NnC) with n ≪ N. VCA first distils each head's dense query field into a handful of spatially pooled visual-contrast tokens, then splits them into a learnable positive and negative stream whose differential interaction highlights what truly separates one region from another. The module adds fewer than 0.3M parameters to a DeiT-Tiny backbone, requires no extra FLOPs, and is wholly architecture-agnostic. Empirically, VCA lifts DeiT-Tiny top-1 accuracy on ImageNet-1K from 72.2% to 75.6% (+3.4) and improves three strong hierarchical ViTs by up to 3.1%, while in class-conditional ImageNet generation it lowers FID-50K by 2.1 to 5.2 points across both diffusion (DiT) and flow (SiT) models. Extensive ablations confirm that (i) spatial pooling supplies low-variance global cues, (ii) dual positional embeddings are indispensable for contrastive reasoning, and (iii) combining the two in both stages yields the strongest synergy. VCA therefore offers a simple path towards faster and sharper Vision Transformers. The source code is available at https://github.com/LeapLabTHU/LinearDiff.
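To make the complexity argument concrete, the following is a minimal NumPy sketch of a VCA-style attention step as described above: the dense query field is spatially pooled into n ≪ N contrast tokens, which then interact with keys/values through a positive and a negative stream whose difference forms the contrastive summary. All projection weights, the pooling layout, and the final broadcast step are hypothetical stand-ins — the abstract does not specify these details, and the authors' exact formulation is in the released code.

```python
import numpy as np


def visual_contrast_attention(x, n_contrast=4, seed=0):
    """Illustrative VCA-style linear attention for one head.

    x: (N, C) token features, with N a perfect square (a flattened
    sqrt(N) x sqrt(N) feature map). All weights below are random
    stand-ins for learned projections (hypothetical, not the paper's).
    """
    N, C = x.shape
    rng = np.random.default_rng(seed)
    Wq, Wk, Wv = (rng.standard_normal((C, C)) / np.sqrt(C) for _ in range(3))
    q, k, v = x @ Wq, x @ Wk, x @ Wv

    # Spatially pool the dense query field into n << N contrast tokens
    # via average pooling over non-overlapping windows (one plausible choice).
    side = int(np.sqrt(N))
    grid = int(np.sqrt(n_contrast))
    step = side // grid
    contrast = (
        q.reshape(grid, step, grid, step, C).mean(axis=(1, 3)).reshape(-1, C)
    )  # (n, C)

    def softmax(z):
        z = z - z.max(axis=-1, keepdims=True)
        e = np.exp(z)
        return e / e.sum(axis=-1, keepdims=True)

    # Dual-branch interaction: contrast tokens attend over all N keys.
    # Score matrix is (n, N), so the cost is O(NnC) rather than O(N^2 C).
    scores = contrast @ k.T / np.sqrt(C)
    pos = softmax(scores) @ v    # positive stream: what a region matches
    neg = softmax(-scores) @ v   # negative stream: what it contrasts against
    summary = pos - neg          # differential interaction, (n, C)

    # Broadcast the n contrastive summaries back to all N tokens.
    weights = softmax(q @ contrast.T / np.sqrt(C))  # (N, n)
    return weights @ summary                        # (N, C)
```

Note that every matrix product involves the n contrast tokens rather than an N×N score matrix, which is where the O(N²C) → O(NnC) reduction comes from; the dual positional embeddings mentioned in the abstract are omitted here for brevity.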