🤖 AI Summary
Multi-head self-attention (MHSA) in Vision Transformers (ViTs) incurs quadratic computational complexity over tokens, spending much of that computation on visually weak or redundant correlations. To address this, we propose Visual-Contrast Attention (VCA), which replaces MHSA with a visual contrastive learning paradigm. VCA introduces contrastive tokens generated via spatial pooling and establishes a dual-branch interaction mechanism—comprising positive and negative streams—alongside dual positional embeddings, enabling fine-grained discriminative modeling at linear complexity O(NnC). The module is lightweight, requires no extra FLOPs, and is fully plug-and-play. On ImageNet-1K, VCA boosts DeiT-Tiny’s top-1 accuracy from 72.2% to 75.6%. It consistently enhances diverse ViT architectures and improves generative model performance, reducing the Fréchet Inception Distance (FID) by up to 5.2 points—demonstrating its general effectiveness across both discriminative and generative vision tasks.
📝 Abstract
Vision Transformers (ViTs) have become a universal backbone for both image recognition and image generation. Yet their Multi-Head Self-Attention (MHSA) layer still performs a quadratic query-key interaction for every token pair, spending the bulk of computation on visually weak or redundant correlations. We introduce Visual-Contrast Attention (VCA), a drop-in replacement for MHSA that injects an explicit notion of discrimination while reducing the theoretical complexity from O(N²C) to O(NnC) with n ≪ N. VCA first distils each head's dense query field into a handful of spatially pooled visual-contrast tokens, then splits them into a learnable positive and negative stream whose differential interaction highlights what truly separates one region from another. The module adds fewer than 0.3M parameters to a DeiT-Tiny backbone, requires no extra FLOPs, and is wholly architecture-agnostic. Empirically, VCA lifts DeiT-Tiny top-1 accuracy on ImageNet-1K from 72.2% to 75.6% (+3.4) and improves three strong hierarchical ViTs by up to 3.1%, while in class-conditional ImageNet generation it lowers FID-50K by 2.1 to 5.2 points across both diffusion (DiT) and flow (SiT) models. Extensive ablations confirm that (i) spatial pooling supplies low-variance global cues, (ii) dual positional embeddings are indispensable for contrastive reasoning, and (iii) combining the two in both stages yields the strongest synergy. VCA therefore offers a simple path towards faster and sharper Vision Transformers. The source code is available at https://github.com/LeapLabTHU/LinearDiff.
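To make the complexity argument concrete, the following is a minimal NumPy sketch of a VCA-style attention step as described above: the dense query field is spatially pooled into n ≪ N contrast tokens, which then interact with keys/values through a positive and a negative stream whose difference forms the contrastive summary. All projection weights, the pooling layout, and the final broadcast step are hypothetical stand-ins — the abstract does not specify these details, and the authors' exact formulation is in the released code.

```python
import numpy as np


def visual_contrast_attention(x, n_contrast=4, seed=0):
    """Illustrative VCA-style linear attention for one head.

    x: (N, C) token features, with N a perfect square (a flattened
    sqrt(N) x sqrt(N) feature map). All weights below are random
    stand-ins for learned projections (hypothetical, not the paper's).
    """
    N, C = x.shape
    rng = np.random.default_rng(seed)
    Wq, Wk, Wv = (rng.standard_normal((C, C)) / np.sqrt(C) for _ in range(3))
    q, k, v = x @ Wq, x @ Wk, x @ Wv

    # Spatially pool the dense query field into n << N contrast tokens
    # via average pooling over non-overlapping windows (one plausible choice).
    side = int(np.sqrt(N))
    grid = int(np.sqrt(n_contrast))
    step = side // grid
    contrast = (
        q.reshape(grid, step, grid, step, C).mean(axis=(1, 3)).reshape(-1, C)
    )  # (n, C)

    def softmax(z):
        z = z - z.max(axis=-1, keepdims=True)
        e = np.exp(z)
        return e / e.sum(axis=-1, keepdims=True)

    # Dual-branch interaction: contrast tokens attend over all N keys.
    # Score matrix is (n, N), so the cost is O(NnC) rather than O(N^2 C).
    scores = contrast @ k.T / np.sqrt(C)
    pos = softmax(scores) @ v    # positive stream: what a region matches
    neg = softmax(-scores) @ v   # negative stream: what it contrasts against
    summary = pos - neg          # differential interaction, (n, C)

    # Broadcast the n contrastive summaries back to all N tokens.
    weights = softmax(q @ contrast.T / np.sqrt(C))  # (N, n)
    return weights @ summary                        # (N, C)
```

Note that every matrix product involves the n contrast tokens rather than an N×N score matrix, which is where the O(N²C) → O(NnC) reduction comes from; the dual positional embeddings mentioned in the abstract are omitted here for brevity.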