$A^2$: Smaller Self-Supervised ViTs Localize Better than Larger Ones

📅 2026-06-02

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work starts from the observation that the attention maps of large self-supervised Vision Transformers (ViTs) localize foreground objects less well than those of smaller ViTs, which matters for visual classification that depends on ignoring contextual distractors. The authors propose $A^2$, a two-stage approach that uses only pretrained features, needs no group labels, and requires no per-dataset attention or backbone training. In the first stage, the image is cropped around the attention peaks of a small ViT; in the second, a large ViT embeds the crops to produce rich representations. By decoupling where to look from what to extract, $A^2$ combines the localization ability of small models with the representational capacity of large models. Across five benchmarks, it is competitive with backbone-matched loss-level methods such as DFR and outperforms end-to-end attention training under stronger distribution shifts.

📝 Abstract

Robust visual classification often depends on localizing the main foreground objects in an image while ignoring contextual distractors. Surprisingly, we find that the attention maps of smaller self-supervised ViTs localize foreground objects better than those of larger ViTs. However, we still need large ViTs, because they extract richer representations from each patch. To get the best of both worlds, good localization and rich representations, we propose $A^2$, a simple method that leverages this inverse scaling finding by decoupling where to look (a small attention model) from what to extract (a large embedding model): we crop around the attention peaks of a small model and embed the crops with a larger model. $A^2$ uses entirely pretrained features, requires no group labels, and does not require per-dataset attention or backbone training. Across 5 benchmarks, $A^2$ is competitive with backbone-matched loss-level methods like DFR, and outperforms end-to-end attention training under stronger distribution shifts.
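
The two-stage idea in the abstract (crop around a small model's attention peak, then embed the crop with a larger model) can be sketched roughly as follows. This is an illustrative sketch only, not the authors' implementation: the function names, the single-peak choice, the `crop_frac` parameter, and the two callables standing in for the small and large ViTs are all assumptions introduced here. The abstract does not specify how many peaks are used or how large the crops are.

```python
import numpy as np


def crop_around_attention_peak(image, attn, crop_frac=0.5):
    """Crop a window of `image` centered on the strongest attention patch.

    image: (H, W, C) array.
    attn:  (h, w) patch-level attention map from a small model (assumed input).
    crop_frac: crop side length as a fraction of the image side (hypothetical knob).
    """
    H, W = image.shape[:2]
    h, w = attn.shape

    # Locate the patch with the highest attention value.
    r, c = np.unravel_index(np.argmax(attn), attn.shape)

    # Map the patch center from the patch grid to pixel coordinates.
    cy = (r + 0.5) * H / h
    cx = (c + 0.5) * W / w

    # Crop size in pixels, kept between 1 and the full image size.
    ch = min(H, max(1, int(round(H * crop_frac))))
    cw = min(W, max(1, int(round(W * crop_frac))))

    # Shift the window if needed so it stays inside the image.
    top = int(min(max(cy - ch / 2, 0), H - ch))
    left = int(min(max(cx - cw / 2, 0), W - cw))
    return image[top:top + ch, left:left + cw]


def a2_embed(image, small_attention_fn, large_embed_fn, crop_frac=0.5):
    """Decouple 'where to look' from 'what to extract'.

    small_attention_fn: callable returning an (h, w) attention map
                        (stand-in for a small pretrained ViT).
    large_embed_fn:     callable returning an embedding for a crop
                        (stand-in for a large pretrained ViT).
    """
    attn = small_attention_fn(image)
    crop = crop_around_attention_peak(image, attn, crop_frac)
    return large_embed_fn(crop)
```

In practice the two callables would wrap pretrained models and the crop would be resized to the large model's input resolution before embedding; both steps are left out here to keep the sketch self-contained.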

Problem

Research questions and friction points this paper is trying to address.

self-supervised ViTs

object localization

attention maps

representation learning

distribution shift

Innovation

Methods, ideas, or system contributions that make the work stand out.

self-supervised ViT

foreground localization

attention decoupling