Beyond [cls]: Exploring the true potential of Masked Image Modeling representations

📅 2024-12-04
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work investigates the root cause of the poor "out-of-the-box" performance of masked image modeling (MIM) representations, identifying that the [cls] token in standard Vision Transformers (ViTs) fails to aggregate semantic information effectively because its attention is spread almost uniformly across patches. To address this, we propose Selective Aggregation: instead of relying on a single [cls] token, our method selects the most discriminative patch tokens based on token-level semantic importance and aggregates them in a lightweight fashion. Crucially, the approach introduces no additional parameters, requires no extra training data, and operates without fine-tuning. On ImageNet-1K linear probing, it achieves an 8.2% relative improvement over baseline MIM representations. This significantly enhances both the generalization capability and the plug-and-play usability of self-supervised visual representations, offering a practical route for downstream adaptation of MIM features.
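The select-then-aggregate idea above can be sketched in a few lines. This is a minimal illustration only: the L2 norm as the token-importance score, the top-k cutoff, and the norm-proportional weighting are assumptions made for the sketch, not the paper's actual scoring or aggregation scheme.

```python
import numpy as np

def selective_aggregation(patch_tokens, k=4):
    """Aggregate the k most 'important' patch tokens of a ViT
    instead of relying on the [cls] token.

    patch_tokens: (N, D) array of patch embeddings.
    Importance is approximated here by each token's L2 norm --
    a hypothetical stand-in for the paper's token-level score.
    """
    scores = np.linalg.norm(patch_tokens, axis=1)       # (N,) importance per token
    top_idx = np.argsort(scores)[-k:]                   # indices of the k top-scoring tokens
    weights = scores[top_idx] / scores[top_idx].sum()   # normalize scores into weights
    return weights @ patch_tokens[top_idx]              # (D,) aggregated feature vector

# Toy usage: 8 patch tokens with 16-dim embeddings.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(8, 16))
feature = selective_aggregation(tokens, k=4)  # single (16,) image-level feature
```

Note that, consistent with the summary, this sketch adds no learned parameters: the scoring and weighting are computed directly from the frozen patch embeddings.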

📝 Abstract
Masked Image Modeling (MIM) has emerged as a promising approach for Self-Supervised Learning (SSL) of visual representations. However, the out-of-the-box performance of MIMs is typically inferior to competing approaches. Most users cannot afford fine-tuning due to the need for large amounts of data, high GPU consumption, and specialized user knowledge. Therefore, the practical use of MIM representations is limited. In this paper we ask what is the reason for the poor out-of-the-box performance of MIMs. Is it due to weaker features produced by MIM models, or is it due to suboptimal usage? Through detailed analysis, we show that attention in MIMs is spread almost uniformly over many patches, leading to ineffective aggregation by the [cls] token. Based on this insight, we propose Selective Aggregation to better capture the rich semantic information retained in patch tokens, which significantly improves the out-of-the-box performance of MIM.
Problem

Research questions and friction points this paper is trying to address.

Investigating poor out-of-the-box performance of Masked Image Modeling
Analyzing ineffective attention aggregation in MIM representations
Proposing Selective Aggregation to enhance MIM feature utilization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Selective Aggregation for MIM representations
Improves out-of-the-box MIM performance
Analyzes attention spread in MIM models
👥 Authors
Marcin Przewiȩźlikowski (Jagiellonian University)
Randall Balestriero
Wojciech Jasiński (AGH University of Science and Technology)
Marek Śmieja (Jagiellonian University)
Bartosz Zieliński (Jagiellonian University)