🤖 AI Summary
SimMIM produces representations with weak linear probing performance, limiting its downstream transfer. To address this, the paper proposes KAMIM, a framework that integrates keypoint detection into the masked image modeling (MIM) paradigm. KAMIM introduces a spatially adaptive weighting over ViT patch features during reconstruction, enhancing local structural awareness and contextual modeling, and thereby aligning MIM representations more closely with the properties of contrastive learning. On ImageNet-1K, KAMIM boosts the linear probing accuracy of ViT-B from 16.12% to 33.97% and improves fine-tuning accuracy from 76.78% to 77.30%. The implementation is publicly available.
📝 Abstract
SimMIM is a widely used method for pretraining vision transformers with masked image modeling. However, despite its strong fine-tuning performance, it has been shown to perform sub-optimally under linear probing. We propose an efficient patch-wise weighting derived from keypoint features, which captures local information and provides better context during SimMIM's reconstruction phase. Our method, KAMIM, improves top-1 linear probing accuracy from 16.12% to 33.97% and fine-tuning accuracy from 76.78% to 77.30% on the ImageNet-1K dataset with a ViT-B trained for the same number of epochs. We conduct extensive testing across datasets, keypoint extractors, and model architectures, and observe that patch-wise weighting augments linear probing performance for larger pretraining datasets. We also analyze the learned representations of a ViT-B trained with KAMIM and observe that they behave similarly to those from contrastive learning, with longer attention distances and homogeneous self-attention across layers. Our code is publicly available at https://github.com/madhava20217/KAMIM.
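To make the idea of keypoint-derived patch-wise weighting concrete, the sketch below counts detected keypoints inside each ViT patch and uses the resulting density to scale a SimMIM-style L1 reconstruction loss over masked patches. This is an illustrative approximation, not the paper's implementation: the exponential weighting form, the `temperature` parameter, and the function names are assumptions, and the keypoints are passed in as plain pixel coordinates rather than produced by a specific detector.

```python
import numpy as np

def patch_weights(keypoints, img_size=224, patch_size=16, temperature=0.25):
    """Map keypoint pixel coordinates to a per-patch weight grid.

    Illustrative sketch only: the exponential form and `temperature`
    are assumptions, not KAMIM's published formulation.
    """
    n = img_size // patch_size                 # patches per side (14 for ViT-B/16)
    counts = np.zeros((n, n))
    for x, y in keypoints:                     # (x, y) in pixel coordinates
        counts[int(y) // patch_size, int(x) // patch_size] += 1
    # Normalize counts to [0, 1]; patches rich in keypoints get weights > 1,
    # empty patches keep weight 1, so no patch is suppressed entirely.
    density = counts / max(counts.max(), 1)
    return np.exp(density / temperature)

def weighted_reconstruction_loss(pred, target, mask, weights):
    """L1 reconstruction loss over masked patches, scaled patch-wise.

    pred, target: (n, n, d) per-patch pixel predictions and ground truth.
    mask:         (n, n) binary mask, 1 where the patch was masked out.
    weights:      (n, n) per-patch weights from `patch_weights`.
    """
    per_patch = np.abs(pred - target).mean(axis=-1)          # (n, n)
    return (per_patch * weights * mask).sum() / max(mask.sum(), 1)
```

In this sketch, a patch containing the maximum keypoint count receives weight `exp(1 / temperature)` while keypoint-free patches receive weight 1, so reconstruction errors on structurally rich regions dominate the loss.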