🤖 AI Summary
SimMIM produces representations with weak linear probing performance, limiting its downstream transfer. To address this, the paper proposes KAMIM, a framework that integrates keypoint detection into the masked image modeling (MIM) paradigm. KAMIM introduces a spatially adaptive weighting over ViT patch features during reconstruction, enhancing local structural awareness and contextual modeling, and thereby aligning MIM representations more closely with the properties of contrastive learning. On ImageNet-1K, KAMIM boosts the linear probing accuracy of ViT-B from 16.12% to 33.97% and improves fine-tuning accuracy from 76.78% to 77.30%. The implementation is publicly available.
📝 Abstract
SimMIM is a widely used method for pretraining vision transformers with masked image modeling. However, despite its strong fine-tuning performance, it has been shown to perform sub-optimally under linear probing. We propose an efficient patch-wise weighting derived from keypoint features, which captures local information and provides better context during SimMIM's reconstruction phase. Our method, KAMIM, improves top-1 linear probing accuracy from 16.12% to 33.97% and fine-tuning accuracy from 76.78% to 77.30% on the ImageNet-1K dataset with a ViT-B trained for the same number of epochs. We conduct extensive testing across datasets, keypoint extractors, and model architectures, and observe that patch-wise weighting augments linear probing performance for larger pretraining datasets. We also analyze the learned representations of a ViT-B trained with KAMIM and observe that they behave similarly to those from contrastive learning, with longer attention distances and homogeneous self-attention across layers. Our code is publicly available at https://github.com/madhava20217/KAMIM.
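To make the idea of keypoint-derived patch-wise weighting concrete, the sketch below counts detected keypoints inside each ViT patch and uses the resulting density to scale a SimMIM-style L1 reconstruction loss over masked patches. This is an illustrative approximation, not the paper's implementation: the exponential weighting form, the `temperature` parameter, and the function names are assumptions, and the keypoints are passed in as plain pixel coordinates rather than produced by a specific detector.

```python
import numpy as np

def patch_weights(keypoints, img_size=224, patch_size=16, temperature=0.25):
    """Map keypoint pixel coordinates to a per-patch weight grid.

    Illustrative sketch only: the exponential form and `temperature`
    are assumptions, not KAMIM's published formulation.
    """
    n = img_size // patch_size                 # patches per side (14 for ViT-B/16)
    counts = np.zeros((n, n))
    for x, y in keypoints:                     # (x, y) in pixel coordinates
        counts[int(y) // patch_size, int(x) // patch_size] += 1
    # Normalize counts to [0, 1]; patches rich in keypoints get weights > 1,
    # empty patches keep weight 1, so no patch is suppressed entirely.
    density = counts / max(counts.max(), 1)
    return np.exp(density / temperature)

def weighted_reconstruction_loss(pred, target, mask, weights):
    """L1 reconstruction loss over masked patches, scaled patch-wise.

    pred, target: (n, n, d) per-patch pixel predictions and ground truth.
    mask:         (n, n) binary mask, 1 where the patch was masked out.
    weights:      (n, n) per-patch weights from `patch_weights`.
    """
    per_patch = np.abs(pred - target).mean(axis=-1)          # (n, n)
    return (per_patch * weights * mask).sum() / max(mask.sum(), 1)
```

In this sketch, a patch containing the maximum keypoint count receives weight `exp(1 / temperature)` while keypoint-free patches receive weight 1, so reconstruction errors on structurally rich regions dominate the loss.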