Keypoint Aware Masked Image Modelling

📅 2024-07-18
🏛️ arXiv.org
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
SimMIM's representations transfer poorly under linear probing, limiting its downstream performance. To address this, the paper proposes Keypoint Aware Masked Image Modelling (KAMIM), which integrates keypoint detection into the masked image modeling (MIM) paradigm. KAMIM applies a keypoint-derived, patch-wise weighting over ViT patch reconstructions, enhancing local structural awareness and contextual modeling and bringing MIM representations closer to those learned by contrastive methods. On ImageNet-1K, KAMIM raises the linear probing accuracy of a ViT-B from 16.12% to 33.97% and fine-tuning accuracy from 76.78% to 77.3%. The implementation is publicly available.

๐Ÿ“ Abstract
SimMIM is a widely used method for pretraining vision transformers using masked image modeling. However, despite its success in fine-tuning performance, it has been shown to perform sub-optimally when used for linear probing. We propose an efficient patch-wise weighting derived from keypoint features which captures local information and provides better context during SimMIM's reconstruction phase. Our method, KAMIM, improves top-1 linear probing accuracy from 16.12% to 33.97%, and fine-tuning accuracy from 76.78% to 77.3%, on the ImageNet-1K dataset with a ViT-B trained for the same number of epochs. We conduct extensive testing on different datasets, keypoint extractors, and model architectures and observe that patch-wise weighting augments linear probing performance for larger pretraining datasets. We also analyze the learned representations of a ViT-B trained using KAMIM and observe that they behave similarly to those learned via contrastive learning, with longer attention distances and homogeneous self-attention across layers. Our code is publicly available at https://github.com/madhava20217/KAMIM.
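The patch-wise weighting the abstract describes can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the `patch_weights` function, its `alpha` parameter, and the density-to-weight mapping are hypothetical choices for exposition; KAMIM derives its weights from an actual keypoint detector (see the paper and repository for details).

```python
import numpy as np

def patch_weights(keypoints, img_size=224, patch=16, alpha=1.0):
    """Hypothetical per-patch weights from keypoint density.

    keypoints: iterable of (x, y) pixel coordinates from some detector.
    Returns a flat (num_patches,) weight map: 1.0 as a floor everywhere,
    higher where keypoints are dense.
    """
    grid = img_size // patch  # e.g. 14x14 patches for a ViT-B/16 at 224px
    counts = np.zeros((grid, grid))
    for x, y in keypoints:
        counts[min(int(y) // patch, grid - 1),
               min(int(x) // patch, grid - 1)] += 1
    # normalize by the densest patch so weights lie in [1, 1 + alpha]
    return (1.0 + alpha * counts / max(counts.max(), 1)).ravel()

def weighted_recon_loss(pred, target, mask, weights):
    """L1 reconstruction loss over masked patches, weighted per patch.

    pred, target: (num_patches, dim) patch pixel values.
    mask: (num_patches,) with 1 for masked (reconstructed) patches.
    """
    per_patch = np.abs(pred - target).mean(axis=-1)  # (num_patches,)
    w = weights * mask                               # masked patches only
    return (per_patch * w).sum() / max(w.sum(), 1e-8)
```

The intent is that patches containing keypoints (edges, corners, object parts) contribute more to the reconstruction objective than textureless background, which is what pushes the learned features toward the stronger linear-probing behavior reported above.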
Problem

Research questions and friction points this paper is trying to address.

SimMIM method
linear probing
visual model performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

KAMIM
weight allocation
visual pretraining