🤖 AI Summary
Masked image modeling (MIM) suffers from optimization difficulty and slow convergence in early training because the model lacks foundational visual capabilities, which hinders it from fitting complex image distributions. To address this, we propose a prototype-driven curriculum learning framework built on a temperature-regulated, distributional progressive curriculum strategy: typicality priors are constructed via prototype sampling, and soft distribution expansion, controlled by a learnable temperature parameter, enables a staged learning evolution from easy to hard examples and from concentrated to generalized representations, departing from the fixed-schedule data scheduling conventional in MIM. Integrated into a masked autoencoder architecture, our method significantly improves representation quality and training efficiency on ImageNet-1K, reducing the required training epochs by over 40% at equivalent performance.
📝 Abstract
Masked Image Modeling (MIM) has emerged as a powerful self-supervised learning paradigm for visual representation learning, enabling models to acquire rich visual representations by predicting masked portions of images from their visible regions. While this approach has shown promising results, we hypothesize that its effectiveness may be limited by optimization challenges during early training stages, where models are expected to learn complex image distributions from partial observations before developing basic visual processing capabilities. To address this limitation, we propose a prototype-driven curriculum learning framework that structures the learning process to progress from prototypical examples to more complex variations in the dataset. Our approach introduces a temperature-based annealing scheme that gradually expands the training distribution, enabling more stable and efficient learning trajectories. Through extensive experiments on ImageNet-1K, we demonstrate that our curriculum learning strategy significantly improves both training efficiency and representation quality while requiring substantially fewer training epochs than standard masked autoencoding. Our findings suggest that carefully controlling the order of training examples plays a crucial role in self-supervised visual learning, providing a practical solution to the early-stage optimization challenges in MIM.
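The temperature-based annealing idea described above can be sketched concretely. Below is a minimal illustration, assuming one plausible formulation (the paper does not specify these equations or function names): each image gets a typicality score, e.g. similarity to a class prototype; sampling probabilities are a softmax over those scores; and the temperature is annealed upward so the distribution expands from concentrated (prototypical images dominate) toward near-uniform (all variations included).

```python
import numpy as np

def sampling_probs(typicality, temperature):
    """Softmax over typicality scores. Low temperature concentrates
    sampling on the most prototypical examples; high temperature
    flattens toward uniform. (Hypothetical formulation.)"""
    logits = typicality / temperature
    logits = logits - logits.max()  # numerical stability
    p = np.exp(logits)
    return p / p.sum()

def annealed_temperature(epoch, total_epochs, t_start=0.1, t_end=5.0):
    """Linear annealing schedule (illustrative; the paper uses a
    learnable temperature rather than a fixed schedule)."""
    frac = epoch / max(total_epochs - 1, 1)
    return t_start + frac * (t_end - t_start)

# Toy example: 5 images scored by closeness to a class prototype.
typicality = np.array([0.9, 0.7, 0.5, 0.3, 0.1])
p_early = sampling_probs(typicality, annealed_temperature(0, 100))
p_late = sampling_probs(typicality, annealed_temperature(99, 100))
```

Early in training, `p_early` puts most of its mass on the most prototypical image, while `p_late` is close to uniform, so harder, less typical variations enter the curriculum only once the model has basic visual capabilities.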