🤖 AI Summary
To address the high computational cost and insufficient robustness to data corruption in large-scale vision foundation models (e.g., DINOv2), this paper proposes a self-supervised pretraining strategy integrating frequency-domain curriculum learning with Gaussian noise patch augmentation. It introduces, for the first time, curriculum learning in the frequency domain: low-frequency semantic information is prioritized early in training, while high-frequency details are progressively incorporated. Concurrently, controllable Gaussian noise patches are injected into input images to improve robustness. Evaluated on ViT-B/16 using ImageNet-1K, the method reduces pretraining time by 1.6× and FLOPs by 2.25×, achieves robustness to ImageNet-C corruptions comparable to the baseline, and maintains competitive linear probing accuracy. The core contributions are (i) a novel frequency-domain curriculum learning paradigm and (ii) a synergistic mechanism between frequency-aware curricula and noise-based augmentation that jointly improves efficiency and robustness.
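The low-frequency-first curriculum described above can be sketched with a simple FFT-based low-pass filter whose cutoff grows over training. This is a minimal illustration, not the paper's implementation: the function names, the circular frequency mask, and the linear cutoff schedule (with hypothetical `start` value) are all assumptions for illustration.

```python
import numpy as np

def low_pass_filter(image, cutoff_ratio):
    """Keep only the frequencies within cutoff_ratio of the maximum
    frequency radius (illustrative parameterization).

    image: 2D grayscale array; cutoff_ratio in (0, 1], where 1.0
    keeps the full spectrum.
    """
    h, w = image.shape
    spectrum = np.fft.fftshift(np.fft.fft2(image))
    # Centered circular mask whose radius grows with cutoff_ratio.
    yy, xx = np.ogrid[:h, :w]
    dist = np.sqrt((yy - h / 2) ** 2 + (xx - w / 2) ** 2)
    radius = cutoff_ratio * np.sqrt((h / 2) ** 2 + (w / 2) ** 2)
    mask = dist <= radius
    filtered = np.fft.ifft2(np.fft.ifftshift(spectrum * mask))
    return np.real(filtered)

def curriculum_cutoff(step, total_steps, start=0.2):
    """Linear schedule: begin with low frequencies only, end with the
    full band. The schedule shape and start value are assumptions."""
    return start + (1.0 - start) * min(step / total_steps, 1.0)
```

During pre-training, each batch would be filtered with `low_pass_filter(img, curriculum_cutoff(step, total_steps))`, so early steps see only coarse, low-frequency structure while later steps see the full image.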
📝 Abstract
Large-scale vision foundation models such as DINOv2 achieve impressive performance by leveraging massive architectures and training datasets. Yet many scenarios require practitioners to reproduce such pre-training, for instance on private data, on new modalities, or simply for scientific inquiry, which is currently extremely demanding in computation. We thus propose a novel pre-training strategy for DINOv2 that accelerates convergence and, as a by-product, strengthens robustness to common corruptions. Our approach combines a frequency-filtering curriculum, in which low frequencies are seen first, with a Gaussian noise patching augmentation. Applied to a ViT-B/16 backbone trained on ImageNet-1K, our method reduces pre-training time and FLOPs by 1.6x and 2.25x respectively, while matching the baseline's robustness on corruption benchmarks (ImageNet-C) and maintaining competitive linear probing performance. This dual benefit of efficiency and robustness makes large-scale self-supervised foundation modeling more attainable, and opens the door to exploring data curricula and augmentation as means of improving the robustness of self-supervised models. The code is available at https://github.com/KevinZ0217/fast_dinov2
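The Gaussian noise patching augmentation mentioned in the abstract can be sketched as adding Gaussian noise to randomly placed square regions of the input. This is a hedged sketch only: the patch size, patch count, noise scale, and value range are illustrative defaults, not the paper's actual settings.

```python
import numpy as np

def gaussian_noise_patch(image, patch_size=16, num_patches=4, sigma=0.1, rng=None):
    """Add zero-mean Gaussian noise to randomly placed square patches.

    image: float array in [0, 1], shape (H, W) or (H, W, C).
    All parameter names and defaults are illustrative; the paper's
    exact patch geometry and noise scale may differ.
    """
    if rng is None:
        rng = np.random.default_rng()
    out = image.copy()
    h, w = image.shape[:2]
    for _ in range(num_patches):
        # Pick the top-left corner so the patch fits inside the image.
        y = int(rng.integers(0, max(h - patch_size, 1)))
        x = int(rng.integers(0, max(w - patch_size, 1)))
        region = out[y:y + patch_size, x:x + patch_size]
        region += rng.normal(0.0, sigma, region.shape)
    # Keep pixel values in a valid range after noise injection.
    return np.clip(out, 0.0, 1.0)
```

Only the selected patches are perturbed, so the model must learn features that remain stable under localized corruption, which is consistent with the robustness gains the abstract reports on ImageNet-C.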