FastDINOv2: Frequency Based Curriculum Learning Improves Robustness and Training Speed

📅 2025-07-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the high computational cost and limited robustness to data corruption of large-scale vision foundation models (e.g., DINOv2), this paper proposes a self-supervised pretraining strategy that combines frequency-domain curriculum learning with Gaussian noise patch augmentation. It introduces, for the first time, curriculum learning in the frequency domain: low-frequency semantic content is prioritized early in training, while high-frequency detail is progressively incorporated. Concurrently, controllable Gaussian noise patches are injected into input images to improve robustness. Evaluated on a ViT-B/16 trained on ImageNet-1K, the method reduces pretraining time by 1.6× and FLOPs by 2.25×, matches the baseline's robustness on the ImageNet-C corruption benchmark, and maintains competitive linear-probe accuracy. The core contributions are (i) a novel frequency-domain curriculum learning paradigm and (ii) a synergistic mechanism between frequency-aware curricula and noise-based augmentation that jointly improves efficiency and robustness.
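The summary describes the curriculum only at a high level. A minimal sketch of a low-frequency-first schedule, using a radial FFT mask whose cutoff grows over training, could look like the following; the linear schedule, the starting cutoff of 0.2, and all function names are illustrative assumptions, not the paper's exact recipe.

```python
import numpy as np

def low_pass_filter(img, cutoff_frac):
    """Keep only spatial frequencies below cutoff_frac of the Nyquist radius.

    img: 2D array (H, W). cutoff_frac in (0, 1]; 1.0 keeps the full spectrum.
    """
    h, w = img.shape
    fy = np.fft.fftfreq(h)[:, None]          # vertical frequencies
    fx = np.fft.fftfreq(w)[None, :]          # horizontal frequencies
    radius = np.sqrt(fx ** 2 + fy ** 2)      # radial frequency per bin
    mask = radius <= cutoff_frac * 0.5       # 0.5 = Nyquist frequency
    spec = np.fft.fft2(img)
    return np.real(np.fft.ifft2(spec * mask))

def curriculum_cutoff(step, total_steps, start=0.2):
    """Grow the cutoff linearly from `start` to 1.0 over training (assumed schedule)."""
    return start + (1.0 - start) * min(step / total_steps, 1.0)
```

During pretraining, each batch would be filtered with `low_pass_filter(img, curriculum_cutoff(step, total_steps))`, so early steps see only coarse, low-frequency structure and later steps see the full image.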

📝 Abstract
Large-scale vision foundation models such as DINOv2 boast impressive performances by leveraging massive architectures and training datasets. But numerous scenarios require practitioners to reproduce those pre-training solutions, such as on private data, new modalities, or simply for scientific questioning--which is currently extremely demanding computation-wise. We thus propose a novel pre-training strategy for DINOv2 that simultaneously accelerates convergence--and strengthens robustness to common corruptions as a by-product. Our approach involves a frequency filtering curriculum--low-frequency being seen first--and the Gaussian noise patching augmentation. Applied to a ViT-B/16 backbone trained on ImageNet-1K, while pre-training time and FLOPs are reduced by 1.6x and 2.25x, our method still achieves matching robustness in corruption benchmarks (ImageNet-C) and maintains competitive linear probing performance compared with baseline. This dual benefit of efficiency and robustness makes large-scale self-supervised foundation modeling more attainable, while opening the door to novel exploration around data curriculum and augmentation as means to improve self-supervised learning models robustness. The code is available at https://github.com/KevinZ0217/fast_dinov2
Problem

Research questions and friction points this paper is trying to address.

Accelerating DINOv2 pre-training convergence
Improving model robustness to common corruptions
Reducing the computational cost of foundation-model pre-training
Innovation

Methods, ideas, or system contributions that make the work stand out.

Frequency filtering curriculum learning
Gaussian noise patching augmentation
1.6× lower pre-training time and 2.25× fewer FLOPs
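The Gaussian noise patching augmentation listed above can be sketched as adding noise to one randomly placed square region of the input; the patch fraction and noise scale below are illustrative assumptions, not the paper's reported settings.

```python
import numpy as np

def gaussian_noise_patch(img, rng, patch_frac=0.25, sigma=0.5):
    """Add Gaussian noise to one randomly placed square patch of a 2D image.

    img: 2D array (H, W). patch_frac: patch side as a fraction of image side
    (assumed). sigma: noise standard deviation (assumed). rng: np.random.Generator.
    """
    img = img.copy()
    h, w = img.shape
    ph, pw = int(h * patch_frac), int(w * patch_frac)
    y = rng.integers(0, h - ph + 1)          # random top-left corner
    x = rng.integers(0, w - pw + 1)
    img[y:y + ph, x:x + pw] += rng.normal(0.0, sigma, size=(ph, pw))
    return img
```

Pixels outside the patch are left untouched, so the augmentation corrupts only a controllable portion of each image.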