Dataset Distillation via Knowledge Distillation: Towards Efficient Self-Supervised Pre-Training of Deep Networks

📅 2024-10-03
🏛️ arXiv.org
📈 Citations: 1
Influential: 0
🤖 AI Summary
This work addresses a critical gap: dataset distillation (DD) for self-supervised learning (SSL), proposing the first effective DD framework tailored to SSL pre-training. Conventional DD methods rely on supervised signals and fail in the label-free SSL setting because of the high variance of SSL gradients. To overcome this, the authors draw on knowledge distillation (KD): small student models are trained to match the representations of a larger SSL-pretrained teacher, and the synthetic dataset is then generated by matching the students' training trajectories, sidestepping the instability of applying standard DD directly to SSL objectives. The framework enables SSL pre-training from a tiny synthetic dataset (e.g., 100 images), achieves up to 13% higher accuracy than prior methods across multiple downstream tasks with limited labeled data, and provides the first effective, scalable distillation solution for resource-constrained SSL pre-training.
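A minimal sketch of the first stage described above, assuming a generic PyTorch setup (function and variable names here are hypothetical, not taken from the authors' released MKDT code): small student models are trained to regress a frozen SSL teacher's representations, and their saved checkpoints form the expert trajectories used later for distillation.

```python
# Stage 1 sketch (hypothetical names): distill a frozen SSL teacher into a small
# student via representation matching, recording checkpoints as an expert trajectory.
import copy
import torch
import torch.nn.functional as F

def train_expert_trajectory(student, teacher, loader, epochs=20, lr=0.1):
    """Return a list of student parameter snapshots (the expert trajectory)."""
    teacher.eval()                                    # frozen SSL-pretrained teacher
    opt = torch.optim.SGD(student.parameters(), lr=lr, momentum=0.9)
    snapshot = lambda: {k: v.detach().clone() for k, v in student.named_parameters()}
    trajectory = [snapshot()]
    for _ in range(epochs):
        for x in loader:                              # unlabeled images only
            with torch.no_grad():
                target = teacher(x)                   # fixed per-sample target
            loss = F.mse_loss(student(x), target)     # low-variance KD objective
            opt.zero_grad()
            loss.backward()
            opt.step()
        trajectory.append(snapshot())                 # one checkpoint per epoch
    return trajectory
```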

📝 Abstract
Dataset distillation (DD) generates small synthetic datasets that can efficiently train deep networks with a limited amount of memory and compute. Despite the success of DD methods for supervised learning, DD for self-supervised pre-training of deep models has remained unaddressed. Pre-training on unlabeled data is crucial for efficiently generalizing to downstream tasks with limited labeled data. In this work, we propose the first effective DD method for SSL pre-training. First, we show, theoretically and empirically, that naive application of supervised DD methods to SSL fails, due to the high variance of the SSL gradient. Then, we address this issue by relying on insights from knowledge distillation (KD) literature. Specifically, we train a small student model to match the representations of a larger teacher model trained with SSL. Then, we generate a small synthetic dataset by matching the training trajectories of the student models. As the KD objective has considerably lower variance than SSL, our approach can generate synthetic datasets that can successfully pre-train high-quality encoders. Through extensive experiments, we show that our distilled sets lead to up to 13% higher accuracy than prior work, on a variety of downstream tasks, in the presence of limited labeled data. Code at https://github.com/BigML-CS-UCLA/MKDT.
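A rough sketch of the second stage, trajectory matching on the distilled images, again with hypothetical names and a simplified, buffer-free encoder rather than the authors' exact implementation: starting from a saved expert checkpoint, a student is unrolled for a few differentiable SGD steps on the synthetic set, and the synthetic images are updated so the student's parameters land near a later expert checkpoint.

```python
# Stage 2 sketch (hypothetical names): optimize synthetic images so that training a
# student on them reproduces the saved expert trajectory from Stage 1.
import torch
import torch.nn.functional as F
from torch.func import functional_call

def trajectory_matching_step(student, expert_traj, syn_images, syn_targets,
                             image_opt, start=0, expert_steps=2, student_steps=10,
                             inner_lr=0.01):
    # Expert checkpoints bracketing the trajectory segment to match.
    theta_start = {k: v.detach() for k, v in expert_traj[start].items()}
    theta_end = {k: v.detach() for k, v in expert_traj[start + expert_steps].items()}

    # Unroll differentiable SGD on the synthetic set from the start checkpoint.
    params = {k: v.clone().requires_grad_(True) for k, v in theta_start.items()}
    for _ in range(student_steps):
        preds = functional_call(student, params, (syn_images,))
        loss = F.mse_loss(preds, syn_targets)          # KD targets stored with the images
        grads = torch.autograd.grad(loss, list(params.values()), create_graph=True)
        params = {k: p - inner_lr * g
                  for (k, p), g in zip(params.items(), grads)}

    # Normalized distance between the student's endpoint and the expert's endpoint.
    num = sum(((params[k] - theta_end[k]) ** 2).sum() for k in params)
    den = sum(((theta_start[k] - theta_end[k]) ** 2).sum() for k in params) + 1e-8
    match_loss = num / den

    image_opt.zero_grad()
    match_loss.backward()                              # gradient flows to syn_images
    image_opt.step()
    return match_loss.item()
```

Here `syn_images` would be a leaf tensor initialized from a handful of real images with `requires_grad=True` and managed by `image_opt` (e.g., `torch.optim.SGD([syn_images], lr=...)`); repeatedly calling `trajectory_matching_step` with random start indices refines the synthetic set.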
Problem

Research questions and friction points this paper is trying to address.

Dataset distillation has not previously been extended to self-supervised pre-training
Naive application of supervised DD methods fails under SSL due to high gradient variance
Efficient pre-training for downstream tasks where labeled data is limited
Innovation

Methods, ideas, or system contributions that make the work stand out.

First effective dataset distillation method for SSL pre-training
Knowledge distillation from an SSL-trained teacher provides a low-variance objective suitable for trajectory matching
Distilled synthetic sets pre-train high-quality encoders, yielding up to 13% higher downstream accuracy