🤖 AI Summary
Current dataset compression methods rely heavily on soft labels generated by large pre-trained models while neglecting the intrinsic information in the images themselves, and existing distillation and pruning approaches lack a unified evaluation benchmark, leading to unfair comparisons and poor reproducibility. To address these issues, the authors establish the first fair, unified benchmark for systematically evaluating mainstream compression techniques. They demonstrate empirically, for the first time, that soft labels can mislead performance evaluation in large-scale compression settings. They further propose PCA, a purely image-driven framework built on Pruning, Combining, and Augmenting, which operates solely on raw images and hard labels, eliminating dependence on pre-trained models and soft-label generation. PCA employs generation-free image selection, image-level compositional augmentation, and hard-label supervised evaluation. On benchmarks including ImageNet, it achieves state-of-the-art accuracy while substantially reducing storage and computational overhead, improving both reproducibility and efficiency.
📝 Abstract
Dataset distillation and dataset pruning are two prominent techniques for compressing datasets to improve computational and storage efficiency. Despite their overlapping objectives, these approaches are rarely compared directly. Even within each field, evaluation protocols are inconsistent across methods, which complicates fair comparison and hinders reproducibility. Considering these limitations, we introduce a benchmark that equitably evaluates methodologies across both the distillation and pruning literatures. Notably, our benchmark reveals that in the mainstream dataset distillation setting for large-scale datasets, which relies heavily on soft labels from pre-trained models, even randomly selected subsets can achieve surprisingly competitive performance. This finding suggests that an overemphasis on soft labels may be diverting attention from the intrinsic value of the image data, while also imposing additional burdens in terms of generation, storage, and application. To address these issues, we propose a new framework for dataset compression, termed Prune, Combine, and Augment (PCA), which leverages image data exclusively, relies solely on hard labels for evaluation, and achieves state-of-the-art performance in this setup. By shifting the emphasis back to the images, our benchmark and PCA framework pave the way for more balanced and accessible techniques in dataset compression research. Our code is available at: https://github.com/ArmandXiao/Rethinking-Dataset-Compression
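To make the three-stage structure of PCA concrete, here is a minimal, hedged sketch of such a pipeline on toy NumPy arrays. The abstract specifies only the stage names (Prune, Combine, Augment) and that the framework uses raw images with hard labels; the specific operators below (random subset selection, a CutMix-style patch paste, a horizontal flip) are illustrative stand-ins, not the authors' actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def prune(images, labels, keep_ratio):
    """Prune: keep a random subset of the data. (The benchmark's finding
    that random subsets are competitive motivates this simple choice;
    the paper's selection rule may differ.)"""
    n_keep = max(1, int(len(images) * keep_ratio))
    idx = rng.choice(len(images), size=n_keep, replace=False)
    return images[idx], labels[idx]

def combine(images, labels):
    """Combine: paste a patch from a shuffled partner image into each
    image, an image-level compositional step. Hard labels are kept
    unchanged (no soft-label mixing)."""
    n, h, w, c = images.shape
    perm = rng.permutation(n)
    out = images.copy()
    ph, pw = h // 2, w // 2                      # patch size: half the image
    y = rng.integers(0, h - ph + 1)
    x = rng.integers(0, w - pw + 1)
    out[:, y:y + ph, x:x + pw] = images[perm, y:y + ph, x:x + pw]
    return out, labels

def augment(images):
    """Augment: a simple horizontal flip as a stand-in for the
    framework's augmentation stage."""
    return images[:, :, ::-1, :]

# Toy data standing in for an image dataset: 16 RGB images of size 8x8.
images = rng.random((16, 8, 8, 3))
labels = rng.integers(0, 4, size=16)

pruned_x, pruned_y = prune(images, labels, keep_ratio=0.5)
combined_x, combined_y = combine(pruned_x, pruned_y)
final_x = augment(combined_x)
```

The resulting `final_x` / `combined_y` pair would then feed a standard supervised training loop with hard labels, with no pre-trained teacher or soft-label generation anywhere in the pipeline.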