🤖 AI Summary
This work addresses the high computational cost of large-scale training and the limited robustness of existing data pruning methods across varying pruning ratios or data distributions. The authors model the dataset as a weighted graph, where node weights capture the intrinsic value of individual samples and edge weights encode extrinsic relationships among them. For the first time, this approach unifies intrinsic and extrinsic pruning signals and formulates data pruning as a maximum-weight clique optimization problem with theoretical approximation guarantees. A greedy algorithm based on marginal gain is employed to solve this problem efficiently. The framework accommodates diverse importance metrics and provides a general objective function along with practical design principles. Experiments on ImageNet-1k with ResNet-50 demonstrate over 40% reduction in training time while preserving model accuracy.
📝 Abstract
The rapid growth of modern training datasets has significantly increased computational cost, motivating dataset pruning (DP) methods, which retain only a subset of informative samples to reduce training cost.
Existing pruning criteria typically rely on either intrinsic signals that assess samples independently or extrinsic signals that promote diversity via pairwise relations.
While each is effective in its own regime, it captures only one aspect of sample utility and lacks robustness across different pruning ratios or data distributions.
In this work, we present a unified graph-based DP framework.
By modeling the dataset as a weighted graph, where node weights encode intrinsic value and edge weights encode extrinsic value, DP can be cast as a Maximum Weight Clique Problem (MWCP).
Although MWCP is NP-hard, its structure admits a principled greedy solution based on sample-wise marginal gains.
Under a few mild conditions, we further prove that this unified objective enjoys a formal approximation guarantee, which applies to a broad family of importance metrics and provides practical design guidelines.
Extensive experiments show that our method outperforms existing DP methods while substantially reducing training cost: on ImageNet-1k with ResNet-50, it cuts training time by over 40% without sacrificing accuracy.
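
To make the greedy, marginal-gain idea concrete, the following is a minimal sketch and not the authors' implementation. It assumes the objective for a retained subset is the sum of its node weights plus the sum of its pairwise edge weights, that the edge-weight matrix is symmetric, and that the budget `k` is fixed in advance; the abstract does not state the exact objective, so these are illustrative assumptions. The names `greedy_prune`, `objective`, `node_w`, and `edge_w` are hypothetical.

```python
import numpy as np


def greedy_prune(node_w, edge_w, k):
    """Greedily pick k samples by marginal gain (illustrative sketch).

    node_w: shape (n,), assumed intrinsic value of each sample.
    edge_w: shape (n, n), assumed symmetric extrinsic value of each pair.
    """
    node_w = np.asarray(node_w, dtype=float)
    edge_w = np.asarray(edge_w, dtype=float)
    n = node_w.shape[0]

    # Marginal gain of adding each sample to the (initially empty) subset.
    gain = node_w.copy()
    available = np.ones(n, dtype=bool)
    selected = []

    for _ in range(min(k, n)):
        # Ignore samples that are already selected.
        masked = np.where(available, gain, -np.inf)
        i = int(np.argmax(masked))
        selected.append(i)
        available[i] = False
        # Adding i raises every other sample's gain by its edge weight to i.
        gain += edge_w[i]

    return selected


def objective(node_w, edge_w, subset):
    """Assumed objective: node weights plus pairwise edge weights in subset."""
    node_w = np.asarray(node_w, dtype=float)
    edge_w = np.asarray(edge_w, dtype=float)
    idx = np.asarray(subset, dtype=int)
    sub = edge_w[np.ix_(idx, idx)]
    # Each unordered pair appears twice in the symmetric submatrix.
    return node_w[idx].sum() + (sub.sum() - np.trace(sub)) / 2.0


# Toy example: sample 1 is nearly redundant with sample 0 (edge weight 0),
# while sample 2 is very different from sample 0 (edge weight 1).
toy_node_w = [1.0, 0.9, 0.8]
toy_edge_w = np.array([
    [0.0, 0.0, 1.0],
    [0.0, 0.0, 0.5],
    [1.0, 0.5, 0.0],
])
toy_selected = greedy_prune(toy_node_w, toy_edge_w, k=2)
toy_value = objective(toy_node_w, toy_edge_w, toy_selected)
```

In this toy example the second pick is sample 2 rather than sample 1, even though sample 1 has the higher node weight, because its edge to the already-selected sample 0 contributes more to the marginal gain. Updating the gain vector incrementally keeps each step linear in the number of samples, which is one plausible way to realize a greedy solver of this kind at scale.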