Selecting Samples on Graphs: A Unified Dataset Pruning Framework for Lossless Training Acceleration

📅 2026-06-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the high computational cost of large-scale training and the limited robustness of existing data pruning methods across varying pruning ratios or data distributions. The authors model the dataset as a weighted graph, where node weights capture the intrinsic value of individual samples and edge weights encode extrinsic relationships among them. For the first time, this approach unifies intrinsic and extrinsic pruning signals and formulates data pruning as a maximum-weight clique optimization problem with theoretical approximation guarantees. A greedy algorithm based on marginal gain is employed to solve this problem efficiently. The framework accommodates diverse importance metrics and provides a general objective function along with practical design principles. Experiments on ImageNet-1k with ResNet-50 demonstrate a reduction in training time of over 40% while preserving model accuracy.
📝 Abstract
The rapid growth of modern training datasets has significantly increased computational cost, motivating dataset pruning (DP) methods, which retain only a subset of informative samples to reduce training cost. Existing pruning criteria typically rely on either intrinsic signals that assess samples independently or extrinsic signals that promote diversity via pairwise relations. While effective in their own specific regimes, each captures only one aspect of sample utility and lacks robustness across different pruning ratios or data distributions. In this work, we present a unified graph-based DP framework. By modeling the dataset as a weighted graph, where node weights encode intrinsic value and edge weights encode extrinsic value, DP can be cast as a Maximum Weight Clique Problem (MWCP). Although MWCP is NP-hard, its structure admits a principled greedy solution based on sample-wise marginal gains. Under a few mild conditions, we further prove that this unified objective enjoys a formal approximation guarantee, which applies to a broad family of importance metrics and provides practical design guidelines. Extensive experiments show that our method outperforms existing DP methods while substantially reducing training cost: on ImageNet-1k with ResNet-50, it cuts training time by over 40% without sacrificing accuracy.
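The greedy, marginal-gain selection described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact objective, so the form used here is an assumption: F(S) is the sum of node weights in S plus `lam` times the sum of edge weights between pairs in S, with symmetric edge weights. The names `greedy_prune`, `objective`, and `lam` are hypothetical and are not taken from the paper.

```python
import numpy as np


def greedy_prune(node_w, edge_w, budget, lam=1.0):
    """Greedily select `budget` samples by marginal gain.

    Assumed objective (not the paper's confirmed form):
        F(S) = sum_{i in S} node_w[i] + lam * sum_{i<j in S} edge_w[i, j]
    node_w: (n,) intrinsic scores; edge_w: (n, n) symmetric pairwise scores.
    """
    node_w = np.asarray(node_w, dtype=float)
    edge_w = np.asarray(edge_w, dtype=float)
    n = node_w.shape[0]
    if edge_w.shape != (n, n):
        raise ValueError("edge_w must have shape (n, n)")
    if not 0 <= budget <= n:
        raise ValueError("budget must be between 0 and n")

    # gain[v] holds the marginal gain of adding v to the current selection.
    # With an empty selection it is just the node weight.
    gain = node_w.copy()
    chosen = np.zeros(n, dtype=bool)
    selected = []
    for _ in range(budget):
        # Exclude already-selected samples, then take the best candidate.
        v = int(np.argmax(np.where(chosen, -np.inf, gain)))
        selected.append(v)
        chosen[v] = True
        # Adding v raises every other sample's gain by its edge weight to v.
        gain += lam * edge_w[v]
    return selected


def objective(node_w, edge_w, subset, lam=1.0):
    """Evaluate the assumed objective F(S) for a list of selected indices."""
    node_w = np.asarray(node_w, dtype=float)
    edge_w = np.asarray(edge_w, dtype=float)
    idx = np.asarray(subset, dtype=int)
    pairwise = np.triu(edge_w[np.ix_(idx, idx)], k=1).sum()  # each pair once
    return float(node_w[idx].sum() + lam * pairwise)


# Toy data: samples 0 and 1 are individually valuable but nearly redundant
# (small edge weight), while sample 2 is less valuable but very different.
toy_nodes = [1.0, 0.9, 0.2]
toy_edges = [[0.0, 0.1, 1.0],
             [0.1, 0.0, 1.0],
             [1.0, 1.0, 0.0]]
kept = greedy_prune(toy_nodes, toy_edges, budget=2)
```

In this toy case the second pick is sample 2 rather than the higher-scoring sample 1, because its edge to the already-selected sample 0 contributes more marginal gain. Updating `gain` incrementally keeps each step at O(n) instead of recomputing the full objective.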
Problem

Research questions and friction points this paper is trying to address.

dataset pruning
sample selection
intrinsic signals
extrinsic signals
training acceleration
Innovation

Methods, ideas, or system contributions that make the work stand out.

dataset pruning
graph-based framework
maximum weight clique
marginal gain
training acceleration
🔎 Similar Papers
Dongyue Wu
State Key Laboratory of Multispectral Information Intelligent Processing Technology, School of Artificial Intelligence and Automation, Huazhong University of Science and Technology, Wuhan, China; Ant Group, Hangzhou, China
Zilin Guo
State Key Laboratory of Multispectral Information Intelligent Processing Technology, School of Artificial Intelligence and Automation, Huazhong University of Science and Technology, Wuhan, China
Xiaoyu Li
Ant Group, Hangzhou, China
Jiajia Liu
Ant Group
cv multimodal
Jingdong Chen
Senior Staff Algorithm Engineer, Ant Group
Computer Vision, Multimodal
Nong Sang
Huazhong University of Science and Technology
Computer Vision and Pattern Recognition
Changxin Gao
State Key Laboratory of Multispectral Information Intelligent Processing Technology, School of Artificial Intelligence and Automation, Huazhong University of Science and Technology, Wuhan, China