RCAP: Robust, Class-Aware, Probabilistic Dynamic Dataset Pruning

📅 2026-06-10

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the difficulty that existing dynamic data pruning methods have in preserving worst-group accuracy at high pruning rates, on both class-balanced and class-imbalanced datasets. The authors propose RCAP, a class-aware probabilistic pruning algorithm for classification tasks. In every epoch, RCAP aggregates the loss per class and uses a closed-form solution to estimate the fraction of each class to retain, then fills each class-wise subset with an adaptive sampling strategy that prioritizes high-loss samples. The method is evaluated on six datasets and five models across three training paradigms (training from scratch, transfer learning, and fine-tuning) and consistently outperforms state-of-the-art dataset pruning methods, with higher worst-group accuracy at all pruning rates. Notably, with only 10% of the data, RCAP improves performance on class-imbalanced datasets by more than 1% over full-data training while providing an average 8.69× speedup.

📝 Abstract

Dynamic data pruning techniques aim to reduce computational cost while minimizing information loss by periodically selecting representative subsets of input data during model training. However, existing methods often struggle to maintain strong worst-group accuracy, particularly at high pruning rates, across balanced and imbalanced datasets. To address this challenge, we propose RCAP, a Robust, Class-Aware, Probabilistic dynamic dataset pruning algorithm for classification tasks. RCAP applies a closed-form solution to estimate the fraction of samples to be included in the training subset for each individual class. This fraction is adaptively adjusted in every epoch using class-wise aggregated loss. Thereafter, it employs an adaptive sampling strategy that prioritizes high-loss samples when populating the class-wise subsets. We evaluate RCAP on six diverse datasets, ranging from class-balanced to highly imbalanced, using five distinct models across three training paradigms: training from scratch, transfer learning, and fine-tuning. Our approach consistently outperforms state-of-the-art dataset pruning methods, achieving superior worst-group accuracy at all pruning rates. Remarkably, with only $10\%$ of the data, RCAP delivers a $>1\%$ improvement in performance on class-imbalanced datasets compared to full-data training while providing an average $8.69\times$ speedup. The code can be accessed at https://github.com/atif-hassan/RCAP-dynamic-dataset-pruning
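The two steps described in the abstract (a per-class retention fraction driven by class-wise aggregated loss, followed by sampling that favors high-loss examples) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not state the closed-form solution, so the sketch assumes a simple stand-in in which each class's fraction is proportional to its mean loss and scaled to a target budget. The function names `class_keep_fractions` and `sample_high_loss_first`, the `keep_ratio` parameter, and the loss-proportional sampling rule are all hypothetical choices made for illustration.

```python
import numpy as np


def class_keep_fractions(losses, labels, num_classes, keep_ratio):
    """Per-class keep fractions, proportional to mean class loss.

    ASSUMPTION: this proportional rule is a stand-in for the paper's
    closed-form solution, which the abstract does not spell out.
    """
    counts = np.bincount(labels, minlength=num_classes).astype(float)
    loss_sums = np.bincount(labels, weights=losses, minlength=num_classes)
    # Mean loss per class; classes with no samples get a mean of 0.
    mean_loss = np.divide(
        loss_sums, counts, out=np.zeros_like(loss_sums), where=counts > 0
    )
    budget = keep_ratio * len(labels)
    denom = float(np.dot(counts, mean_loss))
    if denom == 0.0:
        # All losses are zero: fall back to a uniform fraction.
        return np.full(num_classes, keep_ratio)
    # Scale so that sum_c counts[c] * fraction[c] equals the budget,
    # then clip to [0, 1] (clipping can make the total fall short).
    return np.clip(budget * mean_loss / denom, 0.0, 1.0)


def sample_high_loss_first(losses, labels, fractions, rng):
    """Draw each class's quota without replacement, favoring high loss.

    ASSUMPTION: sampling probability proportional to per-sample loss.
    """
    selected = []
    for c, frac in enumerate(fractions):
        idx = np.flatnonzero(labels == c)
        k = int(round(frac * idx.size))
        if k == 0:
            continue
        weights = losses[idx] + 1e-12  # keep every probability positive
        probs = weights / weights.sum()
        selected.append(rng.choice(idx, size=k, replace=False, p=probs))
    if not selected:
        return np.empty(0, dtype=int)
    return np.sort(np.concatenate(selected))


# Toy imbalanced data: 90 easy majority samples, 10 hard minority samples.
rng = np.random.default_rng(0)
labels = np.array([0] * 90 + [1] * 10)
losses = np.where(labels == 0, 0.1, 2.0)

fractions = class_keep_fractions(losses, labels, num_classes=2, keep_ratio=0.1)
subset = sample_high_loss_first(losses, labels, fractions, rng)
print("keep fractions per class:", fractions)
print("subset size:", subset.size)
```

In a training loop, per-sample losses would be refreshed every epoch and the fractions recomputed, so a class that the model currently fits poorly keeps a larger share of its samples in the next subset.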

Problem

Research questions and friction points this paper is trying to address.

dynamic dataset pruning

worst-group accuracy

class imbalance

computational efficiency

representative subset selection

Innovation

Methods, ideas, or system contributions that make the work stand out.

dynamic dataset pruning

class-aware sampling

worst-group accuracy
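
Worst-group accuracy, the metric listed above, is commonly computed as the minimum accuracy over groups. The following is a minimal sketch under the assumption that groups coincide with class labels unless a separate group array is supplied; the source does not define the grouping, and the helper name `worst_group_accuracy` is hypothetical.

```python
import numpy as np


def worst_group_accuracy(y_true, y_pred, groups=None):
    """Return the lowest per-group accuracy.

    ASSUMPTION: when `groups` is None, each class is treated as a group.
    """
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    groups = y_true if groups is None else np.asarray(groups)
    correct = y_true == y_pred
    # Accuracy within each group, then the minimum across groups.
    return min(float(correct[groups == g].mean()) for g in np.unique(groups))


# Overall accuracy here is 0.8, but the minority class is only half right.
y_true = [0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 1, 0]
print(worst_group_accuracy(y_true, y_pred))
```

Reporting this value alongside average accuracy exposes cases where a pruned subset preserves overall performance while degrading a minority class.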