Data Pruning by Information Maximization

📅 2025-06-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper addresses efficient subset selection from large-scale datasets: constructing high-information, low-redundancy coresets that preserve downstream model performance. The authors propose InfoMax, which they describe as the first method to formulate data pruning as a discrete quadratic programming (DQP) problem, jointly optimizing per-sample importance and pairwise similarity. A gradient-based solver, augmented with similarity-matrix sparsification and dataset blocking, makes the optimization scale to million-sample datasets. Evaluated on image classification, vision-language pretraining, and instruction tuning of large language models, InfoMax consistently outperforms state-of-the-art pruning methods, improving model accuracy by 3-8% at identical subset sizes. These results support InfoMax's effectiveness in preserving task-relevant information while suppressing redundancy.

📝 Abstract
In this paper, we present InfoMax, a novel data pruning method, also known as coreset selection, designed to maximize the information content of selected samples while minimizing redundancy. By doing so, InfoMax enhances the overall informativeness of the coreset. The information of individual samples is measured by importance scores, which capture their influence or difficulty in model learning. To quantify redundancy, we use pairwise sample similarities, based on the premise that similar samples contribute similarly to the learning process. We formalize the coreset selection problem as a discrete quadratic programming (DQP) task, with the objective of maximizing the total information content, represented as the sum of individual sample contributions minus the redundancies introduced by similar samples within the coreset. To ensure practical scalability, we introduce an efficient gradient-based solver, complemented by sparsification techniques applied to the similarity matrix and dataset partitioning strategies. This enables InfoMax to seamlessly scale to datasets with millions of samples. Extensive experiments demonstrate the superior performance of InfoMax in various data pruning tasks, including image classification, vision-language pre-training, and instruction tuning for large language models.
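The objective described in the abstract (total importance of selected samples minus a redundancy penalty from pairwise similarities) can be sketched numerically. This is a minimal illustration, not the paper's implementation; the names `importance`, `similarity`, and the tradeoff weight `lam` are illustrative, not the authors' notation.

```python
import numpy as np

def coreset_objective(x, importance, similarity, lam=1.0):
    """Information content of a binary selection x: the sum of the
    selected samples' importance scores minus a penalty on pairwise
    similarity among the selected samples."""
    x = np.asarray(x, dtype=float)
    gain = importance @ x            # sum of individual contributions
    redundancy = x @ similarity @ x  # similarity mass inside the coreset
    return gain - lam * redundancy

# Toy example: 4 samples, where samples 0 and 1 are near-duplicates.
importance = np.array([1.0, 1.0, 0.8, 0.6])
S = np.array([[0.0, 0.9, 0.0, 0.0],
              [0.9, 0.0, 0.0, 0.0],
              [0.0, 0.0, 0.0, 0.1],
              [0.0, 0.0, 0.1, 0.0]])

# Picking the two near-duplicates scores worse than a diverse pair,
# even though their individual importance scores are higher.
dup_pair = coreset_objective([1, 1, 0, 0], importance, S)      # 2.0 - 1.8
diverse_pair = coreset_objective([1, 0, 1, 0], importance, S)  # 1.8 - 0.0
```

The quadratic term is what distinguishes this formulation from score-only pruning: a sample's value depends on what else is already in the coreset.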
Problem

Research questions and friction points this paper is trying to address.

Maximize information content while minimizing redundancy in data pruning
Measure sample importance and similarity to optimize coreset selection
Scale efficiently to large datasets with millions of samples
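The scalability point above hinges on the quadratic term: a dense n-by-n similarity matrix is intractable at millions of samples. One common way to make it tractable, sketched here under the assumption of a simple per-row top-k rule (the paper's exact sparsification scheme may differ), is to keep only each sample's strongest similarities:

```python
import numpy as np

def sparsify_topk(similarity, k=2):
    """Keep only each row's k largest off-diagonal similarities and
    zero the rest, then re-symmetrize. Weak similarities contribute
    little redundancy, so dropping them preserves the objective
    while making the quadratic term sparse."""
    S = np.array(similarity, dtype=float, copy=True)
    np.fill_diagonal(S, 0.0)
    out = np.zeros_like(S)
    for i in range(S.shape[0]):
        top = np.argsort(S[i])[-k:]   # indices of the k largest entries
        out[i, top] = S[i, top]
    return np.maximum(out, out.T)     # keep an edge if either side kept it

S = np.array([[0.0, 0.5, 0.2],
              [0.5, 0.0, 0.3],
              [0.2, 0.3, 0.0]])
S_sparse = sparsify_topk(S, k=1)
```

In practice the sparse matrix would be stored in a sparse format (e.g. CSR), and the dataset-blocking strategy mentioned above would partition samples so each block's subproblem fits in memory.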
Innovation

Methods, ideas, or system contributions that make the work stand out.

InfoMax maximizes information content in coresets
Uses discrete quadratic programming for selection
Efficient gradient-based solver ensures scalability
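A gradient-based solver for a discrete problem typically relaxes the binary selection to the box [0, 1]^n and rounds afterward. The sketch below is a simplified projected-gradient-ascent version under that assumption, not the paper's exact solver; `budget`, `steps`, and `lr` are illustrative parameters.

```python
import numpy as np

def infomax_select(importance, similarity, budget, lam=1.0,
                   steps=200, lr=0.05):
    """Relax the binary selection x to [0, 1]^n, run projected
    gradient ascent on  importance @ x - lam * x @ S @ x,
    then round by keeping the `budget` largest coordinates."""
    n = len(importance)
    x = np.full(n, budget / n)                    # uniform initialization
    for _ in range(steps):
        grad = importance - 2.0 * lam * (similarity @ x)
        x = np.clip(x + lr * grad, 0.0, 1.0)      # project onto the box
    return np.argsort(x)[-budget:]                # top-`budget` samples

# Reusing the near-duplicate toy setup: samples 0 and 1 are clones.
importance = np.array([1.0, 1.0, 0.8, 0.6])
S = np.array([[0.0, 0.9, 0.0, 0.0],
              [0.9, 0.0, 0.0, 0.0],
              [0.0, 0.0, 0.0, 0.1],
              [0.0, 0.0, 0.1, 0.0]])
selected = infomax_select(importance, S, budget=2)
```

The gradient's second term is the mechanism for redundancy suppression: as a sample's neighbors accumulate selection mass, its own gradient shrinks, so the solver avoids keeping both members of a near-duplicate pair.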