🤖 AI Summary
This paper addresses the subset selection problem in metric spaces that jointly optimizes utility and diversity, formalized as min-distance diversification with monotone submodular utility (MDMS) under a cardinality constraint. As MDMS is NP-hard, we propose GIST, the first algorithm achieving both theoretical guarantees and practical efficacy. GIST combines greedy independent set construction with thresholding: it approximates a series of maximum independent set problems with a bicriteria greedy algorithm, attaining a 1/2-approximation ratio, and we further prove that it is NP-hard to approximate MDMS within a factor of 0.5584. Unlike existing methods that optimize either utility or diversity alone, GIST demonstrates significant performance gains on single-shot subset selection for ImageNet image classification, empirically validating the effectiveness of jointly modeling utility and diversity.
📄 Abstract
We introduce a novel subset selection problem called min-distance diversification with monotone submodular utility ($\textsf{MDMS}$), which has a wide variety of applications in machine learning, e.g., data sampling and feature selection. Given a set of points in a metric space, the goal of $\textsf{MDMS}$ is to maximize an objective function combining a monotone submodular utility term and a min-distance diversity term between any pair of selected points, subject to a cardinality constraint. We propose the $\texttt{GIST}$ algorithm, which achieves a $\frac{1}{2}$-approximation guarantee for $\textsf{MDMS}$ by approximating a series of maximum independent set problems with a bicriteria greedy algorithm. We also prove that it is NP-hard to approximate to within a factor of $0.5584$. Finally, we demonstrate that $\texttt{GIST}$ outperforms existing benchmarks on a real-world image classification task that studies single-shot subset selection for ImageNet.
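To make the objective concrete, here is a minimal sketch of an MDMS-style evaluation. It assumes the utility and diversity terms are combined additively with a trade-off weight `lam`; the exact combination, the utility function `utility`, and the helper name `mdms_objective` are illustrative assumptions, not the paper's definitions.

```python
import itertools
import math

def mdms_objective(points, utility, selected, lam=1.0):
    """Score a candidate subset under an assumed additive MDMS-style
    objective: f(S) = g(S) + lam * min pairwise distance in S,
    where g is a monotone submodular utility function."""
    if not selected:
        return 0.0
    # Utility term: any monotone submodular set function supplied by the caller.
    g = utility(selected)
    # Diversity term: minimum Euclidean distance over all selected pairs
    # (0 for a singleton, where no pair exists).
    if len(selected) < 2:
        div = 0.0
    else:
        div = min(
            math.dist(points[u], points[v])
            for u, v in itertools.combinations(selected, 2)
        )
    return g + lam * div

# Toy usage with a coverage-style submodular utility (hypothetical data):
points = {0: (0.0, 0.0), 1: (3.0, 4.0), 2: (0.0, 1.0)}
coverage = {0: {1, 2}, 1: {2, 3}, 2: {1}}
utility = lambda S: len(set().union(*(coverage[i] for i in S)))

score = mdms_objective(points, utility, [0, 1])  # covers {1,2,3}, min dist 5
```

A cardinality-constrained solver would then search over subsets of size at most $k$ to maximize this score; GIST's contribution is doing so with a provable guarantee rather than by brute force.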