🤖 AI Summary
This paper addresses the subset selection problem in metric spaces that jointly optimizes utility and diversity, formalized as min-distance diversification with monotone submodular utility (MDMS) under a cardinality constraint. As MDMS is NP-hard, we propose GIST, the first algorithm achieving both theoretical guarantees and practical efficacy. GIST combines greedy independent set construction with thresholding: it approximates a series of maximum independent set problems with a bicriteria greedy algorithm, attaining a 1/2-approximation ratio, and we further prove that it is NP-hard to approximate MDMS within a factor of 0.5584. Unlike existing methods that optimize either utility or diversity alone, GIST demonstrates significant performance gains on single-shot subset selection for ImageNet image classification, empirically validating the effectiveness of jointly modeling utility and diversity.
📄 Abstract
We introduce a novel subset selection problem called min-distance diversification with monotone submodular utility ($\textsf{MDMS}$), which has a wide variety of applications in machine learning, e.g., data sampling and feature selection. Given a set of points in a metric space, the goal of $\textsf{MDMS}$ is to maximize an objective function combining a monotone submodular utility term and a min-distance diversity term between any pair of selected points, subject to a cardinality constraint. We propose the $\texttt{GIST}$ algorithm, which achieves a $\frac{1}{2}$-approximation guarantee for $\textsf{MDMS}$ by approximating a series of maximum independent set problems with a bicriteria greedy algorithm. We also prove that it is NP-hard to approximate to within a factor of $0.5584$. Finally, we demonstrate that $\texttt{GIST}$ outperforms existing benchmarks on a real-world image classification task that studies single-shot subset selection for ImageNet.
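To make the objective concrete, here is a minimal sketch of an MDMS-style evaluation. It assumes the utility and diversity terms are combined additively with a trade-off weight `lam`; the exact combination, the utility function `utility`, and the helper name `mdms_objective` are illustrative assumptions, not the paper's definitions.

```python
import itertools
import math

def mdms_objective(points, utility, selected, lam=1.0):
    """Score a candidate subset under an assumed additive MDMS-style
    objective: f(S) = g(S) + lam * min pairwise distance in S,
    where g is a monotone submodular utility function."""
    if not selected:
        return 0.0
    # Utility term: any monotone submodular set function supplied by the caller.
    g = utility(selected)
    # Diversity term: minimum Euclidean distance over all selected pairs
    # (0 for a singleton, where no pair exists).
    if len(selected) < 2:
        div = 0.0
    else:
        div = min(
            math.dist(points[u], points[v])
            for u, v in itertools.combinations(selected, 2)
        )
    return g + lam * div

# Toy usage with a coverage-style submodular utility (hypothetical data):
points = {0: (0.0, 0.0), 1: (3.0, 4.0), 2: (0.0, 1.0)}
coverage = {0: {1, 2}, 1: {2, 3}, 2: {1}}
utility = lambda S: len(set().union(*(coverage[i] for i in S)))

score = mdms_objective(points, utility, [0, 1])  # covers {1,2,3}, min dist 5
```

A cardinality-constrained solver would then search over subsets of size at most $k$ to maximize this score; GIST's contribution is doing so with a provable guarantee rather than by brute force.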