🤖 AI Summary
This paper studies the FairDiv problem—selecting a size-$k$ subset from a large $d$-dimensional dataset to maximize the minimum pairwise distance, subject to group fairness constraints. To address the prohibitive time and space complexity of existing algorithms, we propose the first near-linear-time (i.e., $O(n \log n)$) constant-factor approximation algorithm. Our method introduces an implicit linear programming framework that integrates multiplicative weight updates with high-dimensional geometric structures—including grid-based partitioning and distance indexing—to efficiently handle fairness constraints. Furthermore, we construct the first distribution-free fair coreset supporting streaming computation. Experiments on million-scale datasets yield near-optimal solutions within minutes—over 100× faster than state-of-the-art methods—while significantly reducing memory consumption.
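As a toy illustration of the FairDiv objective (not the paper's near-linear algorithm), the sketch below scores a candidate subset by its minimum pairwise distance and checks simple per-group lower-bound fairness constraints. The data points, group labels, and bounds are all hypothetical.

```python
import math

def min_pairwise_distance(points, subset):
    """Diversity score of a subset: the smallest pairwise Euclidean distance."""
    best = math.inf
    idx = list(subset)
    for i in range(len(idx)):
        for j in range(i + 1, len(idx)):
            best = min(best, math.dist(points[idx[i]], points[idx[j]]))
    return best

def satisfies_fairness(groups, subset, lower_bounds):
    """Check that the subset contains at least lower_bounds[g] points of each group g."""
    counts = {}
    for i in subset:
        counts[groups[i]] = counts.get(groups[i], 0) + 1
    return all(counts.get(g, 0) >= lb for g, lb in lower_bounds.items())

# Hypothetical 2-D data with two groups "a" and "b".
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0), (6.0, 5.0)]
groups = ["a", "a", "b", "b", "a"]
subset = [0, 3, 2]  # one candidate size-3 selection
print(min_pairwise_distance(points, subset))               # → 1.0
print(satisfies_fairness(groups, subset, {"a": 1, "b": 1}))  # → True
```

FairDiv asks for the size-$k$ subset maximizing this score among all subsets passing the fairness check; enumerating candidates is exponential, which is why the paper's implicit-LP machinery matters.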
📝 Abstract
The task of extracting a diverse subset from a dataset, often referred to as maximum diversification, plays a pivotal role in a wide range of consequential real-world applications. In this work, we study fairness-aware data subset selection, focusing on the problem of selecting a diverse set of size $k$ from a large collection of $n$ data points (FairDiv). The FairDiv problem is well studied in the data management and theory communities. We develop the first constant-factor approximation algorithm for FairDiv that runs in near-linear time using only linear space; in contrast, all previously known constant-factor approximation algorithms run in super-linear time (with respect to $n$ or $k$) and use super-linear space. Our approach achieves this efficiency through a novel combination of the Multiplicative Weight Update method and advanced geometric data structures, which together implicitly and approximately solve a linear program. We further improve the efficiency of our techniques by constructing a coreset, which also yields the first efficient streaming algorithm for FairDiv whose performance does not depend on the distribution of the data points. Empirical evaluation on million-sized datasets demonstrates that our algorithm achieves the best diversity within a minute, whereas all prior techniques are either highly inefficient or fail to produce a good solution.
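The Multiplicative Weight Update method referenced above maintains weights over the LP constraints and repeatedly answers an easy "best response" subproblem against the weight-averaged constraint. A minimal generic sketch of this idea (not the paper's implicit solver, which never materializes the LP) for approximately finding $x$ in the probability simplex with $Ax \ge b$, assuming such a point exists and entries lie in $[0, 1]$:

```python
import numpy as np

def mwu_feasibility(A, b, rounds=2000, eta=0.05):
    """Generic MWU sketch: weights concentrate on the constraints that are
    hardest to satisfy; the averaged oracle answers are near-feasible."""
    m, n = A.shape
    w = np.ones(m)          # one weight per constraint
    x_sum = np.zeros(n)
    for _ in range(rounds):
        p = w / w.sum()
        # Oracle: best simplex vertex against the weight-averaged constraint.
        j = np.argmax(p @ A)
        x = np.zeros(n)
        x[j] = 1.0
        x_sum += x
        # Boost weights of constraints this answer violates, shrink the rest.
        losses = A @ x - b  # entries in [-1, 1]
        w *= np.exp(-eta * losses)
    return x_sum / rounds   # average of the oracle's answers

# Hypothetical tiny instance: two constraints over three variables.
A = np.array([[1.0, 0.0, 0.5],
              [0.0, 1.0, 0.5]])
b = np.array([0.4, 0.4])
x = mwu_feasibility(A, b)
print(x, A @ x)  # A @ x should (approximately) dominate b
```

The paper's contribution is making each such round run without writing the LP down, by answering the oracle step with geometric data structures over the point set.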