OneBatchPAM: A Fast and Frugal K-Medoids Algorithm

📅 2025-01-31
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the high computational cost and memory overhead of k-medoids clustering on large-scale datasets, this paper proposes an efficient, low-memory approximation algorithm. The method employs a single-batch random sampling strategy with sample size $m = O(\log n)$ to estimate the objective function, reducing pairwise distance computations from $O(n^2)$ to $O(mn)$. The paper provides the first theoretical guarantee that this sampling scale ensures, with high probability, convergence performance equivalent to full-data local search. The approach integrates batch-sampled local search, a sample-complexity-driven theoretical analysis, and asymptotic objective estimation. Experiments on multiple real-world datasets demonstrate that the algorithm significantly outperforms FasterPAM and BanditPAM++ in runtime, achieves comparable clustering quality, and drastically reduces memory consumption, striking a strong trade-off among speed, accuracy, and memory efficiency.

📝 Abstract
This paper proposes a novel k-medoids approximation algorithm to handle large-scale datasets with reasonable computational time and memory complexity. We develop a local-search algorithm that iteratively improves the medoid selection based on an estimation of the k-medoids objective. A single batch of size $m \ll n$ provides the estimation, which reduces the required memory size and the number of pairwise dissimilarity computations to $O(mn)$, instead of the $O(n^2)$ required by most k-medoids baselines. We obtain theoretical results showing that a batch of size $m = O(\log n)$ is sufficient to guarantee, with high probability, the same performance as the original local-search algorithm. Multiple experiments conducted on real datasets of various sizes and dimensions show that our algorithm provides performance similar to state-of-the-art methods such as FasterPAM and BanditPAM++ with a drastically reduced running time.
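The core idea in the abstract can be sketched as follows: sample one batch of $m = O(\log n)$ reference points, compute only the $n \times m$ distance block, and run a PAM-style swap local search against that batch estimate of the objective. This is a minimal, hypothetical illustration of the technique, not the authors' exact OneBatchPAM procedure; the function name, batch-size heuristic, and swap schedule are all assumptions.

```python
import numpy as np

def one_batch_kmedoids(X, k, batch_size=None, max_iter=100, rng=None):
    """Hypothetical sketch of a single-batch k-medoids local search.

    The k-medoids objective is estimated on one random batch of m
    reference points, so only an (n x m) distance block is computed,
    in O(mn), instead of the full (n x n) matrix.
    """
    rng = np.random.default_rng(rng)
    n = len(X)
    # Heuristic m = O(log n); the constant here is an assumption.
    m = batch_size or max(k + 1, 10 * int(np.ceil(np.log(n))))
    batch = rng.choice(n, size=min(m, n), replace=False)

    # Distances from every candidate medoid to the batch: shape (n, m).
    D = np.linalg.norm(X[:, None, :] - X[batch][None, :, :], axis=2)

    medoids = rng.choice(n, size=k, replace=False)
    for _ in range(max_iter):
        # Batch estimate of the objective: each batch point is served
        # by its nearest current medoid.
        cost = D[medoids].min(axis=0).sum()
        improved = False
        # Greedy first-improvement swap search, PAM-style.
        for i in range(k):
            for cand in range(n):
                if cand in medoids:
                    continue
                trial = medoids.copy()
                trial[i] = cand
                trial_cost = D[trial].min(axis=0).sum()
                if trial_cost < cost - 1e-12:
                    medoids, cost, improved = trial, trial_cost, True
        if not improved:
            break
    return medoids
```

The swap loop mirrors the PAM local search the abstract refers to, but every cost evaluation touches only the $m$ batch columns, which is where the $O(mn)$ memory and runtime savings come from.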
Problem

Research questions and friction points this paper is trying to address.

Efficient Algorithm
Large-scale Datasets
Medoid Finding
Innovation

Methods, ideas, or system contributions that make the work stand out.

OneBatchPAM
Resource-Efficient Clustering
Large-Scale Data Processing
Antoine de Mathelin
ENS Paris-Saclay PhD Student
Machine Learning
Nicolas Enrique Cecchi
Centre Borelli, Université Paris-Saclay, CNRS, ENS Paris-Saclay
François Deheeger
Michelin
M. Mougeot
Centre Borelli, Université Paris-Saclay, CNRS, ENS Paris-Saclay
Nicolas Vayatis
Centre Borelli, Université Paris-Saclay, CNRS, ENS Paris-Saclay