OneBatchPAM: A Fast and Frugal K-Medoids Algorithm

📅 2025-01-31
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the high computational cost and memory overhead of k-medoids clustering on large-scale datasets, this paper proposes an efficient, low-memory approximation algorithm. The method employs a single-batch random sampling strategy with sample size $m = O(\log n)$ to estimate the objective function, reducing pairwise distance computations from $O(n^2)$ to $O(mn)$. The paper provides the first theoretical guarantee that this sampling scale ensures, with high probability, convergence performance equivalent to full-data local search. The approach integrates batch-sampled local search, a sample-complexity-driven theoretical analysis, and asymptotic objective estimation. Experiments on multiple real-world datasets demonstrate that the algorithm significantly outperforms FasterPAM and BanditPAM++ in runtime, achieves comparable clustering quality, and drastically reduces memory consumption, striking a strong trade-off among speed, accuracy, and memory efficiency.

📝 Abstract
This paper proposes a novel k-medoids approximation algorithm to handle large-scale datasets with reasonable computational time and memory complexity. We develop a local-search algorithm that iteratively improves the medoid selection based on an estimation of the k-medoids objective. A single batch of size $m \ll n$ provides the estimation, which reduces the required memory size and the number of pairwise dissimilarity computations to $O(mn)$, instead of the $O(n^2)$ required by most k-medoids baselines. We obtain theoretical results showing that a batch of size $m = O(\log n)$ is sufficient to guarantee, with high probability, the same performance as the original local-search algorithm. Multiple experiments conducted on real datasets of various sizes and dimensions show that our algorithm provides performance similar to state-of-the-art methods such as FasterPAM and BanditPAM++ with a drastically reduced running time.
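The core idea in the abstract can be sketched as follows: sample one batch of $m = O(\log n)$ reference points, compute only the $n \times m$ distance block, and run a PAM-style swap local search against that batch estimate of the objective. This is a minimal, hypothetical illustration of the technique, not the authors' exact OneBatchPAM procedure; the function name, batch-size heuristic, and swap schedule are all assumptions.

```python
import numpy as np

def one_batch_kmedoids(X, k, batch_size=None, max_iter=100, rng=None):
    """Hypothetical sketch of a single-batch k-medoids local search.

    The k-medoids objective is estimated on one random batch of m
    reference points, so only an (n x m) distance block is computed,
    in O(mn), instead of the full (n x n) matrix.
    """
    rng = np.random.default_rng(rng)
    n = len(X)
    # Heuristic m = O(log n); the constant here is an assumption.
    m = batch_size or max(k + 1, 10 * int(np.ceil(np.log(n))))
    batch = rng.choice(n, size=min(m, n), replace=False)

    # Distances from every candidate medoid to the batch: shape (n, m).
    D = np.linalg.norm(X[:, None, :] - X[batch][None, :, :], axis=2)

    medoids = rng.choice(n, size=k, replace=False)
    for _ in range(max_iter):
        # Batch estimate of the objective: each batch point is served
        # by its nearest current medoid.
        cost = D[medoids].min(axis=0).sum()
        improved = False
        # Greedy first-improvement swap search, PAM-style.
        for i in range(k):
            for cand in range(n):
                if cand in medoids:
                    continue
                trial = medoids.copy()
                trial[i] = cand
                trial_cost = D[trial].min(axis=0).sum()
                if trial_cost < cost - 1e-12:
                    medoids, cost, improved = trial, trial_cost, True
        if not improved:
            break
    return medoids
```

The swap loop mirrors the PAM local search the abstract refers to, but every cost evaluation touches only the $m$ batch columns, which is where the $O(mn)$ memory and runtime savings come from.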
Problem

Research questions and friction points this paper is trying to address.

Efficient Algorithm
Large-scale Datasets
Medoid Finding
Innovation

Methods, ideas, or system contributions that make the work stand out.

OneBatchPAM
Resource-Efficient Clustering
Large-Scale Data Processing
Antoine de Mathelin
ENS Paris-Saclay PhD Student
Machine Learning
Nicolas Enrique Cecchi
Centre Borelli, Université Paris-Saclay, CNRS, ENS Paris-Saclay
François Deheeger
Michelin
M. Mougeot
Centre Borelli, Université Paris-Saclay, CNRS, ENS Paris-Saclay
Nicolas Vayatis
Centre Borelli, Université Paris-Saclay, CNRS, ENS Paris-Saclay