🤖 AI Summary
To address the high computational cost and memory overhead of k-medoids clustering on large-scale datasets, this paper proposes an efficient, low-memory approximation algorithm. The method employs a single-batch random sampling strategy with sample size $m = O(\log n)$ to estimate the objective function, reducing pairwise distance computations from $O(n^2)$ to $O(mn)$. The authors provide the first theoretical guarantee that this sampling scale ensures, with high probability, convergence performance equivalent to full-data local search. The approach combines batched sampling-based local search, a sample-complexity-driven theoretical analysis, and asymptotic estimation of the objective. Experiments on multiple real-world datasets show that the algorithm significantly outperforms FasterPAM and BanditPAM++ in runtime, achieves comparable clustering quality, and drastically reduces memory consumption, yielding a strong trade-off among speed, accuracy, and memory efficiency.
📝 Abstract
This paper proposes a novel k-medoids approximation algorithm that handles large-scale datasets with reasonable computational time and memory complexity. We develop a local-search algorithm that iteratively improves the medoid selection based on an estimate of the k-medoids objective. A single batch of size m << n provides this estimate, which reduces the required memory and the number of pairwise dissimilarity computations to O(mn), instead of the O(n^2) required by most k-medoids baselines. We obtain theoretical results showing that a batch of size m = O(log(n)) is sufficient to guarantee, with high probability, the same performance as the original local-search algorithm. Multiple experiments conducted on real datasets of various sizes and dimensions show that our algorithm matches the performance of state-of-the-art methods such as FasterPAM and BanditPAM++ with a drastically reduced running time.
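The batched local-search idea described above can be sketched as follows. This is a hypothetical minimal re-implementation, not the paper's exact algorithm: the batch-size constant, the random swap-proposal rule, and the fixed iteration budget are all assumptions made for illustration. Each iteration draws one batch of size m = O(log n), scores candidate medoid swaps on that batch only, and keeps a swap when it lowers the estimated objective, so the per-iteration cost stays at O(mk) distance evaluations rather than O(nk).

```python
import numpy as np


def batched_kmedoids(X, k, n_iters=100, c=5, seed=None):
    """Sampling-based local search for k-medoids (illustrative sketch).

    Rather than evaluating the objective on all n points, each iteration
    estimates it on a single random batch of size m = O(log n), as in the
    approach summarized above. The swap-proposal rule (one random
    non-medoid candidate per medoid) is an assumption for this sketch.
    """
    rng = np.random.default_rng(seed)
    n = len(X)
    # Batch size m = O(log n); c is a free constant, kept >= k + 1 so a
    # batch can always distinguish k medoids.
    m = min(n, max(k + 1, int(c * np.log(n))))
    medoids = rng.choice(n, size=k, replace=False)

    def batch_cost(meds, batch):
        # Distances from batch points to the candidate medoids; each
        # batch point is assigned to its nearest medoid.
        d = np.linalg.norm(X[batch][:, None] - X[meds][None], axis=-1)
        return d.min(axis=1).sum()

    for _ in range(n_iters):
        batch = rng.choice(n, size=m, replace=False)  # fresh batch
        best = batch_cost(medoids, batch)
        # Try swapping each medoid with one random non-medoid candidate.
        for i in range(k):
            cand = rng.integers(n)
            if cand in medoids:
                continue
            trial = medoids.copy()
            trial[i] = cand
            cost = batch_cost(trial, batch)
            if cost < best:
                best, medoids = cost, trial
    return medoids
```

Because each swap is accepted or rejected from a batch estimate of the objective, the quality of the result hinges on the batch being large enough that the estimate ranks candidate swaps correctly, which is exactly what the paper's m = O(log n) sample-complexity analysis addresses.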