🤖 AI Summary
To address the high computational cost—$O(|\mathcal{X}|kd)$—of initial center selection in $k$-means++ on large-scale datasets, this paper proposes two efficient initialization algorithms based on rejection sampling. The first preserves the classical $O(\log k)$ competitive ratio while reducing time complexity to $\tilde{O}(\mathtt{nnz}(\mathcal{X}) + \eta k^2 d)$. The second introduces variance-aware weighted sampling, achieving a superior accuracy–efficiency trade-off with approximation error $k^{-\Omega(m/\eta)} \operatorname{Var}(\mathcal{X})$. Both methods tightly integrate sparse matrix operations with theoretically grounded sampling design. Rigorous theoretical analysis guarantees clustering quality, and extensive experiments on real-world large-scale datasets demonstrate speedups ranging from several-fold to over an order of magnitude, while maintaining clustering performance comparable to standard $k$-means++.
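The core primitive behind both proposed methods is rejection sampling: drawing from a target distribution by proposing from a cheap surrogate and accepting probabilistically. The paper's actual proposal distributions and envelopes are specific to $D^2$ sampling; the sketch below only illustrates the generic technique, with a toy target density (all names and the example density are illustrative, not from the paper):

```python
import numpy as np

def rejection_sample(f, proposal_draw, proposal_pdf, M, rng):
    """Draw one sample from the density proportional to f, using a
    proposal we can sample cheaply and an envelope constant M with
    f(x) <= M * proposal_pdf(x) for all x."""
    while True:
        x = proposal_draw(rng)
        # Accept x with probability f(x) / (M * proposal_pdf(x)).
        if rng.random() * M * proposal_pdf(x) <= f(x):
            return x

# Toy illustration: target density 2x on [0, 1], uniform proposal, M = 2.
rng = np.random.default_rng(0)
samples = [
    rejection_sample(lambda x: 2 * x, lambda r: r.random(), lambda x: 1.0, 2.0, rng)
    for _ in range(5000)
]
```

Under this toy target the sample mean should concentrate near $2/3$; the paper's contribution is designing proposals and envelopes so that each accepted center costs far less than a full pass over $\mathcal{X}$.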
📝 Abstract
The $k$-$\mathtt{means}$++ seeding algorithm (Arthur & Vassilvitskii, 2007) is widely used in practice for the $k$-means clustering problem, where the goal is to cluster a dataset $\mathcal{X} \subset \mathbb{R}^d$ into $k$ clusters. The popularity of this algorithm is due to its simplicity and its provable guarantee of being $O(\log k)$ competitive with the optimal solution in expectation. However, its running time is $O(|\mathcal{X}|kd)$, making it expensive for large datasets. In this work, we present a simple and effective rejection sampling based approach for speeding up $k$-$\mathtt{means}$++. Our first method runs in time $\tilde{O}(\mathtt{nnz}(\mathcal{X}) + \eta k^2 d)$ while still being $O(\log k)$ competitive in expectation. Here, $\eta$ is the ratio of the variance of the dataset to the optimal $k$-$\mathtt{means}$ cost in expectation, and $\tilde{O}$ hides logarithmic factors in $k$ and $|\mathcal{X}|$. Our second method presents a new trade-off between computational cost and solution quality: it incurs an additional scale-invariant factor of $k^{-\Omega(m/\eta)} \operatorname{Var}(\mathcal{X})$ on top of the $O(\log k)$ guarantee of $k$-$\mathtt{means}$++, improving upon a result of Bachem et al. (2016a), who incur an additional factor of $m^{-1}\operatorname{Var}(\mathcal{X})$, while still running in time $\tilde{O}(\mathtt{nnz}(\mathcal{X}) + mk^2d)$. We perform extensive empirical evaluations to validate our theoretical results and to show the effectiveness of our approach on real datasets.
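For reference, the $O(|\mathcal{X}|kd)$ cost the paper targets comes from the standard $k$-$\mathtt{means}$++ ($D^2$ sampling) baseline, which rescans all $n$ points in each of the $k$ rounds. A minimal sketch of that baseline (not the paper's accelerated method; function and variable names are our own):

```python
import numpy as np

def kmeans_pp_seed(X, k, rng=None):
    """Standard k-means++ seeding: each round costs O(n d), so the
    whole procedure costs O(n k d) for n = |X| points."""
    rng = rng or np.random.default_rng(0)
    n = X.shape[0]
    # First center: chosen uniformly at random from the dataset.
    centers = [X[rng.integers(n)]]
    # Squared distance from every point to its nearest chosen center.
    d2 = np.sum((X - centers[0]) ** 2, axis=1)
    for _ in range(k - 1):
        # D^2 sampling: pick the next center with probability
        # proportional to the current squared distances.
        idx = rng.choice(n, p=d2 / d2.sum())
        centers.append(X[idx])
        d2 = np.minimum(d2, np.sum((X - X[idx]) ** 2, axis=1))
    return np.array(centers)
```

The rejection sampling approach in the paper replaces the full-dataset rescans above with cheap proposals, trading them for the $\tilde{O}(\mathtt{nnz}(\mathcal{X}) + \eta k^2 d)$ bound stated in the abstract.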