A New Rejection Sampling Approach to $k$-$\mathtt{means}$++ With Improved Trade-Offs

📅 2025-02-04
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address the high computational cost, $O(|\mathcal{X}|kd)$, of initial center selection in k-means++ on large-scale datasets, this paper proposes two efficient initialization algorithms based on rejection sampling. The first preserves the classical $O(\log k)$ competitive ratio while reducing the time complexity to $\tilde{O}(\mathtt{nnz}(\mathcal{X}) + \beta k^2 d)$. The second introduces variance-aware weighted sampling, achieving a superior accuracy–efficiency trade-off with approximation error $k^{-\Omega(m/\beta)} \operatorname{Var}(\mathcal{X})$. Both methods tightly integrate sparse matrix operations with theoretically grounded sampling design. Rigorous theoretical analysis guarantees clustering quality, and extensive experiments on real-world large-scale datasets demonstrate speedups ranging from several-fold to over an order of magnitude, while maintaining clustering performance comparable to standard k-means++.

📝 Abstract
The $k$-$\mathtt{means}$++ seeding algorithm (Arthur & Vassilvitskii, 2007) is widely used in practice for the $k$-means clustering problem, where the goal is to cluster a dataset $\mathcal{X} \subset \mathbb{R}^d$ into $k$ clusters. The popularity of this algorithm is due to its simplicity and its provable guarantee of being $O(\log k)$ competitive with the optimal solution in expectation. However, its running time is $O(|\mathcal{X}|kd)$, making it expensive for large datasets. In this work, we present a simple and effective rejection sampling based approach for speeding up $k$-$\mathtt{means}$++. Our first method runs in time $\tilde{O}(\mathtt{nnz}(\mathcal{X}) + \beta k^2 d)$ while still being $O(\log k)$ competitive in expectation. Here, $\beta$ is a parameter which is the ratio of the variance of the dataset to the optimal $k$-$\mathtt{means}$ cost in expectation, and $\tilde{O}$ hides logarithmic factors in $k$ and $|\mathcal{X}|$. Our second method presents a new trade-off between computational cost and solution quality: it incurs an additional scale-invariant factor of $k^{-\Omega(m/\beta)} \operatorname{Var}(\mathcal{X})$ on top of the $O(\log k)$ guarantee of $k$-$\mathtt{means}$++, improving upon a result of Bachem et al. (2016a), who get an additional factor of $m^{-1}\operatorname{Var}(\mathcal{X})$ while running in time $\tilde{O}(\mathtt{nnz}(\mathcal{X}) + mk^2d)$. We perform extensive empirical evaluations to validate our theoretical results and to show the effectiveness of our approach on real datasets.
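A minimal sketch of the core idea, rejection sampling for $D^2$ seeding: rather than computing distances from every point to the current centers in each round, draw a uniform candidate and accept it with probability proportional to its squared distance to the nearest chosen center. The function names and the crude max-cost envelope below are illustrative assumptions, not the paper's algorithm, which uses a sharper, variance-based envelope to bound the number of rejections.

```python
import numpy as np

def d2_cost(x, centers):
    """Squared distance from point x to its nearest chosen center."""
    return min(float(np.sum((x - c) ** 2)) for c in centers)

def kmeanspp_rejection(X, k, rng=None):
    """Toy k-means++ seeding where each new center is drawn by rejection
    sampling. Proposal: uniform over the n points. A candidate x is accepted
    with probability d2_cost(x) / M, where M upper-bounds the per-point cost
    (here, the current maximum cost, recomputed with a full pass; the paper's
    point is precisely to avoid such passes)."""
    rng = np.random.default_rng(rng)
    n = len(X)
    centers = [X[rng.integers(n)]]               # first center: uniform
    for _ in range(k - 1):
        M = max(d2_cost(x, centers) for x in X)  # envelope for rejection
        while True:
            x = X[rng.integers(n)]               # uniform proposal
            if rng.random() * M <= d2_cost(x, centers):
                centers.append(x)                # accepted with prob ∝ D²(x)
                break
    return np.array(centers)
```

Accepting a uniform proposal with probability $D^2(x)/M$ makes the overall distribution of the accepted point proportional to $D^2(x)$, exactly the $k$-means++ sampling distribution; the efficiency question is how many rejections occur, which is what the envelope controls.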
Problem

Research questions and friction points this paper is trying to address.

Speeds up the k-means++ seeding algorithm
Reduces computational cost on large datasets
Improves the cost–quality trade-off
Innovation

Methods, ideas, or system contributions that make the work stand out.

Rejection sampling for k-means++
Improved time complexity trade-offs
Scale-invariant factor for quality
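Both running-time bounds are governed by the parameter $\beta$, the ratio of the dataset variance to the optimal $k$-means cost. A quick way to get a feel for it (hypothetical helper names; the cost of hand-picked candidate centers stands in for the unknown optimum):

```python
import numpy as np

def total_variance(X):
    """Var(X): total squared distance of the points to their mean,
    i.e. the optimal 1-means cost."""
    return float(np.sum((X - X.mean(axis=0)) ** 2))

def kmeans_cost(X, centers):
    """Sum over points of the squared distance to the nearest center."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return float(d2.min(axis=1).sum())

# Two tight clusters far apart: variance is large but the 2-means cost is
# small, so beta is large and the beta*k^2*d term in the bound grows.
X = np.array([[0.0], [1.0], [10.0], [11.0]])
centers = np.array([[0.5], [10.5]])   # (near-)optimal centers for k = 2
beta = total_variance(X) / kmeans_cost(X, centers)
```

Well-separated data thus has large $\beta$, while data whose optimal clustering removes little of the variance has $\beta$ close to 1.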