🤖 AI Summary
This work addresses the cost of D²-sampling, the core step of k-means++ seeding. We give a quantum algorithm for approximate D²-sampling in the QRAM model, yielding a quantum implementation of k-means++ that runs in Õ(ζ²k²) time, where ζ is the aspect ratio of the dataset, while preserving the O(log k) approximation guarantee. Building on this, we dequantize the algorithm using the sample-query access model to obtain a quantum-inspired classical algorithm, QI-k-means++, with running time O(Nd) + Õ(ζ²k²d), compared with the O(Nkd) cost of standard k-means++ seeding. We also obtain the first quantum approximation scheme for k-means whose running time depends only polylogarithmically on the dataset size N. The results rest on a robust approximation analysis of k-means++, quantum amplitude estimation, and dequantization techniques. Experiments show promising results for QI-k-means++ on large datasets with bounded aspect ratio.
📝 Abstract
$D^2$-sampling is a fundamental component of sampling-based clustering algorithms such as $k$-means++. Given a dataset $V \subset \mathbb{R}^d$ with $N$ points and a center set $C \subset \mathbb{R}^d$, $D^2$-sampling refers to picking a point from $V$ where the sampling probability of a point is proportional to its squared distance from the nearest center in $C$. Starting with empty $C$ and iteratively $D^2$-sampling and updating $C$ in $k$ rounds is precisely $k$-means++ seeding that runs in $O(Nkd)$ time and gives $O(\log{k})$-approximation in expectation for the $k$-means problem. We give a quantum algorithm for (approximate) $D^2$-sampling in the QRAM model that results in a quantum implementation of $k$-means++ that runs in time $\tilde{O}(\zeta^2 k^2)$. Here $\zeta$ is the aspect ratio (i.e., largest to smallest interpoint distance), and $\tilde{O}$ hides polylogarithmic factors in $N, d, k$. It can be shown through a robust approximation analysis of $k$-means++ that the quantum version preserves its $O(\log{k})$ approximation guarantee. Further, we show that our quantum algorithm for $D^2$-sampling can be 'dequantized' using the sample-query access model of Tang (PhD Thesis, Ewin Tang, University of Washington, 2023). This results in a fast quantum-inspired classical implementation of $k$-means++, which we call QI-$k$-means++, with a running time $O(Nd) + \tilde{O}(\zeta^2 k^2 d)$, where the $O(Nd)$ term is for setting up the sample-query access data structure. Experimental investigations show promising results for QI-$k$-means++ on large datasets with bounded aspect ratio. Finally, we use our quantum $D^2$-sampling with the known $D^2$-sampling-based classical approximation scheme (i.e., $(1+\varepsilon)$-approximation for any given $\varepsilon>0$) to obtain the first quantum approximation scheme for the $k$-means problem with polylogarithmic running time dependence on $N$.
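
For reference, the classical baseline described above can be sketched as follows. This is a minimal illustration of exact $D^2$-sampling and the resulting $O(Nkd)$ seeding loop, not the quantum or QI-$k$-means++ algorithm of the paper. The function names (`sq_dist`, `d2_sample`, `kmeanspp_seed`) are hypothetical, and the sketch assumes the usual convention that the first center is drawn uniformly at random when $C$ is empty.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points given as sequences."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def d2_sample(min_sq, rng):
    """Return an index i drawn with probability proportional to min_sq[i].

    min_sq[i] is the squared distance from point i to its nearest center.
    """
    if sum(min_sq) == 0.0:
        # Every point coincides with a center; fall back to a uniform draw.
        return rng.randrange(len(min_sq))
    return rng.choices(range(len(min_sq)), weights=min_sq, k=1)[0]


def kmeanspp_seed(points, k, seed=0):
    """Classical k-means++ seeding: k rounds of exact D^2-sampling."""
    rng = random.Random(seed)
    # Assumed convention: with an empty center set, pick uniformly at random.
    centers = [points[rng.randrange(len(points))]]
    min_sq = [sq_dist(p, centers[0]) for p in points]
    while len(centers) < k:
        c = points[d2_sample(min_sq, rng)]
        centers.append(c)
        # One O(Nd) pass per round keeps nearest-center distances current.
        min_sq = [min(m, sq_dist(p, c)) for m, p in zip(min_sq, points)]
    return centers


if __name__ == "__main__":
    data = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
    print(kmeanspp_seed(data, k=2, seed=1))
```

Each round touches all $N$ points in $d$ dimensions, which is where the $O(Nkd)$ total comes from; the approaches in the abstract replace this exact per-round pass with approximate sampling.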