🤖 AI Summary
This work addresses the lack of rigorous theoretical guarantees for traditional $k$-means clustering on real-world data. Leveraging the manifold hypothesis, the authors model high-dimensional data as a distribution supported on a low-dimensional manifold and employ optimal quantization theory to uncover a scaling law relating the geometric structure of the data to quantization error. Based on this insight, they propose Qkmeans, an efficient seeding algorithm that avoids strong assumptions about global optimality and instead builds on a verifiable data model. Notably, this is the first approach to harness the scaling law from quantization theory to guide clustering initialization, achieving an $O(\rho^{-2} \log k)$-approximate solution in $O(nD) + \widetilde{O}(\varepsilon^{1+\rho}\rho^{-1}k^{1+\gamma})$ time. Extensive cross-domain experiments demonstrate strong alignment between theoretical predictions and empirical performance.
📝 Abstract
We study beyond-worst-case analysis for the $k$-means problem, where the goal is to model typical instances of $k$-means arising in practice. Existing theoretical approaches provide guarantees under certain assumptions on the optimal solutions to $k$-means, making them difficult to validate in practice. We propose the manifold hypothesis, under which data observed in ambient dimension $D$ concentrates around a low-dimensional manifold of intrinsic dimension $d$, as a reasonable assumption for modeling real-world clustering instances. We identify key geometric properties of datasets that have theoretically predictable scaling laws depending on the quantization exponent $\varepsilon = 2/d$, using techniques from optimal quantization theory. We show how to exploit these regularities to design a fast seeding method called $\operatorname{Qkmeans}$ which provides $O(\rho^{-2} \log k)$-approximate solutions to the $k$-means problem in time $O(nD) + \widetilde{O}(\varepsilon^{1+\rho}\rho^{-1}k^{1+\gamma})$, where the exponent $\gamma = \varepsilon + \rho$ for an input parameter $\rho < 1$. This allows us to obtain new runtime versus quality trade-offs. We perform a large-scale empirical study across various domains to validate our theoretical predictions and algorithm performance, bridging theory and practice for beyond-worst-case data clustering.
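As a minimal illustration (not from the paper) of the scaling law behind the quantization exponent $\varepsilon = 2/d$: for a distribution of intrinsic dimension $d$, optimal quantization theory predicts that the $k$-means cost decays like $k^{-2/d}$. The sketch below checks this in the simplest case, the uniform distribution on $[0, 1]$ (so $d = 1$ and $\varepsilon = 2$), where the optimal $k$-point codebook is the set of midpoints of $k$ equal intervals; the helper `quantization_error` is a hypothetical name introduced here for illustration.

```python
import numpy as np

def quantization_error(k, n_grid=100_000):
    """Mean squared distance from a fine uniform grid on [0, 1] to the
    optimal k-point codebook (the midpoints of k equal intervals)."""
    x = (np.arange(n_grid) + 0.5) / n_grid          # proxy for uniform data, d = 1
    centers = (np.arange(k) + 0.5) / k              # optimal centers for uniform [0, 1]
    nearest = centers[np.abs(x[:, None] - centers[None, :]).argmin(axis=1)]
    return float(np.mean((x - nearest) ** 2))

ks = np.array([4, 8, 16, 32])
errs = np.array([quantization_error(k) for k in ks])

# Fit the scaling exponent from the log-log slope; the prediction is
# slope = -epsilon = -2/d = -2 for this one-dimensional example.
slope = np.polyfit(np.log(ks), np.log(errs), 1)[0]
print(f"estimated exponent: {slope:.3f}")
```

Doubling $k$ should shrink the cost by a factor of $2^{-\varepsilon} = 1/4$ here, which is the kind of predictable regularity the seeding method exploits.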