🤖 AI Summary
This work addresses the lack of rigorous theoretical guarantees for traditional $k$-means clustering on real-world data. Leveraging the manifold hypothesis, the authors model high-dimensional data as a distribution supported on a low-dimensional manifold and employ optimal quantization theory to uncover a scaling law relating the geometric structure of the data to quantization error. Based on this insight, they propose Qkmeans, an efficient seeding algorithm that avoids strong assumptions about global optimality and instead builds on a verifiable data model. Notably, this is the first approach to harness the scaling law from quantization theory to guide clustering initialization, achieving an $O(\rho^{-2} \log k)$-approximate solution in $O(nD) + \widetilde{O}(\varepsilon^{1+\rho}\rho^{-1}k^{1+\gamma})$ time. Extensive cross-domain experiments demonstrate strong alignment between theoretical predictions and empirical performance.
📝 Abstract
We study beyond-worst-case analysis for the $k$-means problem, where the goal is to model typical instances of $k$-means arising in practice. Existing theoretical approaches provide guarantees under certain assumptions on the optimal solutions to $k$-means, making them difficult to validate in practice. We propose the manifold hypothesis, under which data observed in ambient dimension $D$ concentrates around a low-dimensional manifold of intrinsic dimension $d$, as a reasonable assumption for modeling real-world clustering instances. We identify key geometric properties of datasets that have theoretically predictable scaling laws depending on the quantization exponent $\varepsilon = 2/d$, using techniques from optimal quantization theory. We show how to exploit these regularities to design a fast seeding method called $\operatorname{Qkmeans}$ which provides $O(\rho^{-2} \log k)$-approximate solutions to the $k$-means problem in time $O(nD) + \widetilde{O}(\varepsilon^{1+\rho}\rho^{-1}k^{1+\gamma})$, where the exponent $\gamma = \varepsilon + \rho$ for an input parameter $\rho < 1$. This allows us to obtain new runtime versus quality trade-offs. We perform a large-scale empirical study across various domains to validate our theoretical predictions and algorithm performance, bridging theory and practice for beyond-worst-case data clustering.
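As a minimal illustration (not from the paper) of the scaling law behind the quantization exponent $\varepsilon = 2/d$: for a distribution of intrinsic dimension $d$, optimal quantization theory predicts that the $k$-means cost decays like $k^{-2/d}$. The sketch below checks this in the simplest case, the uniform distribution on $[0, 1]$ (so $d = 1$ and $\varepsilon = 2$), where the optimal $k$-point codebook is the set of midpoints of $k$ equal intervals; the helper `quantization_error` is a hypothetical name introduced here for illustration.

```python
import numpy as np

def quantization_error(k, n_grid=100_000):
    """Mean squared distance from a fine uniform grid on [0, 1] to the
    optimal k-point codebook (the midpoints of k equal intervals)."""
    x = (np.arange(n_grid) + 0.5) / n_grid          # proxy for uniform data, d = 1
    centers = (np.arange(k) + 0.5) / k              # optimal centers for uniform [0, 1]
    nearest = centers[np.abs(x[:, None] - centers[None, :]).argmin(axis=1)]
    return float(np.mean((x - nearest) ** 2))

ks = np.array([4, 8, 16, 32])
errs = np.array([quantization_error(k) for k in ks])

# Fit the scaling exponent from the log-log slope; the prediction is
# slope = -epsilon = -2/d = -2 for this one-dimensional example.
slope = np.polyfit(np.log(ks), np.log(errs), 1)[0]
print(f"estimated exponent: {slope:.3f}")
```

Doubling $k$ should shrink the cost by a factor of $2^{-\varepsilon} = 1/4$ here, which is the kind of predictable regularity the seeding method exploits.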