🤖 AI Summary
This work addresses the problem of clustering mixtures of discrete distributions, such as Bernoulli models, with a method that also handles continuous distributions within a single framework. The authors propose a simple projection-based clustering algorithm: it first computes the best rank-$k$ approximation of the data matrix, then runs $k$-means on this low-rank representation to obtain approximate cluster centers. Samples are then projected onto these centers to produce the final cluster assignments. The method is rotationally invariant, so it avoids the coordinate-specific "combinatorial projections" that earlier approaches to discrete distributions rely on. This resolves McSherry's conjecture that a geometric clustering algorithm of this kind exists for discrete distributions. Under a natural separation condition on the cluster centers, the algorithm provably clusters correctly both discrete distributions and continuous ones such as high-dimensional Gaussians.
📝 Abstract
We propose a simple, projection-based algorithm for clustering mixtures of discrete (Bernoulli) distributions. Unlike previous approaches that rely on coordinate-specific "combinatorial projections," our algorithm is rotationally invariant and works by projecting samples onto approximate centers obtained via a $k$-means computation on the best rank-$k$ approximation of the data matrix. This resolves a conjecture of McSherry on the existence of such geometric algorithms for discrete distributions. The same algorithm also applies to continuous distributions such as high-dimensional Gaussians, providing a unified approach across distribution types. We prove that the algorithm succeeds under a natural separation condition on the cluster centers.
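
The pipeline described in the abstract (rank-$k$ approximation, then $k$-means, then projection onto the approximate centers) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. It rests on several assumptions that do not come from the source: the function names are hypothetical, Lloyd's iteration with a farthest-point initialization stands in for the unspecified $k$-means computation, and "projecting samples onto the centers" is read as projecting each sample onto the span of the centers and assigning it to the nearest one.

```python
import numpy as np


def rank_k_approximation(A, k):
    """Best rank-k approximation of A in Frobenius norm, via truncated SVD."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return (U[:, :k] * s[:k]) @ Vt[:k, :]


def lloyd_kmeans(X, k, n_iter=50, seed=0):
    """Plain Lloyd iteration (assumed stand-in for the k-means step)."""
    rng = np.random.default_rng(seed)
    # Farthest-point initialization: random first center, then the point
    # farthest from all centers chosen so far.
    centers = [X[rng.integers(len(X))]]
    for _ in range(1, k):
        diffs = X[:, None, :] - np.array(centers)[None, :, :]
        d2 = (diffs ** 2).sum(axis=2).min(axis=1)
        centers.append(X[np.argmax(d2)])
    centers = np.array(centers, dtype=float)

    for _ in range(n_iter):
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # keep the old center if a cluster empties
                centers[j] = members.mean(axis=0)
    return centers


def cluster_by_projection(A, k, seed=0):
    """Rows of A are samples. Returns (labels, approximate centers)."""
    A = np.asarray(A, dtype=float)
    A_k = rank_k_approximation(A, k)
    centers = lloyd_kmeans(A_k, k, seed=seed)

    # Assumed reading of the projection step: project the original samples
    # onto the span of the approximate centers, then pick the nearest center.
    Q, _ = np.linalg.qr(centers.T)  # orthonormal basis, shape (d, k)
    proj_samples = A @ Q
    proj_centers = centers @ Q
    d2 = ((proj_samples[:, None, :] - proj_centers[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1), centers
```

Every step here uses only the SVD and Euclidean distances, so rotating the data changes the computed singular vectors and centers by the same rotation and leaves the distances used for assignment unchanged, up to the sketch's random initialization and floating-point ties. The sketch says nothing about the separation condition under which the paper proves correctness; it only shows the shape of the computation.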