A Simple Algorithm for Clustering Discrete Distributions

📅 2026-04-25

🤖 AI Summary
This work addresses the problem of efficiently clustering samples drawn from mixtures of discrete distributions, such as Bernoulli models, with a method that also handles continuous distributions. The authors propose a simple projection-based algorithm: compute the best rank-$k$ approximation of the data matrix, run $k$-means on that low-rank representation to obtain approximate cluster centers, and then project the samples onto these centers to make the final cluster assignments. The method is rotationally invariant, so it avoids the coordinate-specific "combinatorial projections" used by earlier approaches. The result resolves McSherry's conjecture that such a geometric clustering algorithm exists for discrete distributions. Under a natural separation condition on the cluster centers, the algorithm provably succeeds on discrete distributions as well as on continuous ones such as high-dimensional Gaussians.
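
The three steps described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The function names (`rank_k_approximation`, `lloyd_kmeans`, `cluster`) are hypothetical, plain Lloyd iterations with farthest-first seeding stand in for whatever $k$-means procedure the paper analyzes, and "projecting samples onto the centers" is interpreted here, as an assumption, as projecting onto the span of the approximate centers and assigning each sample to the nearest projected center.

```python
import numpy as np


def rank_k_approximation(A, k):
    """Best rank-k approximation of A (in Frobenius norm) via truncated SVD."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return (U[:, :k] * s[:k]) @ Vt[:k]


def lloyd_kmeans(X, k, n_iter=50, seed=0):
    """Plain Lloyd iterations with farthest-first seeding (an assumed stand-in)."""
    rng = np.random.default_rng(seed)
    centers = [X[rng.integers(len(X))]]
    for _ in range(1, k):
        # Next seed: the point farthest from all seeds chosen so far.
        d = np.min([np.sum((X - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(X[np.argmax(d)])
    centers = np.array(centers, dtype=float)
    for _ in range(n_iter):
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return centers


def cluster(A, k):
    """Rows of A are samples; returns one cluster label per row."""
    A = np.asarray(A, dtype=float)
    # Step 1: best rank-k approximation of the data matrix.
    A_k = rank_k_approximation(A, k)
    # Step 2: k-means on the low-rank representation gives approximate centers.
    centers = lloyd_kmeans(A_k, k)
    # Step 3 (assumed interpretation): project the original samples onto the
    # span of the approximate centers and assign each to the nearest center.
    Q, _ = np.linalg.qr(centers.T)  # orthonormal basis for the span of the centers
    proj_samples = A @ Q
    proj_centers = centers @ Q
    d = ((proj_samples[:, None, :] - proj_centers[None, :, :]) ** 2).sum(axis=2)
    return d.argmin(axis=1)
```

Every step uses only inner products, distances, and the SVD, so applying the same rotation to all samples does not change the returned labels, which is the sense in which a procedure like this is rotationally invariant.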

📝 Abstract
We propose a simple, projection-based algorithm for clustering mixtures of discrete (Bernoulli) distributions. Unlike previous approaches that rely on coordinate-specific "combinatorial projections," our algorithm is rotationally invariant and works by projecting samples onto approximate centers obtained via a $k$-means computation on the best rank-$k$ approximation of the data matrix. This resolves a conjecture of McSherry on the existence of such geometric algorithms for discrete distributions. The same algorithm also applies to continuous distributions such as high-dimensional Gaussians, providing a unified approach across distribution types. We prove that the algorithm succeeds under a natural separation condition on the cluster centers.
Problem

Research questions and friction points this paper is trying to address.

clustering
discrete distributions
mixture models
rotational invariance
Bernoulli distributions
Innovation

Methods, ideas, or system contributions that make the work stand out.

projection-based clustering
rotationally invariant
discrete distributions
unified clustering framework
rank-k approximation