🤖 AI Summary
Problem: Clustering ambiguity, arising from density heterogeneity and multimodal structure within clusters, hampers unique recovery of ground-truth clusters, even when the number of clusters $K$ is known a priori. Method: This paper formally defines *clustering recoverability* and establishes an information-theoretic, decidable condition for an unambiguous $K$-clustering. Based on this theory, the authors propose a two-stage algorithm: (i) a density-adaptive seed discovery phase, coupled with an information-theoretic criterion that determines whether an unambiguous $K$-clustering exists; and (ii) a greedy expansion strategy that completes the clustering without assuming uniform density and that includes a mechanism for handling overlapping clusters. Results: Experiments show improved performance over DBSCAN, spectral clustering, and HDBSCAN on many non-convex, overlapping, and variable-density datasets. The method requires little parameter selection and is reported to be highly reproducible.
📝 Abstract
Clustering is often a challenging problem because of the inherent ambiguity in what the "correct" clustering should be. Even when the number of clusters $K$ is known, this ambiguity often persists, particularly when density varies among clusters and individual clusters contain multiple, relatively separated regions of high density. In this paper we propose an information-theoretic characterization of when a $K$-clustering is ambiguous, and design an algorithm that recovers the clustering whenever it is unambiguous. The characterization formalizes the situation in which two high-density regions within a single cluster are separated enough that they look more like two distinct clusters than two truly distinct clusters in the clustering do. The algorithm first identifies $K$ partial clusters (or "seeds") using a density-based approach, and then adds unclustered points to these $K$ partial clusters in a greedy manner to form a complete clustering. We implement and test a version of the algorithm that is modified to handle overlapping clusters effectively, and observe that it requires little parameter selection and displays improved performance on many datasets compared to widely used algorithms for non-convex cluster recovery.
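The two-stage idea (density-based seeds, then greedy expansion) can be illustrated with a minimal sketch. This is not the authors' algorithm: the abstract does not specify the seed criterion, the expansion rule, or the overlap handling, so everything below is an assumption made for illustration. Here each seed is a single point chosen by a k-nearest-neighbor density estimate weighted by distance to the seeds already chosen, and expansion repeatedly attaches the unlabeled point nearest to any labeled point. The names `knn_density`, `find_seeds`, and `greedy_expand` and the parameter `k` are hypothetical.

```python
import numpy as np

def knn_density(X, k=5):
    """Crude density estimate: inverse distance to the k-th nearest neighbor.
    Requires more than k points. (Assumed estimator, not from the paper.)"""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    kth = np.sort(d, axis=1)[:, k]  # column 0 is the point itself
    return 1.0 / (kth + 1e-12), d

def find_seeds(X, K, k=5):
    """Pick K seed points: the densest point first, then points that are
    both dense and far from every seed chosen so far (assumed heuristic)."""
    density, d = knn_density(X, k)
    seeds = [int(np.argmax(density))]
    while len(seeds) < K:
        gap = d[:, seeds].min(axis=1)  # distance to nearest chosen seed
        seeds.append(int(np.argmax(density * gap)))
    return seeds, d

def greedy_expand(d, seeds):
    """Grow clusters greedily: at each step, label the unlabeled point that
    is closest to any already-labeled point with that point's cluster."""
    n = d.shape[0]
    labels = np.full(n, -1)
    for c, s in enumerate(seeds):
        labels[s] = c
    best = d[:, seeds].min(axis=1)      # distance to nearest labeled point
    owner = d[:, seeds].argmin(axis=1)  # cluster of that labeled point
    best[seeds] = np.inf                # seeds are already labeled
    for _ in range(n - len(seeds)):
        i = int(np.argmin(best))
        labels[i] = owner[i]
        best[i] = np.inf
        # The newly labeled point may now be the nearest labeled neighbor
        # of some still-unlabeled points.
        closer = (labels == -1) & (d[:, i] < best)
        best[closer] = d[closer, i]
        owner[closer] = labels[i]
    return labels
```

A call such as `seeds, d = find_seeds(X, K=2)` followed by `labels = greedy_expand(d, seeds)` labels every row of `X`. The sketch computes all pairwise distances, so it only suits small datasets, and it includes neither the ambiguity test nor the overlap handling described in the abstract.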