AI Summary
Clustering analyses often lack a quantitative assessment of reproducibility. To address this gap, this work proposes ERICA (evaluating replicability via iterative clustering assignments), a framework for quantifying clustering reproducibility. ERICA derives a replicability statistic from iterative cluster assignments and adds quantitative visualizations that reveal inter-cluster similarity and potential outliers. On synthetic datasets, clusters are recovered in a replicable manner. Applied to three breast cancer gene expression datasets used for subtype validation, the method indicates that some clustering results may not be replicable. These findings underscore ERICA's practical value for evaluating the reliability of clustering outcomes and the robustness of the underlying data structure.
Abstract
Despite being ubiquitous in science, clustering remains a technique whose results are not quantitatively scrutinized via a framework. We present an analysis called evaluating replicability via iterative clustering assignments (ERICA) that is applied to a dataset to determine whether clusters are identified in a replicable manner. The pipeline computes a statistic that describes whether structure is found in a dataset. Quantitative visualization methods are presented to answer important questions such as the similarity between clusters, and the identity of points that may be outliers. When tested on synthetic data, the findings show clusters being discovered in a replicable manner. However, we note a possibility for non-replicable results when the pipeline is applied to three gene expression datasets for breast cancer subtype validation. The study underscores the need for rigorous inspection and offers a practical tool for doing so.
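The abstract does not specify how the ERICA statistic is computed, so the following is only a minimal sketch of the general idea of checking replicability through repeated cluster assignments. It is not the authors' pipeline. It assumes plain k-means on random subsamples and a generic pairwise co-assignment score. The function names `kmeans_labels` and `co_assignment_stability`, the parameters `n_runs` and `frac`, and the score itself are all hypothetical choices made for illustration.

```python
import random
from itertools import combinations


def kmeans_labels(points, k, rng, n_steps=25):
    """Plain Lloyd's k-means; returns one cluster label per point."""
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(n_steps):
        # Assign each point to its nearest center (squared Euclidean distance).
        labels = [
            min(range(k), key=lambda c: sum((x - m) ** 2 for x, m in zip(p, centers[c])))
            for p in points
        ]
        # Move each center to the mean of its members; keep it if the cluster is empty.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return labels


def co_assignment_stability(points, k, n_runs=30, frac=0.8, seed=0):
    """Hypothetical score in [0, 1]: how consistently pairs of points are
    clustered together (or apart) across repeated subsampled runs."""
    rng = random.Random(seed)
    n = len(points)
    together = {}  # pair -> runs in which both points shared a cluster
    sampled = {}   # pair -> runs in which both points were drawn
    for _ in range(n_runs):
        idx = sorted(rng.sample(range(n), max(k, int(frac * n))))
        labels = kmeans_labels([points[i] for i in idx], k, rng)
        for (a, la), (b, lb) in combinations(zip(idx, labels), 2):
            sampled[(a, b)] = sampled.get((a, b), 0) + 1
            together[(a, b)] = together.get((a, b), 0) + (la == lb)
    # |2p - 1| is 1 when a pair is always together or always apart,
    # and 0 when the pair's co-assignment is a coin flip.
    scores = [abs(2 * together[pair] / sampled[pair] - 1) for pair in sampled]
    return sum(scores) / len(scores)


# Illustrative synthetic data: two well-separated 2-D blobs of 20 points each.
gen = random.Random(1)
blobs = [(gen.gauss(cx, 0.5), gen.gauss(cx, 0.5)) for cx in (0.0, 10.0) for _ in range(20)]
print(co_assignment_stability(blobs, k=2))
```

Working with pairs of points rather than raw labels sidesteps the fact that cluster labels are arbitrary from run to run. A score near 1 suggests that assignments repeat across runs, while a lower score suggests that some pairs are grouped inconsistently and the clustering deserves closer inspection.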