🤖 AI Summary
Latent position models for network data usually require the latent space dimension and the number of clusters to be specified in advance, or chosen by fitting and comparing many candidate models. This paper proposes the latent shrinkage position cluster model (LSPCM), a Bayesian latent position model that infers both quantities jointly. A Bayesian nonparametric shrinkage prior on the variance parameters of the latent positions gives higher dimensions increasingly small variances, so the effective dimensionality is read off from the dimensions with non-negligible variance. The latent positions follow a sparse finite Gaussian mixture model (GMM), and the number of clusters is inferred from the non-empty mixture components. Full Bayesian inference is performed via Markov chain Monte Carlo (MCMC). The model is assessed in simulation studies and applied to two real Twitter networks from sporting and political contexts. An open-source implementation is available.
📝 Abstract
Low-dimensional representation and clustering of network data are tasks of great interest across various fields. Latent position models are routinely used for this purpose: they assume that each node has a location in a low-dimensional latent space and can also cluster the nodes. However, these models fall short in simultaneously determining the optimal latent space dimension and the number of clusters. Here we introduce the latent shrinkage position cluster model (LSPCM), which addresses this limitation. The LSPCM posits a Bayesian nonparametric shrinkage prior on the latent positions' variance parameters, resulting in higher dimensions having increasingly smaller variances, which aids the identification of dimensions with non-negligible variance. Further, the LSPCM assumes the latent positions follow a sparse finite Gaussian mixture model, allowing automatic inference on the number of clusters via the non-empty mixture components. As a result, the LSPCM simultaneously infers the latent space dimensionality and the number of clusters, eliminating the need to fit and compare multiple models. The performance of the LSPCM is assessed via simulation studies and demonstrated through application to two real Twitter network datasets from sporting and political contexts. Open source software is available to promote widespread use of the LSPCM.
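To make the two mechanisms concrete, here is a minimal generative sketch in Python. It is not the authors' implementation and does no inference. It only simulates from a prior with the same flavour: per-dimension variances that shrink as the dimension index grows, and a finite Gaussian mixture whose weights come from a sparse Dirichlet prior, so that some components stay empty. The function name `simulate_lspcm_like`, the `1 + Gamma` multipliers, the hyperparameter values, and the logistic link on squared Euclidean distance are all illustrative assumptions that the abstract does not specify.

```python
import numpy as np


def simulate_lspcm_like(n_nodes=60, max_dim=5, max_components=8,
                        dirichlet_conc=0.05, alpha=0.5, seed=0):
    """Simulate a network from an LSPCM-flavoured prior (illustrative only)."""
    rng = np.random.default_rng(seed)

    # Shrinkage (assumed form): the precision of dimension l is a cumulative
    # product of multipliers that all exceed 1, so variances strictly
    # decrease with the dimension index.
    multipliers = 1.0 + rng.gamma(shape=2.0, scale=1.0, size=max_dim)
    variances = 1.0 / np.cumprod(multipliers)

    # Sparse finite mixture (assumed hyperparameter): a small Dirichlet
    # concentration puts most weight on a few components, leaving the
    # remaining components empty.
    weights = rng.dirichlet(np.full(max_components, dirichlet_conc))
    labels = rng.choice(max_components, size=n_nodes, p=weights)

    # Cluster means and within-cluster noise both scale with the shrinking
    # per-dimension standard deviations.
    std = np.sqrt(variances)
    means = rng.normal(size=(max_components, max_dim)) * std * 3.0
    positions = means[labels] + rng.normal(size=(n_nodes, max_dim)) * std * 0.3

    # Assumed link: logit P(edge) = alpha - squared Euclidean distance.
    diff = positions[:, None, :] - positions[None, :, :]
    sq_dist = (diff ** 2).sum(axis=-1)
    prob = 1.0 / (1.0 + np.exp(-(alpha - sq_dist)))

    # Draw an undirected network without self-loops.
    upper = np.triu(rng.random((n_nodes, n_nodes)) < prob, k=1)
    adjacency = (upper | upper.T).astype(int)
    return variances, labels, positions, adjacency


if __name__ == "__main__":
    variances, labels, positions, adjacency = simulate_lspcm_like()
    print("per-dimension variances:", np.round(variances, 4))
    print("non-empty components:", len(np.unique(labels)))
    print("number of edges:", adjacency.sum() // 2)
```

In this sketch, counting the unique values in `labels` plays the role of counting non-empty mixture components, and inspecting `variances` shows which dimensions carry non-negligible variance. In the model proper, both quantities would be read from posterior samples rather than from a single prior draw.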