🤖 AI Summary
Conventional deep latent variable models (e.g., VAEs) use simple priors, such as a standard normal, which limits their clustering capability. Gaussian mixture model (GMM) priors improve performance, but they require pre-specifying the number of clusters and are sensitive to initialization.
Method: For single-cell RNA-seq data analysis, we propose the VampPrior Mixture Model (VMM), the first integration of the VampPrior with a Dirichlet process Gaussian mixture model, enabling automatic inference of the number of clusters. We further design an alternating optimization procedure that combines variational inference with empirical Bayes to cleanly decouple the learning of variational parameters from that of prior parameters.
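To make the prior construction concrete, here is a minimal NumPy sketch of the idea: mixture components are obtained by pushing learnable pseudo-inputs through the encoder (a VampPrior), while mixture weights come from a truncated Dirichlet process via stick-breaking. All function names, the linear toy "encoder", and the unit-variance assumption are illustrative simplifications, not the authors' implementation.

```python
import numpy as np

def stick_breaking(v):
    # Truncated DP stick-breaking: pi_k = v_k * prod_{j<k}(1 - v_j).
    # With a finite truncation, weights sum to 1 only if the last v_K = 1.
    remaining = np.concatenate([[1.0], np.cumprod(1.0 - v[:-1])])
    return v * remaining

def encoder(u, W):
    # Toy stand-in for the VAE encoder q(z|x): a linear map to component
    # means, with unit variance for simplicity (assumption for this sketch).
    mu = u @ W
    log_var = np.zeros_like(mu)
    return mu, log_var

def log_prior(z, pseudo_inputs, W, v):
    # VampPrior-style mixture density: p(z) = sum_k pi_k N(z; mu_k, var_k),
    # with components given by encoding the learnable pseudo-inputs and
    # weights given by DP stick-breaking.
    pi = stick_breaking(v)
    mu, log_var = encoder(pseudo_inputs, W)
    var = np.exp(log_var)
    # Per-component diagonal-Gaussian log density of z.
    log_comp = -0.5 * np.sum(
        np.log(2.0 * np.pi * var) + (z - mu) ** 2 / var, axis=1
    )
    # Numerically stable log-sum-exp over components.
    m = np.max(log_comp)
    return m + np.log(np.sum(pi * np.exp(log_comp - m)))
```

In the full model, the pseudo-inputs, encoder weights, and stick-breaking parameters would be learned; the alternating scheme described above would update the variational (encoder) parameters by variational inference and the prior parameters (pseudo-inputs, mixture weights) by empirical Bayes.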
Contribution/Results: The method requires no pre-specified cluster count and exhibits strong robustness. It achieves state-of-the-art clustering performance across multiple benchmark datasets. When integrated into scVI, it significantly enhances cross-batch integration and yields biologically interpretable cell clusters.
📝 Abstract
Current clustering priors for deep latent variable models (DLVMs) require defining the number of clusters a priori and are susceptible to poor initializations. Addressing these deficiencies could greatly benefit deep learning-based scRNA-seq analysis by performing integration and clustering simultaneously. We adapt the VampPrior (Tomczak & Welling, 2018) into a Dirichlet process Gaussian mixture model, resulting in the VampPrior Mixture Model (VMM), a novel prior for DLVMs. We propose an inference procedure that alternates between variational inference and Empirical Bayes to cleanly distinguish variational and prior parameters. Using the VMM in a Variational Autoencoder attains highly competitive clustering performance on benchmark datasets. Augmenting scVI (Lopez et al., 2018), a popular scRNA-seq integration method, with the VMM significantly improves its performance and automatically arranges cells into biologically meaningful clusters.