🤖 AI Summary
This work addresses the challenge of performing principal component analysis on mixed-type data comprising continuous, binary, integer, and positive continuous variables. The authors propose a unified probabilistic latent variable framework in which observed variables are assumed to be generated from exponential family distributions driven by shared Gaussian latent factors. The covariance matrix of these latent variables is estimated via the method of moments, and sparsity constraints are imposed on the loading matrix to yield interpretable sparse principal components. This approach extends classical sparse PCA theory to heterogeneous data settings and combines principal component score estimation with sparse loading recovery. In experiments on synthetic data and the real-world Zoo dataset, the proposed method extracts sparse, interpretable principal components.
📝 Abstract
This work presents a new method for principal component analysis (PCA) of mixed-type data consisting of continuous, binary, integer-valued and positive continuous variables. The data are assumed to come from a probability model in which the parameters of the exponential family distributions are determined by a set of shared Gaussian latent variables. The proposed method, MTPCA, is based on estimating the covariance matrix of these latent mixtures through the method of moments. A way to sparsify the component loadings is presented, and it aligns with the classical theory of sparse PCA. We propose a strategy for estimating the principal component scores and discuss the choice of the latent dimension. The method's performance is studied on simulated mixed-type data, and the model is illustrated on the Zoo data set, which consists of binary animal characteristics.
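To make the assumed data-generating model concrete, the following minimal sketch simulates one variable of each type from shared Gaussian latent variables. It is an illustration only, not the authors' MTPCA implementation: the function name `simulate_mixed_data`, the loading matrix `W`, and the specific distributions and link functions (identity for the Gaussian variable, logistic for the Bernoulli variable, log for the Poisson and exponential variables) are all assumptions chosen for the example, since the abstract does not specify them.

```python
import numpy as np


def simulate_mixed_data(n=500, k=2, seed=0):
    """Draw n observations of 4 mixed-type variables from k shared latent factors.

    Hypothetical illustration: distributions and links are assumptions,
    not taken from the paper.
    """
    rng = np.random.default_rng(seed)
    p = 4  # one variable per data type

    W = rng.normal(scale=0.5, size=(p, k))  # assumed loading matrix
    Z = rng.normal(size=(n, k))             # shared Gaussian latent variables
    eta = Z @ W.T                           # latent linear mixtures, shape (n, p)

    x_cont = rng.normal(loc=eta[:, 0], scale=1.0)            # continuous
    x_bin = rng.binomial(1, 1.0 / (1.0 + np.exp(-eta[:, 1])))  # binary
    x_count = rng.poisson(np.exp(eta[:, 2]))                 # integer-valued
    x_pos = rng.exponential(scale=np.exp(eta[:, 3]))         # positive continuous

    X = np.column_stack([x_cont, x_bin, x_count, x_pos])
    return X, Z, W


if __name__ == "__main__":
    X, Z, W = simulate_mixed_data()
    print(X[:5])
```

In this sketch, all four columns of `X` depend on the same latent matrix `Z`, which is what lets a covariance estimate of the latent mixtures `eta` capture dependence across variables of different types.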