Model-based sparse mixed-type PCA

📅 2026-06-10

🤖 AI Summary

This work addresses principal component analysis of mixed-type data comprising continuous, binary, integer-valued, and positive continuous variables. The authors propose a probabilistic latent variable model in which each observed variable follows an exponential family distribution whose parameters are determined by shared Gaussian latent variables. The covariance matrix of the latent variables is estimated by the method of moments, and the component loadings are sparsified in a way that aligns with classical sparse PCA theory. The authors also propose a strategy for estimating principal component scores and discuss the choice of the latent dimension. The method is evaluated on simulated mixed-type data and illustrated on the Zoo data set of binary animal characteristics.
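To make the generative assumption concrete, the following is a minimal simulation sketch, not the authors' implementation. It draws shared Gaussian latent variables and passes a linear predictor into one exponential family distribution per variable type. The function name `simulate_mixed_data`, the loading matrix `W`, the link functions (identity, logit, log), and the specific distributions (Gaussian, Bernoulli, Poisson, Gamma) are all assumptions chosen for illustration; the summary does not specify them.

```python
import numpy as np


def simulate_mixed_data(n=500, d=2, seed=0):
    """Simulate four mixed-type variables driven by d shared Gaussian factors.

    All distributional choices below are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)

    # Hypothetical loading matrix: 4 observed variables, d latent factors.
    W = rng.normal(scale=0.5, size=(4, d))

    # Shared standard Gaussian latent variables, one row per observation.
    Z = rng.normal(size=(n, d))

    # Linear predictors, shape (n, 4); column j drives observed variable j.
    eta = Z @ W.T

    # Continuous variable: Gaussian with mean eta (identity link).
    cont = eta[:, 0] + rng.normal(scale=0.3, size=n)

    # Binary variable: Bernoulli with success probability sigmoid(eta).
    binary = rng.binomial(1, 1.0 / (1.0 + np.exp(-eta[:, 1])))

    # Integer-valued variable: Poisson with rate exp(eta) (log link).
    count = rng.poisson(np.exp(eta[:, 2]))

    # Positive continuous variable: Gamma with shape 2 and mean exp(eta).
    positive = rng.gamma(shape=2.0, scale=np.exp(eta[:, 3]) / 2.0)

    X = np.column_stack([cont, binary, count, positive])
    return X, Z, W
```

Calling `X, Z, W = simulate_mixed_data()` returns a 500 x 4 data matrix whose columns share dependence only through `Z`, which is the structure a mixed-type PCA would try to recover.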

📝 Abstract

This work presents a new method for principal component analysis (PCA) of mixed-type data consisting of continuous, binary, integer-valued and positive continuous variables. The data are assumed to come from a probability model in which the parameters of the exponential family distributions are determined by a set of shared Gaussian latent variables. The proposed method, MTPCA, is based on estimating the covariance matrix of these latent mixtures through the method of moments. A way to sparsify the component loadings is presented that aligns with the classical theory of sparse PCA. We propose a strategy for estimating the principal component scores and discuss the choice of the latent dimension. The method's performance is studied with simulated mixed-type data, and we illustrate the model on the Zoo data set consisting of binary animal characteristics.
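As a rough illustration of what sparse loadings from an estimated covariance matrix can look like, the sketch below soft-thresholds the leading eigenvectors and renormalizes them. This is a simple, generic thresholding heuristic swapped in for illustration; the abstract does not describe the sparsification MTPCA actually uses, and the names `soft_threshold`, `sparse_loadings`, and the threshold `lam` are hypothetical.

```python
import numpy as np


def soft_threshold(a, lam):
    """Shrink entries toward zero by lam; entries with |a| <= lam become 0."""
    return np.sign(a) * np.maximum(np.abs(a) - lam, 0.0)


def sparse_loadings(cov, k, lam):
    """Return k soft-thresholded, unit-norm eigenvectors of a symmetric matrix.

    cov : (p, p) symmetric covariance estimate (assumed given)
    k   : number of components to keep
    lam : soft-threshold level (illustrative tuning parameter)
    """
    eigvals, eigvecs = np.linalg.eigh(cov)

    # eigh returns eigenvalues in ascending order; take the k largest.
    order = np.argsort(eigvals)[::-1][:k]
    V = soft_threshold(eigvecs[:, order], lam)

    # Renormalize each column, leaving all-zero columns untouched.
    norms = np.linalg.norm(V, axis=0)
    norms[norms == 0] = 1.0
    return V / norms


# Example: two strongly correlated variables and two nearly independent ones.
cov = np.array([
    [4.0, 3.8, 0.0, 0.0],
    [3.8, 4.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.1],
    [0.0, 0.0, 0.1, 1.0],
])
V = sparse_loadings(cov, k=1, lam=0.1)
```

In this example the leading component loads only on the first two variables, so the thresholded loading vector has exact zeros for the remaining two.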

Problem

Research questions and friction points this paper is trying to address.

mixed-type data

principal component analysis

sparse PCA

latent variables

exponential family distributions

Innovation

Methods, ideas, or system contributions that make the work stand out.

mixed-type PCA

exponential family

Gaussian latent variables