Zero-inflation in the Multivariate Poisson Lognormal Family

📅 2024-05-23
📈 Citations: 4
Influential: 0
📄 PDF
🤖 AI Summary
Existing Poisson-log-normal (PLN) models struggle with zero-inflation—a pervasive issue in high-dimensional count data. To address this, we propose the Zero-Inflated PLN (ZIPLN) model, the first to integrate a multivariate zero-inflation component into the PLN framework, supporting both covariate-driven and feature-specific zero-generation mechanisms. ZIPLN jointly models zero generation and count processes via a Bernoulli–Gaussian coupling structure under a unified Bayesian framework. We design two conditional variational inference algorithms that balance estimation accuracy and scalability. Evaluated on synthetic data with 90% zero proportion and real bovine rumen microbiome data, ZIPLN achieves a +12.7% improvement in log-likelihood over baseline PLN models, enhances discriminability in the latent space, and effectively mitigates overdispersion. This work establishes a novel paradigm for analyzing high-dimensional, sparse count data.

Technology Category

Application Category

📝 Abstract
Analyzing high-dimensional count data is a challenge and statistical model-based approaches provide an adequate and efficient framework that preserves explainability. The (multivariate) Poisson-Log-Normal (PLN) model is one such model: it assumes count data are driven by an underlying structured latent Gaussian variable, so that the dependencies between counts solely stems from the latent dependencies. However PLN doesn't account for zero-inflation, a feature frequently observed in real-world datasets. Here we introduce the Zero-Inflated PLN (ZIPLN) model, adding a multivariate zero-inflated component to the model, as an additional Bernoulli latent variable. The Zero-Inflation can be fixed, site-specific, feature-specific or depends on covariates. We estimate model parameters using variational inference that scales up to datasets with a few thousands variables and compare two approximations: (i) independent Gaussian and Bernoulli variational distributions or (ii) Gaussian variational distribution conditioned on the Bernoulli one. The method is assessed on synthetic data and the efficiency of ZIPLN is established even when zero-inflation concerns up to $90%$ of the observed counts. We then apply both ZIPLN and PLN to a cow microbiome dataset, containing $90.6%$ of zeroes. Accounting for zero-inflation significantly increases log-likelihood and reduces dispersion in the latent space, thus leading to improved group discrimination.
Problem

Research questions and friction points this paper is trying to address.

Addresses zero-inflation in multivariate count data models
Extends Poisson-Log-Normal model with zero-inflation component
Handles high-dimensional datasets with excessive zero counts
Innovation

Methods, ideas, or system contributions that make the work stand out.

Zero-Inflated PLN model with Bernoulli latent
Variational inference for parameter estimation
Conditional Gaussian-Bernoulli variational distributions
🔎 Similar Papers
No similar papers found.
B
Bastien Batardiere
Universite Paris-Saclay, AgroParisTech, INRAE, UMR MIA Paris-Saclay, 91120, Palaiseau, France
Julien Chiquet
Julien Chiquet
Université Paris-Saclay, INRAE, AgroParisTech
StatisticsMachine LearningComputational Biology
F
Francois Gindraud
Universite Lyon 1, CNRS, Laboratoire de Biometrie et Biologie Evolutive UMR 5558, F-69622, Villeurbanne, France
M
Mahendra Mariadassou
Universite Paris-Saclay, INRAE, MaIAGE, 78350, Jouy-en-Josas, France