A Contaminated Model for Overdispersed Multinomial Microbiome Count Data

📅 2026-06-01

🤖 AI Summary
Microbiome multivariate count data often contain outlying observations that bias parameter estimates under the conventional Dirichlet-multinomial (DM) model. This work proposes a contaminated Dirichlet-multinomial (CDM) distribution, described as the first of its kind, to construct a two-component Bayesian mixture model that explicitly distinguishes between typical low-dispersion and anomalous high-dispersion observations. By retaining all data points, the CDM enables robust parameter inference and simultaneous outlier detection, down-weighting aberrant samples through their posterior probabilities while yielding an interpretable estimate of the contamination proportion. Applied to colorectal cancer microbiome data, the CDM consistently outperforms the standard DM model across multiple information criteria and reveals biologically plausible between-group patterns of anomalies.
📝 Abstract
Multinomial count data, such as microbial composition profiles derived from sequencing studies, frequently contain anomalous observations that distort parameter estimates. The Dirichlet-multinomial (DM) distribution is widely used in this setting but remains sensitive to such contamination. We propose the contaminated Dirichlet-multinomial (CDM) distribution, a two-component mixture in which the regular data come from a DM component with a lower dispersion and the irregular data come from a DM component with an inflated dispersion parameter. This construction accommodates anomalies without requiring their removal, and yields a natural rule for anomaly detection via posterior probabilities. Through sensitivity analyses involving both single-point anomalies and background noise, we demonstrate that the CDM distribution effectively downweights the influence of anomalous observations on the parameter estimates. The model is applied to gut microbiome data from a colorectal carcinogenesis study, where it consistently outperforms the DM distribution across all information criteria and identifies biologically plausible anomaly proportions in both the healthy and carcinoma subsets.
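The two-component construction and the posterior-probability detection rule described in the abstract can be illustrated with a minimal sketch. The parameterization here is an assumption, not taken from the paper: the regular component is a DM with concentration vector `alpha`, the contaminated component uses `alpha / eta` with `eta > 1` (a smaller total concentration means a larger dispersion), and `delta` is the contamination proportion. The function names `dm_logpmf`, `cdm_logpmf` and `anomaly_probability` are hypothetical, and the parameters are treated as known rather than estimated.

```python
import math

def dm_logpmf(x, alpha):
    """Log-pmf of a Dirichlet-multinomial for counts x and concentration alpha."""
    n = sum(x)
    a0 = sum(alpha)
    out = math.lgamma(n + 1) + math.lgamma(a0) - math.lgamma(n + a0)
    for xj, aj in zip(x, alpha):
        out += math.lgamma(xj + aj) - math.lgamma(aj) - math.lgamma(xj + 1)
    return out

def _component_logs(x, alpha, delta, eta):
    """Weighted log-densities of the regular and contaminated components.

    Assumed parameterization: the contaminated component shares the mean
    composition but divides the concentration by eta > 1.
    """
    regular = math.log(1.0 - delta) + dm_logpmf(x, alpha)
    contaminated = math.log(delta) + dm_logpmf(x, [a / eta for a in alpha])
    return regular, contaminated

def cdm_logpmf(x, alpha, delta, eta):
    """Log-pmf of the two-component mixture, computed with log-sum-exp."""
    regular, contaminated = _component_logs(x, alpha, delta, eta)
    m = max(regular, contaminated)
    return m + math.log(math.exp(regular - m) + math.exp(contaminated - m))

def anomaly_probability(x, alpha, delta, eta):
    """Posterior probability that x was generated by the contaminated component."""
    _, contaminated = _component_logs(x, alpha, delta, eta)
    return math.exp(contaminated - cdm_logpmf(x, alpha, delta, eta))

# Illustrative values only (not from the paper): three taxa, 30 reads per sample.
alpha = [20.0, 20.0, 20.0]
typical = [10, 10, 10]
extreme = [28, 1, 1]
p_typical = anomaly_probability(typical, alpha, delta=0.1, eta=10.0)
p_extreme = anomaly_probability(extreme, alpha, delta=0.1, eta=10.0)
```

Under these assumed values, a sample would be flagged as anomalous when its posterior probability exceeds a chosen threshold such as 0.5. In a full analysis `alpha`, `delta` and `eta` would be estimated from the data rather than fixed.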
Problem

Research questions and friction points this paper is trying to address.

multinomial count data
microbiome
anomalous observations
overdispersion
contamination
Innovation

Methods, ideas, or system contributions that make the work stand out.

contaminated Dirichlet-multinomial
overdispersed count data
anomaly detection
microbiome composition
robust modeling
Ockert van Heerden
Department of Statistics, University of Pretoria, Pretoria, South Africa
Andriëtte Bekker
Department of Statistics, University of Pretoria, Pretoria, South Africa; National Institute for Theoretical and Computational Sciences (NITheCS), Pretoria Node, University of Pretoria, South Africa
Seite Makgai
Department of Statistics, University of Pretoria, Pretoria, South Africa
Arno Otto
Department of Statistics, University of Pretoria, Pretoria, South Africa
Antonio Punzo
Full Professor of Statistics, University of Catania
Mixture Models, Hidden Markov Models, Heavy-tailed Distributions, Serial Dependence, Item Response Theory