A Contaminated Model for Overdispersed Multinomial Microbiome Count Data

📅 2026-06-01

🤖 AI Summary

Multivariate microbiome count data often contain outlying observations that bias parameter estimates under the conventional Dirichlet-multinomial (DM) model. This work proposes a contaminated Dirichlet-multinomial (CDM) distribution, presented as the first of its kind, and uses it to build a two-component Bayesian mixture model that explicitly separates typical low-dispersion observations from anomalous high-dispersion ones. Because all data points are retained, the CDM supports robust parameter inference and outlier detection at the same time: aberrant samples are down-weighted through their posterior probabilities, and the model yields an interpretable estimate of the contamination proportion. Applied to colorectal cancer microbiome data, the CDM consistently outperforms the standard DM model across multiple information criteria and reveals biologically plausible between-group patterns of anomalies.

📝 Abstract

Multinomial count data, such as microbial composition profiles derived from sequencing studies, frequently contain anomalous observations that distort parameter estimates. The Dirichlet-multinomial (DM) distribution is widely used in this setting but remains sensitive to such contamination. We propose the contaminated Dirichlet-multinomial (CDM) distribution, a two-component mixture in which the regular data come from a DM component with a lower dispersion and the irregular data come from a DM component with an inflated dispersion parameter. This construction accommodates anomalies without requiring their removal, and yields a natural rule for anomaly detection via posterior probabilities. Through sensitivity analyses involving both single-point anomalies and background noise, we demonstrate that the CDM distribution effectively downweights the influence of anomalous observations on the parameter estimates. The model is applied to gut microbiome data from a colorectal carcinogenesis study, where it consistently outperforms the DM distribution across all information criteria and identifies biologically plausible anomaly proportions in both the healthy and carcinoma subsets.
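The posterior-probability rule described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the parameterization is an assumption made here for illustration. The regular component is taken to be a DM with concentration vector `mean * conc`, the anomalous component shares the same mean but has its concentration divided by a hypothetical inflation factor `eta > 1` (lower concentration means higher dispersion), and `delta` is a hypothetical name for the proportion of regular observations. The paper may define these quantities differently.

```python
import math

def dm_logpmf(x, alpha):
    """Log-pmf of a Dirichlet-multinomial with concentration vector alpha."""
    n = sum(x)
    a0 = sum(alpha)
    # Multinomial coefficient and the Gamma(a0) / Gamma(n + a0) term
    out = math.lgamma(n + 1) + math.lgamma(a0) - math.lgamma(n + a0)
    for xj, aj in zip(x, alpha):
        out += math.lgamma(xj + aj) - math.lgamma(aj) - math.lgamma(xj + 1)
    return out

def cdm_posterior_regular(x, mean, conc, delta, eta):
    """Posterior probability that count vector x came from the regular component.

    Assumed (illustrative) parameterization:
      regular   component: DM(mean * conc)
      anomalous component: DM(mean * conc / eta), eta > 1 inflates dispersion
      delta: mixing proportion of the regular component, 0 < delta < 1
    """
    alpha_reg = [m * conc for m in mean]
    alpha_anom = [m * conc / eta for m in mean]
    log_reg = math.log(delta) + dm_logpmf(x, alpha_reg)
    log_anom = math.log1p(-delta) + dm_logpmf(x, alpha_anom)
    # Normalize in log space to avoid underflow for large counts
    top = max(log_reg, log_anom)
    w_reg = math.exp(log_reg - top)
    w_anom = math.exp(log_anom - top)
    return w_reg / (w_reg + w_anom)

# Illustrative values only; none of these numbers come from the paper.
mean = [0.5, 0.3, 0.2]
typical = [50, 30, 20]    # close to the assumed mean composition
extreme = [100, 0, 0]     # all reads assigned to a single taxon
p_typical = cdm_posterior_regular(typical, mean, conc=50.0, delta=0.9, eta=10.0)
p_extreme = cdm_posterior_regular(extreme, mean, conc=50.0, delta=0.9, eta=10.0)
```

Under these assumed values, a sample would be flagged as anomalous when its posterior probability of being regular falls below 0.5; the same probabilities act as the weights that reduce the influence of aberrant samples on the parameter estimates.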

Problem

Research questions and friction points this paper is trying to address.

multinomial count data

microbiome

anomalous observations

overdispersion

contamination

Innovation

Methods, ideas, or system contributions that make the work stand out.

contaminated Dirichlet-multinomial

overdispersed count data

anomaly detection