🤖 AI Summary
Multivariate count data from microbiome studies often contain outlying observations that bias parameter estimates under the conventional Dirichlet-multinomial (DM) model. This work proposes a contaminated Dirichlet-multinomial (CDM) distribution, the first of its kind, and uses it to build a two-component Bayesian mixture model that explicitly distinguishes typical low-dispersion observations from anomalous high-dispersion ones. Because it retains all data points, the CDM supports robust parameter inference and outlier detection at the same time: aberrant samples are down-weighted through their posterior probabilities, and the model yields an interpretable estimate of the contamination proportion. Applied to colorectal cancer microbiome data, the CDM outperforms the standard DM model on all information criteria considered and reveals biologically plausible between-group patterns of anomalies.
📝 Abstract
Multinomial count data, such as microbial composition profiles derived from sequencing studies, frequently contain anomalous observations that distort parameter estimates. The Dirichlet-multinomial (DM) distribution is widely used in this setting but remains sensitive to such contamination. We propose the contaminated Dirichlet-multinomial (CDM) distribution, a two-component mixture in which regular observations arise from a DM component with lower dispersion and anomalous observations arise from a DM component with an inflated dispersion parameter. This construction accommodates anomalies without requiring their removal and yields a natural rule for anomaly detection based on posterior probabilities. Through sensitivity analyses involving both single-point anomalies and background noise, we demonstrate that the CDM distribution effectively downweights the influence of anomalous observations on the parameter estimates. The model is applied to gut microbiome data from a colorectal carcinogenesis study, where it consistently outperforms the DM distribution across all information criteria and identifies biologically plausible anomaly proportions in both the healthy and carcinoma subsets.
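The two-component construction and the posterior-probability detection rule can be illustrated with a minimal Python sketch. The abstract does not state the paper's parameterization, so everything specific here is an assumption: the DM is written in terms of a concentration vector `alpha`, the inflated-dispersion component is obtained by dividing `alpha` by a hypothetical factor `eta > 1` (a smaller total concentration gives more overdispersion), and `delta` is a hypothetical name for the proportion of regular observations. The parameters are treated as known; no fitting is shown.

```python
from math import lgamma, exp, log


def dm_logpmf(x, alpha):
    """Log-pmf of a Dirichlet-multinomial with concentration vector alpha."""
    n = sum(x)
    a0 = sum(alpha)
    out = lgamma(n + 1) + lgamma(a0) - lgamma(n + a0)
    for xj, aj in zip(x, alpha):
        out += lgamma(xj + aj) - lgamma(aj) - lgamma(xj + 1)
    return out


def cdm_component_logs(x, alpha, delta, eta):
    """Weighted log-densities of the regular and the inflated component.

    Assumed parameterization (not taken from the paper): the inflated
    component shares the mean of the regular one but uses alpha / eta.
    """
    inflated = [a / eta for a in alpha]
    log_regular = log(delta) + dm_logpmf(x, alpha)
    log_anomalous = log(1.0 - delta) + dm_logpmf(x, inflated)
    return log_regular, log_anomalous


def cdm_logpmf(x, alpha, delta, eta):
    """Log-pmf of the two-component mixture, via a stable log-sum-exp."""
    lr, la = cdm_component_logs(x, alpha, delta, eta)
    m = max(lr, la)
    return m + log(exp(lr - m) + exp(la - m))


def prob_anomalous(x, alpha, delta, eta):
    """Posterior probability that x came from the inflated component."""
    _, la = cdm_component_logs(x, alpha, delta, eta)
    return exp(la - cdm_logpmf(x, alpha, delta, eta))


if __name__ == "__main__":
    alpha, delta, eta = [20.0, 20.0], 0.9, 10.0  # illustrative values only
    for counts in ([10, 10], [20, 0]):
        print(counts, round(prob_anomalous(counts, alpha, delta, eta), 3))
```

Under these illustrative values, a balanced sample such as `[10, 10]` receives a low posterior probability of being anomalous, while an extreme sample such as `[20, 0]` receives a high one. Flagging an observation when this probability exceeds 0.5 is one common convention for contaminated mixtures; the threshold used in the paper is not given in the abstract.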