🤖 AI Summary
Because they assume equidispersion (variance equal to the mean), Poisson models often overestimate uncertainty in count data, while conditionally underdispersed phenomena (i.e., more regular than Poisson) have long lacked interpretable and computationally tractable modeling tools. This paper introduces a systematic probabilistic framework based on discrete order statistics (e.g., minimum, median, maximum): observations are modeled as a specific order statistic of independent and identically distributed discrete latent variables, which naturally encodes underdispersion. The resulting model family is interpretable and offers tunable control over dispersion. The paper further proposes a modular data augmentation strategy that is compatible with parent distributions such as the Poisson and negative binomial and that supports both Bayesian and frequentist inference. Applied to four real-world datasets (commercial flight times, COVID-19 case counts, Finnish bird abundance, and RNA-seq), the approach often obtains better fit than commonly used alternatives, suggesting broad applicability and practical utility.
📝 Abstract
The Poisson distribution is the default choice of likelihood for probabilistic models of count data. However, due to the equidispersion constraint of the Poisson, such models may have predictive uncertainty that is artificially inflated. While overdispersion has been extensively studied, conditional underdispersion -- where latent structure renders data more regular than Poisson -- remains underexplored, in part due to the lack of tractable modeling tools. We introduce a new class of models based on discrete order statistics, where observed counts are assumed to be an order statistic (e.g., minimum, median, maximum) of i.i.d. draws from some discrete parent, such as the Poisson or negative binomial. We develop a general data augmentation scheme that is modular with existing tools tailored to the parent distribution, enabling parameter estimation or posterior inference in a wide range of such models. We characterize properties of Poisson and negative binomial order statistics, exposing interpretable knobs on their dispersion. We apply our framework to four case studies -- commercial flight times, COVID-19 case counts, Finnish bird abundance, and RNA sequencing data -- and illustrate the flexibility and generality of the proposed framework. Our results suggest that order statistic models can be built, used, and interpreted in much the same way as commonly used alternatives, while often obtaining better fit, and offer promise in the wide range of applications in which count data arise.
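To make the central construction concrete, the following minimal sketch computes the distribution of the k-th smallest of n i.i.d. Poisson draws and its variance-to-mean ratio. It relies only on the standard identity P(X_(k) <= x) = sum over j from k to n of C(n, j) F(x)^j (1 - F(x))^(n - j), where F is the parent CDF. The function names, the truncation point `support_max`, and the example values (rate 10, median of 5) are illustrative assumptions, not taken from the paper, and the sketch does not implement the paper's data augmentation scheme.

```python
import math


def poisson_cdf(x, rate):
    """P(Z <= x) for Z ~ Poisson(rate); returns 0 for x < 0."""
    if x < 0:
        return 0.0
    term = math.exp(-rate)  # P(Z = 0)
    total = term
    for i in range(1, x + 1):
        term *= rate / i  # P(Z = i) from P(Z = i - 1)
        total += term
    return min(total, 1.0)


def order_stat_cdf(x, k, n, rate):
    """P(X_(k) <= x): at least k of the n draws are <= x."""
    f = poisson_cdf(x, rate)
    return sum(
        math.comb(n, j) * f**j * (1.0 - f) ** (n - j) for j in range(k, n + 1)
    )


def order_stat_pmf(x, k, n, rate):
    """P(X_(k) = x), obtained by differencing the CDF."""
    return order_stat_cdf(x, k, n, rate) - order_stat_cdf(x - 1, k, n, rate)


def mean_and_variance(k, n, rate, support_max=200):
    """Mean and variance of X_(k), truncating the support at support_max
    (an illustrative choice that is ample for moderate rates)."""
    pmf = [order_stat_pmf(x, k, n, rate) for x in range(support_max + 1)]
    mean = sum(x * p for x, p in enumerate(pmf))
    var = sum((x - mean) ** 2 * p for x, p in enumerate(pmf))
    return mean, var


if __name__ == "__main__":
    # n = 1, k = 1 recovers the Poisson parent (variance equals mean).
    # k = 3, n = 5 is the median of five draws.
    for k, n in [(1, 1), (3, 5)]:
        mean, var = mean_and_variance(k, n, rate=10.0)
        print(f"k={k}, n={n}: mean={mean:.3f}, var={var:.3f}, var/mean={var / mean:.3f}")
```

With n = 1 the ratio matches the Poisson's equidispersion, whereas the median of five draws has variance below its mean, which is the sense in which an order statistic can encode underdispersion; varying k and n is one way to read the "knobs on dispersion" mentioned above.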