🤖 AI Summary
This study revisits the ecological inference (EI) problem: estimating conditional means when only the marginal means of aggregate data are available. The authors give a new formalization of the identification conditions, show that these conditions fail in common cases, and argue that, as in causal inference, credible EI requires controlling for confounders. They also show that aggregation itself restricts the conditional expectation function to be linear in the predictor, a structure that aids estimation and clarifies how existing EI methods differ and when they produce ecological fallacies. Building on both results, they outline new methodology that flexibly controls for confounders. On datasets where the ground truth happens to be observed, covariates help, but all methods remain prone to overestimating both racial polarization and nationalized partisan voting.
📝 Abstract
Estimating conditional means using only the marginal means available from aggregate data is commonly known as the ecological inference problem (EI). We provide a reassessment of EI, including a new formalization of identification conditions and a demonstration of how these conditions fail to hold in common cases. The identification conditions reveal that, similar to causal inference, credible ecological inference requires controlling for confounders. The aggregation process itself creates additional structure to assist in estimation by restricting the conditional expectation function to be linear in the predictor variable. A linear model perspective also clarifies the differences between the EI methods commonly used in the literature, and when they lead to ecological fallacies. We provide an overview of new methodology which builds on both the identification and linearity results to flexibly control for confounders and yield improved ecological inferences. Finally, using datasets for common EI problems in which the ground truth is fortuitously observed, we show that, while covariates can help, all methods are prone to overestimating both racial polarization and nationalized partisan voting.
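The linearity point in the abstract can be illustrated with a small simulation. The sketch below is not the paper's methodology: it uses the classic linear regression of aggregate outcomes on group shares (often called Goodman's regression) on synthetic data. All names (`goodman_regression`, `true_group_means`, the rate values 0.3 and 0.7, the number of units) are assumptions made for illustration. Scenario A draws group-specific rates independently of group composition; scenario B lets one group's rate drift with composition, a simple stand-in for confounding.

```python
import numpy as np

def goodman_regression(x_share, y_mean):
    """Regress unit-level outcome means on group shares, with no intercept.

    Within each unit, y = (1 - x) * rate0 + x * rate1 holds exactly, so the
    two coefficients estimate the group-0 and group-1 conditional means.
    """
    design = np.column_stack([1.0 - x_share, x_share])
    coef, *_ = np.linalg.lstsq(design, y_mean, rcond=None)
    return float(coef[0]), float(coef[1])

def true_group_means(x_share, rate0, rate1):
    """Ground-truth conditional means, assuming equally sized units."""
    m0 = np.sum((1.0 - x_share) * rate0) / np.sum(1.0 - x_share)
    m1 = np.sum(x_share * rate1) / np.sum(x_share)
    return float(m0), float(m1)

rng = np.random.default_rng(0)
n_units = 2000  # hypothetical number of precincts
x_share = rng.uniform(0.05, 0.95, n_units)  # share of group 1 in each unit

# Scenario A: rates vary across units but are independent of composition.
rate0_a = 0.3 + rng.normal(0.0, 0.05, n_units)
rate1_a = 0.7 + rng.normal(0.0, 0.05, n_units)
y_a = (1.0 - x_share) * rate0_a + x_share * rate1_a
est_a = goodman_regression(x_share, y_a)
truth_a = true_group_means(x_share, rate0_a, rate1_a)

# Scenario B: the group-0 rate rises with the group-1 share (confounded).
rate0_b = 0.3 + 0.3 * x_share
rate1_b = np.full(n_units, 0.7)
y_b = (1.0 - x_share) * rate0_b + x_share * rate1_b
est_b = goodman_regression(x_share, y_b)
truth_b = true_group_means(x_share, rate0_b, rate1_b)

print("A estimate vs truth:", est_a, truth_a)
print("B estimate vs truth:", est_b, truth_b)
print("B gap, estimated vs true:", est_b[1] - est_b[0], truth_b[1] - truth_b[0])
```

In scenario A the regression should land close to the true group means. In scenario B the estimated gap between groups exceeds the true gap, which is the kind of overstated polarization that uncontrolled confounding can produce in this toy setup.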