Data Integration With Biased Summary Data via Generalized Entropy Balancing

📅 2025-06-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper addresses the challenge of integrating individual-level data with external aggregate data that exhibit distributional shift, a setting in which conventional methods assume identical distributions across data sources. It proposes Generalized Entropy Balancing (GEB), a framework that enables unbiased causal effect estimation without requiring distributional homogeneity between internal and external data. GEB combines inverse probability weighting with moment-matching constraints in a single optimization framework, achieves double robustness, and allows its applicability to be assessed directly from the observed data. In an empirical study of nationwide public-access automated external defibrillator deployment in Japan, GEB improves estimation accuracy and mitigates the bias induced by distributional shift in the external data.

📝 Abstract
Statistical methods for integrating individual-level data with external summary data have attracted attention because of their potential to reduce data collection costs. Summary data are often accessible through public sources and relatively easy to obtain, making them a practical resource for enhancing the precision of statistical estimation. Typically, these methods assume that internal and external data originate from the same underlying distribution. However, when this assumption is violated, incorporating external data introduces the risk of bias, primarily due to differences in background distributions between the current study and the external source. In practical applications, the primary interest often lies not in statistical quantities related specifically to the external data distribution itself, but in the individual-level internal data. In this paper, we propose a methodology based on generalized entropy balancing, designed to integrate external summary data even if derived from biased samples. Our method demonstrates double robustness, providing enhanced protection against model misspecification. Importantly, the applicability of our method can be assessed directly from the available data. We illustrate the versatility and effectiveness of the proposed estimator through an analysis of Nationwide Public-Access Defibrillation data in Japan.
Problem

Research questions and friction points this paper is trying to address.

Integrating biased summary data with individual-level data
Addressing bias from differing internal and external data distributions
Ensuring robustness in statistical estimation with model misspecification
Innovation

Methods, ideas, or system contributions that make the work stand out.

Generalized entropy balancing for biased data
Double robustness against model misspecification
Applicability assessed from available data
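
To make the moment-matching idea concrete, here is a minimal sketch of classic entropy balancing, not the generalized estimator proposed in the paper. It assumes the external summary data consist only of covariate means, and it finds weights for the internal sample that stay as close to uniform as possible (in Kullback-Leibler divergence) while reproducing those means. The function name `entropy_balance`, the simulated data, and the target values are all hypothetical and chosen purely for illustration.

```python
import numpy as np


def entropy_balance(X, target_means, n_iter=100, tol=1e-8):
    """Weights closest to uniform (in KL divergence) whose weighted
    covariate means equal target_means. Solved via the convex dual."""
    Z = X - target_means              # constraint becomes sum_i w_i * Z_i = 0
    lam = np.zeros(Z.shape[1])        # one dual variable per moment constraint

    def dual(l):
        s = Z @ l
        m = s.max()                   # log-sum-exp with max shift for stability
        return m + np.log(np.exp(s - m).sum())

    def weights(l):
        s = Z @ l
        w = np.exp(s - s.max())
        return w / w.sum()

    for _ in range(n_iter):
        w = weights(lam)
        grad = w @ Z                  # current weighted moment imbalance
        if np.max(np.abs(grad)) < tol:
            break
        # Hessian of the dual is the weighted covariance of Z
        H = (Z * w[:, None]).T @ Z - np.outer(grad, grad)
        step = np.linalg.solve(H, grad)
        # Backtracking line search keeps the Newton iteration stable
        t, f0, slope = 1.0, dual(lam), grad @ step
        while dual(lam - t * step) > f0 - 1e-4 * t * slope and t > 1e-8:
            t *= 0.5
        lam = lam - t * step
    return weights(lam)


# Simulated internal sample and hypothetical external covariate means
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
target = np.array([0.3, -0.2])
w = entropy_balance(X, target)
balanced_means = w @ X                # should reproduce `target`
```

The resulting weights can then be used in a weighted mean of an outcome, for example `w @ y` for an outcome vector `y` measured on the internal sample. The paper's method goes further by allowing the external summary data to come from a biased sample, which this sketch does not handle.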
🔎 Similar Papers
Kosuke Morikawa
Iowa State University
Incomplete data analysis, missing data analysis, semiparametric estimation, model selection
S. Komukai
Department of Health Data Science, Tokyo Medical University
Satoshi Hattori
Department of Biomedical Statistics, The University of Osaka, Integrated Frontier Research for Medical Science Division, Institute for Open and Transdisciplinary Research Initiatives (OTRI), The University of Osaka