Federated generalized linear mixed models based on one-time shared summary statistics

📅 2026-05-02

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study addresses limitations of existing methods for estimating generalized linear mixed models (GLMMs) without individual-level data: ecological bias, inability to handle heterogeneity, and the need for iterative communication. The authors propose generating pseudo-data from summary statistics that are shared only once, and they demonstrate the approach on linear, logistic, and Poisson mixed models. The pseudo-data are constructed so that their summary statistics match those of the actual but unavailable data, and they replace the actual data in model estimation. The resulting estimates are identical to those from the actual data up to the third decimal place, with similar bias, coverage, and prediction performance. Because only a single exchange of aggregated information is required, the approach is more communication- and resource-efficient than existing methods.

📝 Abstract

Data privacy has increasingly become a daunting challenge because it limits data availability, which is essential in estimating statistical models such as generalized linear mixed models. Access to personal data often involves considerable time, effort, and paperwork, which can impede research progress and collaboration. Existing approaches that do not use individual-level data for model estimation are either prone to ecological bias, cannot handle heterogeneity, or require iterative communication. In this paper, we propose an approach to estimate generalized linear mixed models based on summary statistics shared only once. We used linear, logistic, and Poisson mixed models as examples to demonstrate the methodology. Our strategy involves generating pseudo-data whose summary statistics match those of the actual but unavailable data. These pseudo-data are then used for model estimation instead of the actual data. The estimates we achieve are identical (up to the third decimal place) to those derived from actual data and have similar bias, coverage, and prediction performance. Communication and resource efficiency distinguish our approach from existing methods.
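The idea of pseudo-data whose summary statistics match those of the actual data can be illustrated with a minimal sketch. This is not the authors' algorithm: it assumes the shared summaries are simply a sample mean vector, a sample covariance matrix, and a sample size, and it covers only continuous variables. The function name `pseudo_data_from_moments` and its parameters are hypothetical, chosen for illustration.

```python
import numpy as np


def pseudo_data_from_moments(mean, cov, n, seed=0):
    """Return an (n, p) array whose sample mean and sample covariance
    (denominator n - 1) equal `mean` and `cov` up to floating-point error.

    Illustrative assumption: `mean`, `cov`, and `n` are the only summaries
    shared, and `cov` is positive definite.
    """
    mean = np.asarray(mean, dtype=float)
    cov = np.asarray(cov, dtype=float)
    p = mean.shape[0]
    if n <= p:
        raise ValueError("n must exceed the number of variables")

    rng = np.random.default_rng(seed)
    z = rng.standard_normal((n, p))

    # Center the draws so the sample mean is exactly zero.
    z -= z.mean(axis=0)

    # Whiten the draws so the sample covariance is exactly the identity.
    s = z.T @ z / (n - 1)
    l_z = np.linalg.cholesky(s)
    z = np.linalg.solve(l_z, z.T).T

    # Recolor with the target covariance, then shift to the target mean.
    l_target = np.linalg.cholesky(cov)
    return mean + z @ l_target.T


def ols(data):
    """Least-squares fit of the last column on the other columns plus an intercept."""
    x = np.column_stack([np.ones(len(data)), data[:, :-1]])
    coef, *_ = np.linalg.lstsq(x, data[:, -1], rcond=None)
    return coef
```

Because ordinary least squares depends on the data only through the sample size, means, and covariances, `ols` applied to such pseudo-data returns the same coefficients as `ols` applied to the actual data, up to floating-point error. Mixed models with random effects, and the logistic and Poisson cases, need more than this sketch provides.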

Problem

Research questions and friction points this paper is trying to address.

federated learning

generalized linear mixed models

data privacy

summary statistics

ecological bias

Innovation

Methods, ideas, or system contributions that make the work stand out.

federated learning

generalized linear mixed models

summary statistics