A Communication-Efficient Distributed Algorithm for Learning with Heterogeneous and Structurally Incomplete Multi-Site Data

📅 2025-12-26
🤖 AI Summary
In multi-center biomedical studies, data are decentralized, privacy-sensitive, and exhibit dual heterogeneity: non-IID distributions across sites and feature heterogeneity (i.e., missing or mismatched covariates). To address these challenges, the authors propose a communication-efficient distributed learning framework that avoids sharing raw data. The approach models both distributional and structural heterogeneity through a density-tilted generalized method of moments (GMM) estimator, which enables heterogeneity-aware global statistical inference. The authors establish theoretical guarantees: the proposed estimator is consistent and asymptotically normal under mild regularity conditions. In simulations with concurrent distributional shift and feature heterogeneity, the method is reported to be more statistically efficient and more robust than existing distributed algorithms.

📝 Abstract
In multicenter biomedical research, integrating data from multiple decentralized sites yields more robust and generalizable findings because of the larger combined sample size and the ability to account for between-site heterogeneity. However, sharing individual-level data across sites is often difficult due to patient privacy concerns and regulatory restrictions. To overcome this challenge, many distributed algorithms have been proposed that fit a global model by communicating only aggregated information across sites. A major challenge in applying existing distributed algorithms to real-world data is that their validity often relies on the assumption that data across sites are independent and identically distributed, which is frequently violated in practice. In biomedical applications, data distributions across clinical sites can be heterogeneous. Additionally, the set of covariates available at each site may vary due to different data collection protocols. We propose a distributed inference framework for data integration in the presence of both distributional heterogeneity and structural heterogeneity in the data. By modeling heterogeneous and structurally missing data using a density-tilted generalized method of moments, we develop a general distributed algorithm based on aggregated data that is communication-efficient and heterogeneity-aware. We establish the asymptotic properties of our estimator and demonstrate the validity of our method via simulation studies.
Problem

Research questions and friction points this paper is trying to address.

Develops a distributed algorithm for multi-site data integration
Addresses data heterogeneity and structural incompleteness across sites
Ensures communication efficiency while maintaining statistical validity
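The communication pattern described above, in which sites send only aggregated information to a coordinator, can be illustrated with a minimal sketch. It assumes an ordinary linear model with the same covariates at every site and no distribution shift, so it deliberately ignores the heterogeneity the paper targets. The function names `site_summary` and `combine` and the simulated data are hypothetical and are not taken from the paper.

```python
import numpy as np

def site_summary(X, y):
    """Computed locally at a site: only these aggregates leave the site."""
    return X.T @ X, X.T @ y

def combine(summaries):
    """Coordinator: sum the site aggregates and solve the normal equations."""
    xtx = sum(s[0] for s in summaries)
    xty = sum(s[1] for s in summaries)
    return np.linalg.solve(xtx, xty)

# Simulated data for three hypothetical sites (illustration only).
rng = np.random.default_rng(0)
beta_true = np.array([1.0, -2.0, 0.5])
sites = []
for n in (200, 350, 150):
    X = rng.normal(size=(n, 3))
    y = X @ beta_true + rng.normal(scale=0.1, size=n)
    sites.append((X, y))

# One round of communication: each site sends a 3x3 matrix and a 3-vector.
beta_hat = combine([site_summary(X, y) for X, y in sites])
```

In this homogeneous special case the combined estimate coincides with least squares on the pooled individual-level data, which is why a single round of aggregate exchange suffices. Under heterogeneous distributions or mismatched covariate sets this simple pooling no longer applies directly.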
Innovation

Methods, ideas, or system contributions that make the work stand out.

Density-tilted generalized method of moments modeling
Communication-efficient aggregated data-based distributed algorithm
Heterogeneity-aware framework for structurally incomplete data
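To give some intuition for density tilting, the sketch below uses exponential tilting, one common form in which a reference sample is reweighted by weights proportional to exp(gamma * x) so that a chosen moment matches a target value. Whether the paper uses exactly this form is not stated in this summary, so the function `tilt_weights`, the single scalar covariate, and the target mean of 0.5 are all assumptions made for illustration.

```python
import numpy as np

def tilt_weights(x, target_mean, iters=50, tol=1e-10):
    """Find gamma so that weights proportional to exp(gamma * x)
    give a weighted mean of x equal to target_mean (Newton's method)."""
    gamma = 0.0
    for _ in range(iters):
        z = gamma * x
        w = np.exp(z - z.max())        # subtract max for numerical stability
        w /= w.sum()
        mean = w @ x
        var = w @ (x - mean) ** 2      # derivative of the tilted mean in gamma
        step = (target_mean - mean) / var
        gamma += step
        if abs(step) < tol:
            break
    z = gamma * x
    w = np.exp(z - z.max())
    return w / w.sum(), gamma

# Hypothetical reference sample whose mean is near 0; tilt it toward mean 0.5.
rng = np.random.default_rng(1)
x = rng.normal(size=2000)
w, gamma = tilt_weights(x, target_mean=0.5)
```

The tilted weights describe how a site's distribution could differ from the reference distribution through a low-dimensional parameter (`gamma` here), which is the kind of structure that lets between-site differences be summarized by a few numbers instead of by individual-level records.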
Xiaokang Liu
Department of Statistics and Data Science, University of Missouri
Yuchen Yang
Department of Biostatistics, Epidemiology and Informatics, University of Pennsylvania
Yifei Sun
Department of Biostatistics, Columbia University
Jiang Bian
Department of Biostatistics and Health Data Science, Indiana University
Yanyuan Ma
Department of Statistics, The Pennsylvania State University
Raymond J. Carroll
Texas A&M University
Statistics, Epidemiology
Yong Chen
Department of Biostatistics, Epidemiology and Informatics, University of Pennsylvania