🤖 AI Summary
This paper addresses the challenge of statistical inference for nonsmooth M-estimators, such as quantile regression and AUC maximization, in federated learning, where raw data cannot be shared across sites. We propose a privacy-preserving distributed inference framework comprising two-stage random perturbation, heterogeneity-aware adaptive source-site selection, lasso-weighted estimator fusion, and MCMC sampling. Theoretically, the estimator is proven to be consistent and asymptotically normal, and it possesses the oracle property, thereby avoiding negative transfer and achieving asymptotically optimal efficiency. Extensive simulations and a real-data analysis of type 2 diabetes demonstrate that the method substantially improves both parameter and variance estimation accuracy over existing approaches, particularly under strong data heterogeneity and limited local sample sizes.
📝 Abstract
We propose a novel sampling-based federated learning framework for statistical inference on M-estimators with non-smooth objective functions, which frequently arise in modern statistical applications such as quantile regression and AUC maximization. Classical inference methods for such estimators are often computationally intensive or require nonparametric estimation of nuisance quantities. Our approach circumvents these challenges by leveraging Markov chain Monte Carlo (MCMC) sampling and a second-stage perturbation scheme to efficiently estimate both the parameter of interest and its variance. In the presence of multiple sites with data-sharing constraints, we introduce an adaptive strategy to borrow information from potentially heterogeneous source sites without transferring individual-level data. This strategy selects source sites based on a dissimilarity measure and constructs an optimally weighted estimator using lasso regularization. The resulting estimator has an oracle property, i.e., it achieves the optimal asymptotic efficiency by borrowing information from eligible sites while guarding against negative transfer. We establish consistency and asymptotic normality of our proposed estimators and validate the method through extensive simulations and a real-data application on type 2 diabetes. Our results demonstrate substantial gains in inference precision and underscore the importance of inclusive, data-adaptive analysis frameworks in federated learning settings.
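To make the sampling idea concrete, the following is a minimal single-site sketch of how MCMC can stand in for direct optimization of a non-smooth objective. It is a generic random-walk Metropolis sampler on a quasi-posterior proportional to exp(-check loss) with a flat prior, applied to quantile regression. It is not the paper's algorithm: the perturbation scheme, source-site selection, and lasso weighting are omitted, and the function names, step size, and chain length are illustrative assumptions.

```python
import numpy as np


def check_loss(theta, X, y, tau):
    """Total quantile check loss: sum of rho_tau(y - X @ theta)."""
    r = y - X @ theta
    return np.sum(r * (tau - (r < 0)))


def quasi_posterior_mcmc(X, y, tau=0.5, n_iter=5000, burn_in=1000,
                         step=0.05, seed=0):
    """Random-walk Metropolis on exp(-check_loss) with a flat prior.

    Hypothetical helper for illustration only; tuning values are assumptions.
    Returns the post-burn-in draws as an array of shape (n_iter - burn_in, p).
    """
    rng = np.random.default_rng(seed)
    p = X.shape[1]
    # Start from the least-squares fit so the chain begins near the mode.
    theta = np.linalg.lstsq(X, y, rcond=None)[0]
    log_target = -check_loss(theta, X, y, tau)
    draws = np.empty((n_iter, p))
    for t in range(n_iter):
        proposal = theta + step * rng.standard_normal(p)
        log_target_prop = -check_loss(proposal, X, y, tau)
        # Metropolis accept/reject step for a symmetric proposal.
        if np.log(rng.uniform()) < log_target_prop - log_target:
            theta, log_target = proposal, log_target_prop
        draws[t] = theta
    return draws[burn_in:]
```

Averaging the returned draws (`draws.mean(axis=0)`) gives a point estimate without ever differentiating the check loss. The spread of these draws should not be read as a valid variance estimate on its own; the abstract attributes variance estimation to the additional perturbation stage, which this sketch does not implement.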