🤖 AI Summary
This paper addresses the slow convergence and poor generalization in federated learning (FL) caused by statistical and system heterogeneity. The authors propose the first implicit zeroth-order bilevel optimization framework for heterogeneous FL that is gradient-free and does not require a bounded gradient-dissimilarity assumption. The method formulates heterogeneous FL as a stochastic zeroth-order bilevel optimization problem: the upper level optimizes the global model, supporting server-side pretraining and non-standard aggregation, while the lower level models personalized local training, accommodating heterogeneous numbers of local steps and constraint-aware updates. Theoretically, the paper establishes the first non-asymptotic convergence rate and an almost-sure asymptotic convergence guarantee for such a framework. Empirically, the method significantly outperforms state-of-the-art heterogeneous FL approaches on image classification tasks, demonstrating strong robustness to both data distribution shifts and system-level delays.
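The bilevel structure described above can be sketched in a generic form (the notation here is illustrative and is not taken from the paper itself):

```latex
\min_{x \in \mathbb{R}^d} \; F(x) := \frac{1}{m} \sum_{i=1}^{m} f_i\!\left(x,\, y_i^{*}(x)\right)
\quad \text{s.t.} \quad
y_i^{*}(x) \in \arg\min_{y_i \in \mathcal{Y}_i} \; g_i(x, y_i),
\qquad i = 1, \dots, m,
```

where $x$ is the server's global model, $y_i^{*}(x)$ is client $i$'s personalized model obtained from local training over its constraint set $\mathcal{Y}_i$, and $f_i$, $g_i$ are upper- and lower-level objectives. In this reading, non-standard aggregation and pretraining live in the upper-level map $F$, while heterogeneous local steps and client constraints are absorbed into the lower-level problems.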
📝 Abstract
Heterogeneity in federated learning (FL) is a critical and challenging aspect that significantly impacts model performance and convergence. In this paper, we propose a novel framework by formulating heterogeneous FL as a hierarchical optimization problem. This new framework captures both the local and global training processes through a bilevel formulation and is capable of the following: (i) addressing client heterogeneity through a personalized learning framework; (ii) capturing the pre-training process on the server side; (iii) updating the global model through nonstandard aggregation; (iv) allowing for nonidentical local steps; and (v) capturing clients' local constraints. We design and analyze an implicit zeroth-order FL method (ZO-HFL), with nonasymptotic convergence guarantees for both the server-agent and the individual client-agents, as well as asymptotic guarantees for both in an almost-sure sense. Notably, our method does not rely on standard assumptions in heterogeneous FL, such as the bounded gradient dissimilarity condition. We implement our method on image classification tasks and compare it with other methods under different heterogeneous settings.
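The core primitive behind any zeroth-order method like ZO-HFL is estimating a gradient from function evaluations alone. The sketch below shows a standard two-point random-direction estimator; it illustrates the general zeroth-order idea only, and the function name, sampling scheme, and parameters are this sketch's own choices, not the paper's algorithm:

```python
import numpy as np

def zo_gradient(f, x, mu=1e-4, num_samples=20, rng=None):
    """Two-point zeroth-order estimate of the gradient of f at x.

    Averages directional finite differences along random Gaussian
    directions u: (f(x + mu*u) - f(x - mu*u)) / (2*mu) * u.
    Only function values are needed; no analytic gradient of f.
    """
    rng = np.random.default_rng(rng)
    d = x.shape[0]
    grad = np.zeros(d)
    for _ in range(num_samples):
        u = rng.standard_normal(d)          # random search direction
        diff = (f(x + mu * u) - f(x - mu * u)) / (2 * mu)
        grad += diff * u
    return grad / num_samples
```

On a smooth objective the estimate concentrates around the true gradient as `num_samples` grows, which is what lets a zeroth-order scheme drive both the lower-level local updates and the upper-level server updates without exchanging or bounding gradients.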