🤖 AI Summary
In cross-institutional, siloed federated learning (FL), participant failures—such as communication disruptions or configuration errors—are frequent, yet their impact on model quality remains poorly understood, especially in small-scale, non-IID settings.
Method: This work presents the first empirical study of the multidimensional impact of participant failures, systematically analyzing three critical factors: failure timing, degree of local data skew, and evaluation bias.
Contribution/Results: We find that failure timing significantly affects convergence behavior and final model performance. Under high data skew, standard evaluation metrics exhibit optimistic bias, obscuring the true extent of model degradation. Crucially, conventional offline evaluation proves unreliable for deployment decisions, often yielding misleading conclusions. Our study provides a reproducible analytical framework and actionable design insights for building robust FL systems—highlighting the need for failure-aware evaluation protocols and adaptive aggregation strategies in heterogeneous, real-world FL deployments.
📝 Abstract
Federated learning (FL) is a new paradigm for training machine learning (ML) models without sharing data. When applying FL in cross-silo scenarios, where organizations collaborate, the FL system must be reliable; however, participants can fail for various reasons (e.g., communication issues or misconfigurations). Building a reliable system therefore requires analyzing the impact of participant failures. While this problem has received attention in cross-device FL, where resource-constrained mobile devices participate, there is comparatively little research on cross-silo FL.
Therefore, we conduct an extensive study analyzing the impact of participant failures on model quality in inter-organizational cross-silo FL with few participants. We focus on generally influential factors, namely the timing of a failure and the local data, as well as the impact on the evaluation, which is important for deciding whether the model should be deployed. We show that under high data skew the evaluation is optimistic and hides the real impact of a failure. Furthermore, we demonstrate that the timing of a failure affects the quality of the trained model. Our results offer insights for researchers and software architects aiming to build robust FL systems.
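The setting described above can be illustrated with a minimal simulation: a FedAvg-style loop over a few silos with skewed (non-IID) local data, where one participant drops out at a configurable round. This is a hedged sketch, not the paper's actual experimental code; all names (`make_client`, `fail_at`, the `shift` skew parameter, the toy linear-regression task) are illustrative assumptions.

```python
import numpy as np

# Toy FedAvg sketch: three silos, one participant fails at round `fail_at`.
# Hypothetical illustration of the study setting, not the paper's code.

rng = np.random.default_rng(0)
TRUE_W = np.array([2.0, -1.0])  # ground-truth model for the toy task

def make_client(n, shift):
    """Generate one silo's data; `shift` skews the feature distribution (non-IID)."""
    X = rng.normal(loc=shift, scale=1.0, size=(n, 2))
    y = X @ TRUE_W + rng.normal(scale=0.1, size=n)
    return X, y

def local_step(w, X, y, lr=0.05, epochs=5):
    """A few local gradient-descent epochs on the silo's own data."""
    for _ in range(epochs):
        grad = 2 * X.T @ (X @ w - y) / len(y)
        w = w - lr * grad
    return w

def train(clients, rounds=30, fail_at=None, failed=frozenset()):
    """FedAvg loop; clients in `failed` stop participating from round `fail_at` on."""
    w = np.zeros(2)
    for r in range(rounds):
        active = [i for i in range(len(clients))
                  if not (fail_at is not None and r >= fail_at and i in failed)]
        updates = [local_step(w.copy(), *clients[i]) for i in active]
        w = np.mean(updates, axis=0)  # FedAvg: average the local models
    return w

# Three silos with skewed feature distributions; silo 2 fails early vs. late.
clients = [make_client(200, shift) for shift in (-2.0, 0.0, 2.0)]
w_early = train(clients, fail_at=2, failed=frozenset({2}))
w_late = train(clients, fail_at=25, failed=frozenset({2}))
print("early-failure model error:", np.linalg.norm(w_early - TRUE_W))
print("late-failure model error: ", np.linalg.norm(w_late - TRUE_W))
```

Comparing the two runs shows how failure timing changes the aggregated model; evaluating such models only on the remaining silos' data is exactly the kind of skew-induced optimistic evaluation the study warns about.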