🤖 AI Summary
This work addresses the challenges of large-scale model training in scientific applications, where data are often distributed across institutions due to privacy, sovereignty, or volume constraints, and federated learning across supercomputing centers faces significant heterogeneity and scheduling complexity. The authors propose a cross-facility federated learning framework tailored for heterogeneous high-performance computing (HPC) environments, integrating APPFL with Globus Compute and Transfer to orchestrate computation tasks and manage data movement. They demonstrate the first successful federated fine-tuning of large language models across four U.S. Department of Energy leadership-class supercomputers using a chemistry instruction dataset, confirming the feasibility of such training at scale. Their experiments further reveal the critical impact of HPC system heterogeneity on performance, pointing toward a new direction in scheduling-aware algorithm design for federated learning in HPC settings.
📝 Abstract
Artificial Intelligence for scientific applications increasingly requires training large models on data that cannot be centralized due to privacy constraints, data sovereignty, or the sheer volume of data generated. Federated learning (FL) addresses this by enabling collaborative training without centralizing raw data, but scientific applications demand model scales that require extensive computing resources, typically offered at High Performance Computing (HPC) facilities. Deploying FL experiments across HPC facilities introduces challenges beyond those of cloud or enterprise settings. We present a comprehensive cross-facility FL framework for heterogeneous HPC environments, built on the Advanced Privacy-Preserving Federated Learning (APPFL) framework with Globus Compute and Transfer orchestration, and evaluate it across four U.S. Department of Energy (DOE) leadership-class supercomputers. We demonstrate that FL experiments across HPC facilities are practically achievable, characterize key sources of heterogeneity impacting training performance, and show that algorithmic choices matter significantly under realistic HPC scheduling conditions. We validate the scientific applicability by fine-tuning a large language model on a chemistry instruction dataset, and identify scheduler-aware algorithm design as a critical open challenge for future deployments.
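To make the FL setup concrete, below is a minimal sketch of the server-side aggregation step at the heart of synchronous federated learning (classic FedAvg: a dataset-size-weighted average of client parameters). This is a generic illustration, not APPFL's actual API; the function name `fedavg` and the dict-of-arrays parameter representation are assumptions for the example. It is this synchronous aggregation point that HPC scheduling heterogeneity stresses, since the server must wait for the slowest facility's job to run.

```python
import numpy as np

def fedavg(client_updates, client_sizes):
    """Dataset-size-weighted average of client parameters (classic FedAvg).

    client_updates: list of dicts mapping parameter name -> np.ndarray,
                    one dict per participating facility/client
    client_sizes:   list of local dataset sizes, used as aggregation weights
    """
    total = float(sum(client_sizes))
    weights = [n / total for n in client_sizes]
    aggregated = {}
    for name in client_updates[0]:
        # Weighted sum across clients for each parameter tensor
        aggregated[name] = sum(
            w * update[name] for w, update in zip(weights, client_updates)
        )
    return aggregated

# Two toy "facilities" with unequal local data volumes (weights 0.25 and 0.75)
updates = [
    {"w": np.array([1.0, 1.0])},
    {"w": np.array([3.0, 3.0])},
]
global_model = fedavg(updates, client_sizes=[1, 3])
print(global_model["w"])  # -> [2.5 2.5]
```

In a cross-facility deployment, each entry of `client_updates` would arrive from a remote HPC site (in this work, orchestrated via Globus Compute tasks and Globus Transfer for data movement) rather than from in-process workers.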