Flotilla: A scalable, modular and resilient federated learning framework for heterogeneous resources

📅 2025-07-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing federated learning (FL) frameworks predominantly rely on pseudo-distributed simulation rather than deployment on real edge hardware, and so fail to meaningfully evaluate the federated, systems-level properties of FL; most are also not designed for asynchronous aggregation and have limited resilience to client and server failures. To address these limitations, the paper proposes Flotilla, a scalable, lightweight FL framework for heterogeneous edge resources. It adopts a "user-first" modular architecture with stateless clients and a server design that separates out the session state, which is periodically or incrementally checkpointed; this enables both synchronous and asynchronous aggregation strategies, independent of the DNN architecture, with failover within seconds. Experiments demonstrate client- and server-side fault tolerance with 200+ clients, resource usage on Raspberry Pi and Nvidia Jetson edge devices comparable to or better than Flower, OpenFL, and FedML, and significantly better scalability than Flower at 1000+ clients.

📝 Abstract
With recent improvements in mobile and edge computing and rising concerns over data privacy, Federated Learning (FL) has rapidly gained popularity as a privacy-preserving, distributed machine learning methodology. Several FL frameworks have been built for testing novel FL strategies. However, most focus on validating the learning aspects of FL through pseudo-distributed simulation rather than deploying on real edge hardware in a distributed manner, and so fail to meaningfully evaluate the federated aspects from a systems perspective. Current frameworks are also inherently not designed to support asynchronous aggregation, which is gaining popularity, and have limited resilience to client and server failures. We introduce Flotilla, a scalable and lightweight FL framework. It adopts a "user-first" modular design to help rapidly compose various synchronous and asynchronous FL strategies while being agnostic to the DNN architecture. It uses stateless clients and a server design that separates out the session state, which is periodically or incrementally checkpointed. We demonstrate the modularity of Flotilla by evaluating five different FL strategies for training five DNN models. We also evaluate client- and server-side fault tolerance on 200+ clients, and showcase its ability to fail over within seconds. Finally, we show that Flotilla's resource usage on Raspberry Pis and Nvidia Jetson edge accelerators is comparable to or better than three state-of-the-art FL frameworks: Flower, OpenFL and FedML. It also scales significantly better than Flower for 1000+ clients. This positions Flotilla as a competitive candidate for building novel FL strategies, comparing them uniformly, rapidly deploying them, and performing systems research and optimizations.
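To make the abstract's "user-first" composition of synchronous and asynchronous strategies concrete, here is a minimal sketch of what pluggable aggregation might look like. All names (`ClientUpdate`, `fed_avg`, `async_merge`) and the staleness-damping formula are illustrative assumptions for this page, not Flotilla's actual API.

```python
# Hypothetical sketch of pluggable FL aggregation strategies, in the spirit
# of the "user-first" modular design the abstract describes. These names do
# NOT come from Flotilla's codebase.
from dataclasses import dataclass

@dataclass
class ClientUpdate:
    client_id: str
    weights: list[float]   # flattened model weights (toy stand-in for a DNN)
    num_samples: int
    round_trained: int     # round this client's update was based on

def fed_avg(updates: list[ClientUpdate]) -> list[float]:
    """Synchronous FedAvg: sample-weighted mean over one round's updates."""
    total = sum(u.num_samples for u in updates)
    dim = len(updates[0].weights)
    return [sum(u.weights[i] * u.num_samples for u in updates) / total
            for i in range(dim)]

def async_merge(global_w: list[float], u: ClientUpdate,
                current_round: int, base_lr: float = 0.5) -> list[float]:
    """Asynchronous merge: blend a single (possibly stale) client update
    into the global model, damping its weight by staleness."""
    staleness = max(0, current_round - u.round_trained)
    alpha = base_lr / (1 + staleness)
    return [(1 - alpha) * g + alpha * w for g, w in zip(global_w, u.weights)]
```

Because both strategies consume the same `ClientUpdate` record, a framework can swap one for the other without changing the client side, which is what makes stateless clients with server-side strategy selection attractive.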
Problem

Research questions and friction points this paper is trying to address.

Lack of FL frameworks for real edge hardware deployment
Insufficient support for asynchronous FL aggregation
Limited resilience to client and server failures
Innovation

Methods, ideas, or system contributions that make the work stand out.

Modular design for diverse FL strategies
Stateless clients with resilient server architecture
Scalable performance on edge devices
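The "stateless clients with resilient server" idea above can be sketched as a session object that holds all round state and snapshots it to disk, so a restarted server recovers from the last checkpoint. This is a minimal illustration under assumed names (`SessionState`, a JSON checkpoint file), not Flotilla's actual implementation.

```python
# Hypothetical sketch of checkpointed session state: the server keeps all
# mutable round state in one place and periodically snapshots it, so a
# failover can resume from the last checkpoint. Names are illustrative only.
import json
from pathlib import Path

class SessionState:
    def __init__(self, path: Path):
        self.path = path
        # Everything a round needs lives here, not on the clients.
        self.state = {"round": 0, "global_weights": [0.0], "pending": []}

    def checkpoint(self) -> None:
        """Periodic full snapshot; an incremental variant would append deltas."""
        self.path.write_text(json.dumps(self.state))

    @classmethod
    def recover(cls, path: Path) -> "SessionState":
        """Failover path: rebuild the session from the last checkpoint."""
        s = cls(path)
        if path.exists():
            s.state = json.loads(path.read_text())
        return s
```

Since clients are stateless, nothing on their side needs repair after a crash; recovery cost is bounded by reloading this one session snapshot, which is consistent with the failover-within-seconds claim.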
Roopkatha Banerjee
PhD Student, Indian Institute of Science (IISc)
Distributed Computing · Federated Learning · Quantum Computing · Systems for Machine Learning · Black Hole Astronomy
Prince Modi
Department of Computational and Data Sciences (CDS), Indian Institute of Science (IISc), Bangalore 560012, India
Jinal Vyas
Department of Computational and Data Sciences (CDS), Indian Institute of Science (IISc), Bangalore 560012, India
Chunduru Sri Abhijit
Department of Computational and Data Sciences (CDS), Indian Institute of Science (IISc), Bangalore 560012, India
Tejus Chandrashekar
Department of Computational and Data Sciences (CDS), Indian Institute of Science (IISc), Bangalore 560012, India
Harsha Varun Marisetty
Birla Institute of Technology and Science (BITS), Pilani, Hyderabad Campus, Hyderabad 500078, India
Manik Gupta
Associate Professor @ BITS Pilani, Hyderabad Campus
Sensor Networks · Internet of Things · Edge AI · Data Science · Applied ML
Yogesh Simmhan
Associate Professor, Indian Institute of Science
Distributed Systems · Edge Accelerators · Graph Analytics · Cloud Computing · Federated Learning