Propius: A Platform for Collaborative Machine Learning across the Edge and the Cloud

📅 2025-10-22
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Collaborative machine learning faces challenges including non-scalable resource management, poor adaptability to heterogeneous edge devices, and lack of multi-tenancy support. To address these, this paper proposes a scalable edge–cloud collaborative training platform featuring a novel decoupled control plane and data plane architecture. The platform integrates distributed scheduling, multi-tenant isolation, edge-level flow control, model distribution, and result aggregation, along with an adaptive client-side module for heterogeneous devices. It enables fine-grained resource sharing policies and cross-task coordinated scheduling, achieving high resource reuse while preserving tenant isolation. Experimental evaluation demonstrates significant improvements over state-of-the-art frameworks: 1.88× higher resource utilization, up to 2.76× higher throughput, and 1.26× reduction in task completion time.

Technology Category

Application Category

📝 Abstract
Collaborative Machine Learning is a paradigm in the field of distributed machine learning, designed to address the challenges of data privacy, communication overhead, and model heterogeneity. There have been significant advancements in optimization and communication algorithm design and ML hardware that enables fair, efficient and secure collaborative ML training. However, less emphasis is put on collaborative ML infrastructure development. Developers and researchers often build server-client systems for a specific collaborative ML use case, which is not scalable and reusable. As the scale of collaborative ML grows, the need for a scalable, efficient, and ideally multi-tenant resource management system becomes more pressing. We propose a novel system, Propius, that can adapt to the heterogeneity of client machines, and efficiently manage and control the computation flow between ML jobs and edge resources in a scalable fashion. Propius is comprised of a control plane and a data plane. The control plane enables efficient resource sharing among multiple collaborative ML jobs and supports various resource sharing policies, while the data plane improves the scalability of collaborative ML model sharing and result collection. Evaluations show that Propius outperforms existing resource management techniques and frameworks in terms of resource utilization (up to $1.88 imes$), throughput (up to $2.76$), and job completion time (up to $1.26 imes$).
Problem

Research questions and friction points this paper is trying to address.

Developing scalable infrastructure for collaborative machine learning systems
Addressing resource management challenges in multi-tenant ML environments
Optimizing computation flow between ML jobs and edge resources
Innovation

Methods, ideas, or system contributions that make the work stand out.

Propius platform manages edge-cloud ML computation flow
Control plane enables multi-tenant resource sharing policies
Data plane scales model sharing and result collection
🔎 Similar Papers
No similar papers found.