Prediction-Assisted Online Distributed Deep Learning Workload Scheduling in GPU Clusters

📅 2025-01-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
Online scheduling of mixed-parallel distributed deep learning training on GPU clusters remains challenging due to dynamic workload characteristics, communication overhead, and configuration heterogeneity. Method: This paper proposes Adaptive Shortest Remaining Processing Time (A-SRPT), the first approach to reduce multi-node collaborative scheduling to a single-machine scheduling problem that can be solved optimally. It represents DNN jobs and their training configurations as graphs, employs random forest regression to predict runtimes of recurring jobs, and applies a preemptive SRPT policy with a provably bounded competitive ratio. Contribution/Results: Experiments on real GPU clusters and large-scale simulations demonstrate that A-SRPT significantly reduces average job completion time and tail latency, improving scheduling efficiency by 23%–41% over state-of-the-art baselines; it is the first distributed training scheduler to jointly optimize communication awareness, configuration awareness, and dynamic preemption.
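The preemptive SRPT policy at the core of A-SRPT can be illustrated with a minimal single-machine simulation. This is a sketch of textbook preemptive SRPT under assumed inputs, not the paper's implementation; the `srpt_schedule` function and the job tuples are illustrative.

```python
import heapq

def srpt_schedule(jobs):
    """Simulate preemptive SRPT on a single machine.

    jobs: list of (release_time, processing_time) tuples.
    Returns the total completion time (sum of finish times).
    Preemptive SRPT is optimal for this objective on one machine,
    which is what makes the single-machine mapping a useful guide.
    """
    jobs = sorted(jobs)          # order by release time
    ready = []                   # min-heap of (remaining_time, job_id)
    t, i, total, done, n = 0, 0, 0, 0, len(jobs)
    while done < n:
        # admit every job released by the current time t
        while i < n and jobs[i][0] <= t:
            heapq.heappush(ready, (jobs[i][1], i))
            i += 1
        if not ready:            # machine idles until the next release
            t = jobs[i][0]
            continue
        rem, jid = heapq.heappop(ready)
        next_rel = jobs[i][0] if i < n else float("inf")
        if t + rem <= next_rel:  # job finishes before the next arrival
            t += rem
            total += t
            done += 1
        else:                    # preempt when the next job arrives
            heapq.heappush(ready, (rem - (next_rel - t), jid))
            t = next_rel
    return total
```

For example, with jobs released at times 0, 2, and 4 needing 7, 4, and 1 units of work, SRPT preempts the longer jobs and yields a total completion time of 24.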

📝 Abstract
The recent explosive growth of deep learning (DL) models has created a compelling need for efficient job scheduling for distributed deep learning training with mixed parallelisms (DDLwMP) in GPU clusters. This paper proposes an adaptive shortest-remaining-processing-time-first (A-SRPT) scheduling algorithm, a novel prediction-assisted online scheduling approach designed to mitigate the challenges associated with DL cluster scheduling. By modeling each job as a graph corresponding to heterogeneous Deep Neural Network (DNN) models and their associated distributed training configurations, A-SRPT strategically assigns jobs to the available GPUs, thereby minimizing inter-server communication overhead. Observing that most DDLwMP jobs recur, A-SRPT incorporates a random forest regression model to predict training iterations. Crucially, A-SRPT maps the complex scheduling problem into a single-machine instance, which is addressed optimally by a preemptive "shortest-remaining-processing-time-first" strategy. This optimized solution serves as a guide for actual job scheduling within the GPU clusters, leading to a theoretically provable competitive scheduling efficiency. We conduct extensive real-world testbed and simulation experiments to verify our proposed algorithms.
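To illustrate the communication-aware assignment idea in the abstract, consider a best-fit heuristic that keeps all of a job's GPUs on one server whenever possible, so inter-server traffic is avoided. This is a hypothetical sketch, not the paper's graph-based algorithm; the `place_job` helper and server names are invented for the example.

```python
def place_job(free_gpus, demand):
    """Best-fit, communication-avoiding placement.

    Pick the server whose free-GPU count is the smallest value
    still >= demand, so small jobs do not fragment large servers
    and each job's workers share one machine.

    free_gpus: dict server -> free GPU count (updated in place).
    demand: GPUs the job needs.
    Returns the chosen server, or None if no single server fits
    (a real scheduler would then split the job across servers).
    """
    candidates = [(free, s) for s, free in free_gpus.items() if free >= demand]
    if not candidates:
        return None
    _, server = min(candidates)   # tightest fit wins
    free_gpus[server] -= demand
    return server
```

For instance, a 4-GPU job offered servers with 8 and 4 free GPUs lands on the 4-GPU server, leaving the 8-GPU server intact for a larger job.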
Problem

Research questions and friction points this paper is trying to address.

Distributed Training
GPU Clustering
Communication Overhead
Innovation

Methods, ideas, or system contributions that make the work stand out.

A-SRPT
Prediction-assisted Scheduling
Distributed Deep Learning
Ziyue Luo
Postdoctoral Researcher, Ohio State University
Jia Liu
Dept. of ECE, The Ohio State University, USA
Myungjin Lee
Cisco Systems
N. Shroff
Dept. of ECE, The Ohio State University, USA