🤖 AI Summary
To address the high overhead, poor dynamic adaptability, and low resource utilization of conventional software schedulers in heterogeneous HPC environments, this paper proposes the first FPGA-based hardware accelerator for stochastic online scheduling. Our approach leverages hardware parallelism, reconstructs the cost function using discrete-time modeling, and employs quantized arithmetic with a greedy cost-selection strategy to achieve low-latency, high-throughput, and energy-efficient real-time scheduling. The key contribution is the first complete hardware implementation of a stochastic online scheduling policy, establishing a novel adaptive scheduling paradigm tailored to performance heterogeneity across diverse compute units. Experimental evaluation demonstrates that the accelerator achieves up to 1060× speedup over single-threaded software scheduling, significantly improves load balancing and device fairness, and effectively supports large-scale HPC workloads and deep learning training.
📝 Abstract
Efficient workload scheduling is a critical challenge in modern heterogeneous computing environments, particularly in high-performance computing (HPC) systems. Traditional software-based schedulers struggle to balance workload distribution efficiently due to high scheduling overhead, poor adaptability to dynamic workloads, and suboptimal resource utilization. These shortcomings are compounded in heterogeneous systems, where different computational elements can have vastly different performance profiles. To address these challenges, we present a novel FPGA-based accelerator for stochastic online scheduling (SOS). We adapt a greedy cost-selection assignment policy by reformulating existing cost equations to operate on discretized time before implementing them in a hardware accelerator design. Our design leverages hardware parallelism, precalculation, and precision quantization to reduce job scheduling latency. By introducing a hardware-accelerated approach to real-time scheduling, this paper establishes a new paradigm for adaptive scheduling mechanisms in heterogeneous computing systems. The proposed design achieves high throughput, low latency, and energy-efficient operation, offering a scalable alternative to traditional software scheduling methods. Experimental results demonstrate consistent workload distribution, fair machine utilization, and up to 1060× speedup over single-threaded software implementations of the scheduling policy. These results make the SOS accelerator a strong candidate for deployment in high-performance computing systems, deep-learning pipelines, and other performance-critical applications.
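To make the abstract's core idea concrete, here is a minimal software sketch of a greedy cost-selection policy over discretized time: each arriving job is assigned to the machine with the lowest estimated discrete-time completion cost. This is an illustrative assumption of the general technique, not the paper's hardware design or its exact cost equations; all names, the cost model, and the machine-speed representation are hypothetical.

```python
import math

def greedy_assign(jobs, machine_speeds, time_step=1):
    """Greedily assign each job (given as work units) to the machine
    minimizing its completion cost, with processing times quantized
    to a discrete time step (hypothetical cost model)."""
    busy = [0] * len(machine_speeds)  # accumulated busy time per machine
    assignments = []
    for work in jobs:
        def cost(m):
            # Processing time on machine m, rounded up to the time step.
            proc = math.ceil(work / machine_speeds[m] / time_step) * time_step
            return busy[m] + proc
        best = min(range(len(machine_speeds)), key=cost)
        busy[best] = cost(best)  # machine is now busy until this time
        assignments.append(best)
    return assignments

# Example: two machines, the second twice as fast. Jobs of size 4, 4, 2
# spread across both machines rather than piling onto the fast one.
print(greedy_assign([4, 4, 2], [1, 2]))  # → [1, 0, 1]
```

In hardware, the per-machine cost evaluations in the inner loop can be computed in parallel with quantized arithmetic, which is the source of the latency reduction the paper reports.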