🤖 AI Summary
To address the poor scheduling efficiency of uncertainty quantification (UQ) workflows on HPC systems, which are characterized by unknown task counts, highly heterogeneous execution times, and poor compatibility with static batch schedulers (e.g., SLURM), this paper proposes a lightweight dynamic co-scheduling framework that requires no system-level modifications. The framework integrates the UQ and Modelling Bridge (a language-agnostic interface to simulation models) with HyperQueue to enable runtime adaptive load balancing without assuming prior knowledge of task submission patterns. Evaluated on GS2 gyrokinetic plasma turbulence simulations and a Gaussian process surrogate thereof, the framework reduces scheduling overhead by up to three orders of magnitude and cuts CPU time for long-running simulations by up to 38%, significantly outperforming a pure SLURM-based approach. It thereby offers an efficient and portable scheduling paradigm for large-scale, heterogeneous UQ simulation campaigns.
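A rough sketch of what such a "no system-level changes" deployment might look like inside a single SLURM allocation: the HyperQueue server and workers run entirely in user space, and tasks are submitted to HyperQueue rather than to SLURM directly. This is an illustrative assumption, not the paper's actual scripts; `run_gs2.sh`, the resource figures, and the loop bound are hypothetical placeholders, and the exact `hq` flags should be checked against the HyperQueue documentation.

```shell
#!/bin/bash
#SBATCH --nodes=4
#SBATCH --time=04:00:00

# Start a HyperQueue server inside the allocation (user space only;
# no root access or scheduler reconfiguration needed).
hq server start &
sleep 5  # give the server a moment to come up

# One HyperQueue worker per allocated node; idle workers pull tasks
# from the server, which is what provides the dynamic load balancing.
srun --ntasks-per-node=1 hq worker start &

# Submit the UQ tasks; run_gs2.sh is a hypothetical wrapper around GS2.
for i in $(seq 1 1000); do
    hq submit --cpus=8 ./run_gs2.sh "$i"
done

hq job wait all   # block until every submitted task has finished
```

Because the server and workers live inside an ordinary batch job, the native scheduler only ever sees one (or a few) allocations, while HyperQueue handles the thousands of short-lived tasks internally.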
📝 Abstract
Uncertainty Quantification (UQ) workloads are becoming increasingly common in science and engineering. They involve the submission of thousands or even millions of similar tasks with potentially unpredictable runtimes, where the total number is usually not known a priori. A static one-size-fits-all batch script would likely lead to suboptimal scheduling, and native schedulers installed on High Performance Computing (HPC) systems, such as SLURM, often struggle to handle such workloads efficiently. In this paper, we introduce a new load balancing approach suitable for UQ workflows. To demonstrate its efficiency in a real-world setting, we focus on the GS2 gyrokinetic plasma turbulence simulator. Individual simulations can be computationally demanding, with runtimes varying significantly, from minutes to hours, depending on the high-dimensional input parameters. Our approach uses the UQ and Modelling Bridge, which offers a language-agnostic interface to a simulation model, combined with HyperQueue, which works alongside the native scheduler. In particular, deploying this framework on HPC systems does not require system-level changes. We benchmark our proposed framework against a standalone SLURM approach using GS2 and a Gaussian Process surrogate thereof. Our results demonstrate a reduction in scheduling overhead by up to three orders of magnitude and a maximum reduction of 38% in CPU time for long-running simulations compared to the naive SLURM approach, while making no assumptions about the job submission patterns inherent to UQ workflows.
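Why runtime load balancing pays off for heterogeneous runtimes can be illustrated with a small, self-contained simulation (not the paper's code): a static batch script that splits tasks into equal up-front chunks per worker is compared against a pull-based worker pool, in the spirit of HyperQueue, where the next task always goes to the earliest-free worker. The runtimes below are synthetic, chosen so that a few long simulations sit among many short ones.

```python
import heapq

def makespan_static(runtimes, n_workers):
    """Makespan when tasks are pre-assigned in equal contiguous chunks,
    one chunk per worker, as a naive static batch script would do."""
    chunk = (len(runtimes) + n_workers - 1) // n_workers
    return max(
        sum(runtimes[i:i + chunk])
        for i in range(0, len(runtimes), chunk)
    )

def makespan_dynamic(runtimes, n_workers):
    """Makespan when idle workers pull the next task from a shared queue
    (the load-balancing model of a HyperQueue-style worker pool)."""
    free_at = [0.0] * n_workers  # time at which each worker becomes idle
    heapq.heapify(free_at)
    for t in runtimes:
        earliest = heapq.heappop(free_at)  # earliest-free worker takes the task
        heapq.heappush(free_at, earliest + t)
    return max(free_at)

# 28 short tasks (1 h) and 4 long ones (10 h), scheduled on 4 workers.
runtimes = [1.0] * 28 + [10.0] * 4
print(makespan_static(runtimes, 4))   # → 44.0 (one chunk holds all long tasks)
print(makespan_dynamic(runtimes, 4))  # → 17.0 (long tasks spread across workers)
```

The static split is hostage to where the long tasks happen to land in the task list, whereas the pull-based pool needs no prior knowledge of per-task runtimes, which mirrors the paper's premise that UQ task patterns are not known a priori.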