Minos: Systematically Classifying Performance and Power Characteristics of GPU Workloads on HPC Clusters

📅 2026-04-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the growing diversity of GPU workloads in high-performance computing (HPC) clusters, where per-application profiling scales poorly and makes joint optimization of performance and power difficult. The authors propose Minos, a classification framework that uses low-cost profiling to identify workloads with similar power and performance characteristics and group them into a finite number of classes, reducing the profiling needed for new applications. Across 18 graph analytics, HPC, HPC+ML, and ML workloads, Minos achieves a mean error of 4% for power predictions and 3% for performance predictions, a 10% improvement over state-of-the-art approaches. When predicting frequency capping behavior for a previously unseen application, it reduces profiling time by 89%.
📝 Abstract
As large-scale HPC compute clusters increasingly adopt accelerators such as GPUs to meet the voracious demands of modern workloads, these clusters are increasingly becoming power constrained. Unfortunately, modern applications can often temporarily exceed the power ratings of the accelerators ("power spikes"). Thus, current and future HPC systems must optimize for both power and performance together. However, this is made difficult by increasingly diverse applications, which often require bespoke optimizations to run efficiently on each cluster. Traditionally, researchers overcome this problem by profiling applications on specific clusters and optimizing, but the scale, algorithmic diversity, and lack of effective tools make this challenging. To overcome these inefficiencies, we propose Minos, a systematic classification mechanism that identifies similar application characteristics via low-cost profiling for power and performance. This allows us to group similarly behaving workloads into a finite number of distinct classes and reduce the overhead of extensively profiling new workloads. For example, when predicting frequency capping behavior for a previously unseen application, Minos reduces profiling time by 89%. Moreover, across 18 popular graph analytics, HPC, HPC+ML, and ML workloads, Minos achieves a mean error of 4% for power predictions and 3% for performance predictions, significantly improving predictions over state-of-the-art approaches by 10%.
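The grouping idea in the abstract can be illustrated with a minimal sketch. It is not the authors' method: the abstract does not say which features or which clustering algorithm Minos uses. The sketch assumes two hypothetical profile features per workload (GPU utilization and memory-bandwidth utilization), made-up power numbers, and plain k-means as a stand-in. A new workload is assigned to the nearest class and inherits that class's mean power as its prediction.

```python
from math import dist
from statistics import mean

# Hypothetical data: name -> ((gpu_util, mem_bw_util), measured mean power in W).
# Feature choice and all values are illustrative assumptions, not from the paper.
PROFILED = {
    "graph_a": ((0.30, 0.85), 180.0),
    "ml_a": ((0.90, 0.40), 290.0),
    "graph_b": ((0.35, 0.80), 190.0),
    "ml_b": ((0.95, 0.35), 300.0),
    "hpc_a": ((0.85, 0.45), 280.0),
    "hpc_b": ((0.25, 0.90), 170.0),
}


def nearest(point, centroids):
    """Return the index of the centroid closest to point (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: dist(point, centroids[i]))


def kmeans(points, k, iters=20):
    """Plain k-means, seeded deterministically with the first k points."""
    centroids = list(points[:k])
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centroids)].append(p)
        # Recompute each centroid as the per-dimension mean of its members;
        # keep the old centroid if a group ended up empty.
        centroids = [
            tuple(mean(dim) for dim in zip(*g)) if g else centroids[i]
            for i, g in enumerate(groups)
        ]
    return centroids


def predict_power(features, centroids):
    """Predict power for a new workload as the mean power of its nearest class."""
    cls = nearest(features, centroids)
    members = [p for f, p in PROFILED.values() if nearest(f, centroids) == cls]
    return mean(members)


def percent_error(predicted, measured):
    """Absolute prediction error as a percentage of the measured value."""
    return abs(predicted - measured) / measured * 100.0


centroids = kmeans([f for f, _ in PROFILED.values()], k=2)
predicted = predict_power((0.88, 0.42), centroids)  # cheaply profiled new workload
print(f"predicted power: {predicted:.1f} W")
```

The point of the sketch is the cost structure: only the cheap feature vector is collected for the new workload, while the expensive power measurements are reused from workloads already in the same class.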
Problem

Research questions and friction points this paper is trying to address.

GPU workloads
power constraints
performance optimization
workload classification
HPC clusters
Innovation

Methods, ideas, or system contributions that make the work stand out.

GPU workload classification
power-performance modeling
low-cost profiling
frequency capping prediction
HPC cluster optimization
Rutwik Jain
University of Wisconsin-Madison, USA
Yiwei Jiang
Worcester Polytechnic Institute
Medical Robotics, Computer Assisted Surgery, Computer Vision, Machine Learning
Matthew D. Sinclair
University of Wisconsin-Madison, USA
Shivaraman Venkataraman
University of Wisconsin-Madison, USA