Managing Multi Instance GPUs for High Throughput and Energy Savings

📅 2025-08-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the low throughput and poor energy efficiency of Multi-Instance GPU (MIG) under scientific computing and ML (including LLM) workloads, this paper proposes a dynamic resource scheduling framework. The method introduces, for the first time, joint dynamic memory estimation and adaptive MIG partition fusion/splitting, integrated with workload-aware scheduling and process lifecycle management (supporting checkpoint-based recovery and proactive restart optimization). Evaluated on NVIDIA A100 GPUs, the framework achieves up to 6.20× higher throughput and 5.93× better energy efficiency for scientific computing; 1.59× throughput and 1.12× energy-efficiency gains for general ML workloads; and 1.43× throughput and 1.11× energy-efficiency improvements for LLM inference. The core contribution is fine-grained, robust, and energy-aware coordination of MIG resources that significantly improves chip-level concurrency utilization while preserving hardware isolation guarantees.
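The "partition fusion/splitting" idea can be illustrated with a small policy over A100 MIG slice sizes. The slice ladder (1g/2g/3g/4g/7g) matches A100 GPU-instance profiles, but the per-slice memory figure, thresholds, and function names below are illustrative assumptions, not the paper's implementation; real MIG placement also has layout constraints beyond a simple sum-of-slices budget.

```python
# Illustrative sketch of slice selection and partition planning for an
# A100-style MIG GPU. Valid slice sizes are in GPU-instance "g" units;
# the A100 exposes 7g in total. All numbers here are assumptions.

VALID_SLICES = (1, 2, 3, 4, 7)  # A100 MIG GPU-instance profiles (in "g")

def pick_slice(est_mem_gb: float, mem_per_g: float = 5.0) -> int:
    """Choose the smallest valid slice whose memory fits the estimate."""
    needed = est_mem_gb / mem_per_g
    for s in VALID_SLICES:
        if s >= needed:
            return s
    raise ValueError(f"workload needs {est_mem_gb} GB, exceeds one GPU")

def plan_partitions(estimates_gb: list[float]) -> list[int]:
    """Greedily give each workload the smallest slice that fits its
    memory estimate, then shed the largest requests while the 7g budget
    is exceeded (a crude stand-in for fusion/splitting decisions)."""
    slices = [pick_slice(m) for m in estimates_gb]
    while sum(slices) > 7:
        slices.remove(max(slices))
    return sorted(slices, reverse=True)
```

For example, three workloads estimated at 4 GB, 12 GB, and 22 GB would first map to 1g, 3g, and 7g slices; since that exceeds one GPU, the 7g request is deferred and the remaining two run concurrently.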

📝 Abstract
Modern GPUs such as the Ampere series (A30, A100) and the Hopper series (H100, H200) offer both performance and security isolation features. They also support substantial concurrency, but taking advantage of it can be challenging due to the complex constraints on partitioning the chip. In this work, we develop partitioning and scheduling schemes for a variety of workloads, ranging from scientific to modern ML workloads, including LLMs. We develop several schemes involving dynamic memory estimation, partition fusion, and partition fission. We also support process restart to recover workloads from out-of-memory errors, with early restart as an optimization. This approach yields up to 6.20x throughput and 5.93x energy improvements for general workloads; we see 1.59x and 1.12x improvements to throughput and energy, respectively, for ML workloads on an A100 GPU. We apply this technique to LLM workloads and show good improvements, including up to 1.43x throughput improvement and 1.11x energy savings.
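One common way to realize "dynamic memory estimation" is to smooth observed peak usage per workload class and add a safety margin; the smoothing factor, margin, and class names below are assumptions for illustration, not values from the paper.

```python
# Illustrative dynamic memory estimator: exponential moving average (EMA)
# of observed peak GPU memory per workload class, plus a safety margin
# to reduce out-of-memory restarts. Parameter values are assumptions.

class MemEstimator:
    def __init__(self, alpha: float = 0.3, margin: float = 1.2):
        self.alpha = alpha      # EMA smoothing factor
        self.margin = margin    # over-provisioning safety factor
        self.ema: dict[str, float] = {}

    def observe(self, workload: str, peak_gb: float) -> None:
        """Fold a newly observed peak into the running estimate."""
        prev = self.ema.get(workload, peak_gb)
        self.ema[workload] = (1 - self.alpha) * prev + self.alpha * peak_gb

    def estimate(self, workload: str, default_gb: float = 10.0) -> float:
        """Return the padded estimate, or a padded default if unseen."""
        return self.margin * self.ema.get(workload, default_gb)
```

An estimate like this can feed the slice-selection step: a class whose observed peaks grow over time is gradually promoted to a larger MIG partition rather than repeatedly hitting out-of-memory errors.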
Problem

Research questions and friction points this paper is trying to address.

Optimizing GPU partitioning for diverse workloads
Enhancing throughput and energy efficiency
Managing concurrency and memory constraints effectively
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic memory estimation for GPU partitioning
Partition fusion and fission techniques
Process restart optimization for memory errors
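The restart-on-OOM idea above can be sketched as a retry loop that resumes a job from its last checkpoint on a progressively larger MIG slice. The control flow is the general technique, not the paper's code; the slice ladder, retry budget, and use of `MemoryError` as the failure signal are illustrative assumptions.

```python
# Illustrative OOM-recovery loop: run a job on a MIG slice, and on an
# out-of-memory failure restart it from its last checkpoint on the
# next-larger slice. All names and limits here are assumptions.

VALID_SLICES = (1, 2, 3, 4, 7)  # A100 MIG GPU-instance sizes (in "g")

def run_with_recovery(job, start_slice: int = 1, max_retries: int = 4):
    """Call job(slice_g, checkpoint); escalate slice size on MemoryError,
    carrying forward whatever checkpoint the failed attempt produced."""
    slice_idx = VALID_SLICES.index(start_slice)
    checkpoint = None
    for _ in range(max_retries + 1):
        try:
            return job(VALID_SLICES[slice_idx], checkpoint)
        except MemoryError as err:
            # Keep the most recent checkpoint, if the attempt attached one.
            checkpoint = getattr(err, "checkpoint", checkpoint)
            if slice_idx + 1 >= len(VALID_SLICES):
                raise  # already on the full GPU; give up
            slice_idx += 1  # restart on the next-larger MIG slice
    raise RuntimeError("retry budget exhausted")
```

An "early restart" optimization would trigger the same escalation proactively, as soon as the memory estimator predicts the current slice will be exceeded, rather than waiting for the allocation to actually fail.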