Power Aware Dynamic Reallocation For Inference

📅 2026-01-18
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the critical bottleneck that power constraints impose on performance and cost efficiency in large language model inference. To this end, we propose RAPID, a novel framework that, for the first time, jointly optimizes static and dynamic GPU role assignment with power budget allocation. RAPID employs a power-aware disaggregated inference architecture, a dynamic resource scheduling algorithm, and adaptive power budget management to sustain high effective throughput under strict power limits. Experimental results show that RAPID achieves up to a 2× improvement in SLO attainment under peak load compared to static allocation strategies, improving application consistency without introducing additional system complexity or cost.

📝 Abstract
Disaggregation has emerged as a powerful strategy for optimizing large language model (LLM) inference by separating compute-intensive prefill and memory-bound decode phases across specialized GPUs. This separation improves utilization and throughput under fixed hardware capacity. However, as model and cluster scales grow, power, rather than compute, has become the dominant limiter of overall performance and cost efficiency. In this paper, we propose RAPID, a power-aware disaggregated inference framework that jointly manages GPU roles and power budgets to sustain goodput within strict power caps. RAPID utilizes static and dynamic power reallocation in addition to GPU reallocation to improve performance under fixed power bounds. RAPID improves overall performance and application consistency beyond what is achievable in current disaggregation solutions, delivering up to a 2× improvement in SLO attainment at peak load compared to static assignment, without an increase in complexity or cost.
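The abstract describes jointly managing GPU roles and power budgets under a fixed cluster power cap. The paper text is not included here, so the following is only a minimal illustrative sketch of the general idea (splitting a fixed power cap between prefill and decode pools in proportion to load); all names, the proportional heuristic, and the watt limits are assumptions, not RAPID's actual algorithm:

```python
# Sketch: divide a fixed cluster power cap between prefill and decode GPU
# pools in proportion to their current load, clamping each GPU to an assumed
# hardware power-limit range. Heuristic and constants are illustrative only.

MIN_W, MAX_W = 150.0, 400.0  # assumed per-GPU power-limit range (watts)

def reallocate_power(total_cap_w, n_prefill, n_decode,
                     prefill_load, decode_load):
    """Split total_cap_w across both pools proportionally to load and
    return per-GPU power limits (watts) for each pool."""
    load = prefill_load + decode_load
    prefill_share = total_cap_w * (prefill_load / load) if load else total_cap_w / 2
    decode_share = total_cap_w - prefill_share

    def per_gpu(share, n):
        if n == 0:
            return []
        # Clamp to hardware limits; a real controller would redistribute
        # any clamped remainder to stay exactly at the cluster cap.
        w = max(MIN_W, min(MAX_W, share / n))
        return [w] * n

    return per_gpu(prefill_share, n_prefill), per_gpu(decode_share, n_decode)
```

For example, with a 2000 W cap, four GPUs per pool, and prefill carrying 75% of the load, each prefill GPU gets 375 W while decode GPUs are clamped up to the 150 W floor. In practice such limits would be applied via the GPU driver's power-management interface.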
Problem

Research questions and friction points this paper is trying to address.

power-aware
disaggregated inference
LLM inference
power budget
goodput
Innovation

Methods, ideas, or system contributions that make the work stand out.

power-aware
disaggregated inference
dynamic reallocation
LLM inference
GPU power management
🔎 Similar Papers
No similar papers found.
Yiwei Jiang
Worcester Polytechnic Institute
Medical Robotics · Computer Assisted Surgery · Computer Vision · Machine Learning
Sangeeta Chowdhary
AMD Research and Advanced Development
Nathaniel Morris
AMD Research and Advanced Development
Rutwik Jain
AMD Research and Advanced Development; Department of Computer Sciences, University of Wisconsin-Madison, Madison, Wisconsin, USA
Srilatha Manne
AMD Research and Advanced Development
Sam Bayliss
University of Glasgow