Scaling Up Throughput-oriented LLM Inference Applications on Heterogeneous Opportunistic GPU Clusters with Pervasive Context Management

📅 2025-09-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the long queueing delays, high hardware costs, and low cluster utilization caused by static resource allocation in LLM inference, this paper proposes a dynamic scheduling framework tailored to throughput-oriented workloads on heterogeneous, opportunistic GPU clusters. The core contribution is a general-purpose context management mechanism that enables seamless migration and reuse of LLM inference state across unstable resources, achieved through context partitioning, caching, cross-node state recovery, and consistency maintenance. This mechanism supports fine-grained, elastic resource scaling without up-front static allocation. Evaluation on opportunistic GPU clusters shows that the approach reduces task execution time by 98.1% while significantly improving resource utilization and inference throughput.
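The summary does not describe the mechanism at code level, but a minimal sketch helps illustrate what context partitioning, caching, cross-node recovery, and consistency maintenance could look like in practice. Everything below — the ContextStore/ContextPartition names, the pickle-based serialization, and the checksum-based consistency check — is an illustrative assumption, not the paper's actual implementation.

```python
# Hypothetical sketch of pervasive context management: partition an LLM
# inference context, cache the partitions, and recover them on another node.
import hashlib
import pickle
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ContextPartition:
    """One migratable slice of an inference context (e.g., a KV-cache shard)."""
    context_id: str
    index: int
    payload: bytes          # serialized tensors / token state
    checksum: str           # used for consistency checks after migration


@dataclass
class ContextStore:
    """In-memory stand-in for a shared cache reachable from every node."""
    partitions: Dict[str, List[ContextPartition]] = field(default_factory=dict)

    def put(self, parts: List[ContextPartition]) -> None:
        self.partitions[parts[0].context_id] = parts

    def get(self, context_id: str) -> List[ContextPartition]:
        return self.partitions[context_id]


def partition_context(context_id: str, state: dict, num_parts: int) -> List[ContextPartition]:
    """Split a context dict into a fixed number of partitions for elastic placement."""
    items = sorted(state.items())
    chunks = [items[i::num_parts] for i in range(num_parts)]
    parts = []
    for idx, chunk in enumerate(chunks):
        payload = pickle.dumps(dict(chunk))
        parts.append(ContextPartition(
            context_id=context_id,
            index=idx,
            payload=payload,
            checksum=hashlib.sha256(payload).hexdigest(),
        ))
    return parts


def recover_context(store: ContextStore, context_id: str) -> dict:
    """Rebuild a context on a new node, verifying each partition's checksum."""
    state: dict = {}
    for part in sorted(store.get(context_id), key=lambda p: p.index):
        if hashlib.sha256(part.payload).hexdigest() != part.checksum:
            raise ValueError(f"Partition {part.index} of {context_id} is inconsistent")
        state.update(pickle.loads(part.payload))
    return state


if __name__ == "__main__":
    store = ContextStore()
    # Toy stand-in for the per-layer KV state of an in-flight request.
    kv_state = {f"layer_{i}": [float(i)] * 4 for i in range(8)}
    store.put(partition_context("req-42", kv_state, num_parts=3))
    # A different (opportunistic) node picks the request back up.
    assert recover_context(store, "req-42") == kv_state
```

In a real deployment the payloads would hold GPU tensors and the store would be a shared or distributed cache rather than a local dict, but the partition/cache/recover/verify cycle sketched here is the shape of state reuse the summary describes.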

📝 Abstract
The widespread growth in LLM development increasingly demands more computational power from clusters than they can supply. Traditional LLM applications inherently require huge static resource allocations, which force users either to wait in a long job queue and accept progress delays, or to buy expensive hardware to fulfill their needs, exacerbating the demand-supply problem. However, not all LLM applications are latency-sensitive; these can instead be executed in a throughput-oriented way. This throughput orientation allows a dynamic allocation that opportunistically pools available resources over time, avoiding both the long queue and expensive GPU purchases. Nevertheless, effectively utilizing opportunistic resources brings numerous challenges. Our solution, pervasive context management, exploits the common computational context in LLM applications and provides mechanisms and policies that allow seamless context reuse on opportunistic resources. Our evaluation shows that an LLM application with pervasive context management on opportunistic resources reduces its execution time by 98.1%.
Problem

Research questions and friction points this paper is trying to address.

Optimizing throughput-oriented LLM inference on heterogeneous GPU clusters
Reducing execution delays and expensive hardware requirements
Enabling dynamic resource allocation through context reuse
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic resource allocation on opportunistic GPU clusters
Pervasive context management for computational reuse
Throughput-oriented execution that relaxes latency constraints (see the sketch after this list)
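To make the innovation points above concrete, here is a hedged sketch of throughput-oriented dispatch over an opportunistic GPU pool: work is batched, assigned to whichever GPUs are currently available, and requeued when a GPU is reclaimed. The GpuSlot abstraction, the revocation probabilities, and the dispatch loop are illustrative assumptions, not the paper's scheduler.

```python
# Hypothetical sketch of throughput-oriented dispatch on opportunistic GPUs.
import random
from collections import deque
from dataclasses import dataclass


@dataclass
class GpuSlot:
    """One opportunistically available GPU; its owner may reclaim it at any time."""
    name: str
    available: bool = True


def run_batch(slot: GpuSlot, batch: list) -> bool:
    """Pretend to run one inference batch; occasionally the GPU is reclaimed mid-run."""
    if random.random() < 0.1:
        slot.available = False          # revoked before the batch finished
        return False
    print(f"{slot.name}: processed batch of {len(batch)} prompts")
    return True


def dispatch(prompts: list, slots: list, batch_size: int = 4) -> None:
    """Drain the prompt queue using whichever GPUs happen to be available."""
    queue = deque(prompts[i:i + batch_size] for i in range(0, len(prompts), batch_size))
    while queue:
        # Simulate opportunistic churn: slots come and go over time.
        for s in slots:
            s.available = random.random() > 0.2
        free = [s for s in slots if s.available]
        if not free:
            continue                    # no capacity right now; try again later
        batch = queue.popleft()
        if not run_batch(free[0], batch):
            # GPU revoked mid-batch: requeue; in the real system the cached
            # context would let another node resume without recomputation.
            queue.appendleft(batch)


if __name__ == "__main__":
    dispatch([f"prompt-{i}" for i in range(16)],
             [GpuSlot("nodeA:gpu0"), GpuSlot("nodeB:gpu1")])
```

Because the workload is throughput-oriented, the loop never blocks on any single request's latency; it simply keeps the available GPUs busy and relies on cached context (as in the earlier sketch) to absorb revocations.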