🤖 AI Summary
To address long queueing delays, high hardware costs, and low cluster utilization caused by static resource allocation in LLM inference, this paper proposes a dynamic scheduling framework tailored to throughput-oriented workloads on heterogeneous, ephemeral GPU clusters. The core contribution is a general-purpose context management mechanism that enables seamless migration and reuse of LLM inference state across unstable resources, achieved through context partitioning, caching, cross-node state recovery, and consistency maintenance. This mechanism supports fine-grained, elastic resource scaling without preallocating resources. Experimental evaluation on ephemeral GPU clusters demonstrates that the approach reduces task execution time by 98.1% while significantly improving resource utilization and inference throughput.
📝 Abstract
The rapid growth of LLM development increasingly demands more computational power than clusters can supply. Traditional LLM applications inherently require large static resource allocations, which force users either to wait in a long job queue and accept delayed progress, or to buy expensive hardware, exacerbating the demand-supply problem. Not all LLM applications are latency-sensitive, however; many can instead be executed in a throughput-oriented way. This throughput orientation allows a dynamic allocation that opportunistically pools available resources over time, avoiding both the long queue and expensive GPU purchases. Effectively utilizing opportunistic resources nevertheless brings numerous challenges. Our solution, pervasive context management, exploits the common computational context in LLM applications and provides mechanisms and policies that allow seamless context reuse on opportunistic resources. Our evaluation shows that an LLM application with pervasive context management on opportunistic resources reduces its execution time by 98.1%.