🤖 AI Summary
To address long queueing delays, high hardware costs, and low cluster utilization caused by static resource allocation in LLM inference, this paper proposes a dynamic scheduling framework tailored to throughput-oriented workloads on heterogeneous, ephemeral GPU clusters. The core contribution is a general-purpose context management mechanism that enables seamless migration and reuse of LLM inference state across unstable resources, achieved through context partitioning, caching, cross-node state recovery, and consistency maintenance. This mechanism supports fine-grained, elastic resource scaling without preallocating resources. Experimental evaluation on ephemeral GPU clusters demonstrates that the approach reduces task execution time by 98.1% while significantly improving resource utilization and inference throughput.
📝 Abstract
The rapid growth of LLM development increasingly demands more computational power than clusters can supply. Traditional LLM applications inherently require large static resource allocations, which force users either to wait in a long job queue and accept delayed progress, or to buy expensive hardware, exacerbating the demand-supply problem. Not all LLM applications are latency-sensitive, however; many can instead be executed in a throughput-oriented way. This throughput orientation allows a dynamic allocation that opportunistically pools available resources over time, avoiding both the long queue and expensive GPU purchases. Effectively utilizing opportunistic resources nevertheless brings numerous challenges. Our solution, pervasive context management, exploits the common computational context in LLM applications and provides mechanisms and policies that allow seamless context reuse on opportunistic resources. Our evaluation shows that an LLM application with pervasive context management on opportunistic resources reduces its execution time by 98.1%.