🤖 AI Summary
This work addresses the challenges of low GPU utilization, request starvation, and memory overflow arising from concurrent adapter scheduling in distributed large language model (LLM) adapter serving. To tackle these issues, the paper proposes the first optimization framework that integrates digital twin technology with lightweight machine learning. The approach constructs a high-fidelity digital twin of the LLM adapter service, trains a low-overhead performance prediction model, and employs a greedy placement algorithm to deploy adapters on the minimum number of GPUs required to meet target throughput or latency constraints. Experimental results demonstrate that the framework achieves GPU throughput prediction errors below 5%, accelerates simulation by up to 90× compared to full-scale benchmarking, and substantially reduces the number of GPUs needed while improving overall resource efficiency.
📝 Abstract
Large Language Model (LLM) adapters enable low-cost model specialization, but introduce complex caching and scheduling challenges in distributed serving systems where hundreds of adapters must be hosted concurrently. While prior work has largely focused on latency minimization, resource efficiency through throughput maximization remains underexplored. This paper presents a data-driven pipeline that, for a given workload, computes an adapter placement that serves the workload with the minimum number of GPUs while avoiding request starvation and GPU memory errors. To that end, the approach identifies the maximum feasible throughput attainable on each GPU by leveraging accurate performance predictions learned from real serving behavior. The proposed pipeline integrates three components: (i) a Digital Twin (DT) tailored to LLM-adapter serving, (ii) a distilled machine learning (ML) model trained on DT-generated data, and (iii) a greedy placement algorithm that exploits ML-based performance estimates to maximize GPU efficiency. The DT emulates real system dynamics with high fidelity, achieving below 5% throughput estimation error while executing up to 90 times faster than full LLM benchmarking across both predictable and unpredictable workloads. The learned ML models further accelerate performance estimation with marginal accuracy degradation, enabling scalable optimization. Experimental results demonstrate that the pipeline substantially improves GPU efficiency by reducing the number of GPUs required to sustain target workloads. Beyond GPU efficiency, the pipeline can be adapted to alternative objectives, such as latency minimization, highlighting its versatility for future large-scale LLM serving infrastructures.
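The core placement idea above, using learned per-GPU throughput estimates to pack adapters onto as few GPUs as possible, can be illustrated with a small sketch. The paper does not publish its exact algorithm or API, so everything below is an assumption for illustration: a first-fit-decreasing greedy heuristic, fixed per-GPU capacity and memory budgets standing in for the ML model's predictions, and hypothetical names (`Adapter`, `GPU`, `greedy_place`).

```python
# Hedged sketch of greedy adapter placement (first-fit decreasing).
# The constant capacity_rps stands in for the ML model's predicted
# maximum feasible throughput per GPU; all names are illustrative.
from dataclasses import dataclass, field

@dataclass
class Adapter:
    name: str
    req_rate: float   # requests/s this adapter must sustain
    mem_mb: float     # GPU memory held by the adapter weights

@dataclass
class GPU:
    capacity_rps: float   # predicted max feasible throughput
    mem_budget_mb: float  # memory available for adapters
    adapters: list = field(default_factory=list)

    def fits(self, a: Adapter) -> bool:
        # Feasible only if both throughput and memory limits hold,
        # which is what rules out starvation and memory overflow here.
        load = sum(x.req_rate for x in self.adapters) + a.req_rate
        mem = sum(x.mem_mb for x in self.adapters) + a.mem_mb
        return load <= self.capacity_rps and mem <= self.mem_budget_mb

def greedy_place(adapters, capacity_rps=100.0, mem_budget_mb=4096.0):
    """Place every adapter while opening as few GPUs as possible."""
    gpus: list[GPU] = []
    # Placing the heaviest adapters first reduces fragmentation.
    for a in sorted(adapters, key=lambda x: x.req_rate, reverse=True):
        for g in gpus:
            if g.fits(a):
                g.adapters.append(a)
                break
        else:  # no existing GPU can host it: provision a new one
            g = GPU(capacity_rps, mem_budget_mb)
            g.adapters.append(a)
            gpus.append(g)
    return gpus
```

In this toy setting, four adapters with rates 60, 50, 40, and 30 req/s against a 100 req/s per-GPU budget pack onto two GPUs instead of four; the real pipeline would replace the static capacity check with a query to the distilled performance model.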