Data-Driven Optimization of GPU Efficiency for Distributed LLM Adapter Serving

📅 2026-02-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenges of low GPU utilization, request starvation, and memory overflow arising from concurrent adapter scheduling in distributed large language model (LLM) adapter serving. To tackle these issues, the paper proposes the first optimization framework that integrates digital twin technology with lightweight machine learning. The approach constructs a high-fidelity digital twin of the LLM adapter service, trains a low-overhead performance prediction model, and employs a greedy placement algorithm to achieve efficient deployment using the minimal number of GPUs required to meet target throughput or latency constraints. Experimental results demonstrate that the framework achieves GPU throughput prediction errors below 5%, accelerates simulation by 90× compared to full-scale benchmarking, and substantially reduces the number of GPUs needed while improving overall resource efficiency.

📝 Abstract
Large Language Model (LLM) adapters enable low-cost model specialization, but introduce complex caching and scheduling challenges in distributed serving systems where hundreds of adapters must be hosted concurrently. While prior work has largely focused on latency minimization, resource efficiency through throughput maximization remains underexplored. This paper presents a data-driven pipeline that, for a given workload, computes an adapter placement that serves the workload with the minimum number of GPUs while avoiding request starvation and GPU memory errors. To that end, the approach identifies the maximum feasible throughput attainable on each GPU by leveraging accurate performance predictions learned from real serving behavior. The proposed pipeline integrates three components: (i) a Digital Twin (DT) tailored to LLM-adapter serving, (ii) a distilled machine learning (ML) model trained on DT-generated data, and (iii) a greedy placement algorithm that exploits ML-based performance estimates to maximize GPU efficiency. The DT emulates real system dynamics with high fidelity, achieving below 5% throughput estimation error while executing up to 90 times faster than full LLM benchmarking across both predictable and unpredictable workloads. The learned ML models further accelerate performance estimation with marginal accuracy degradation, enabling scalable optimization. Experimental results demonstrate that the pipeline substantially improves GPU efficiency by reducing the number of GPUs required to sustain target workloads. Beyond GPU efficiency, the pipeline can be adapted to alternative objectives, such as latency minimization, highlighting its versatility for future large-scale LLM serving infrastructures.
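The paper's placement step pairs ML-predicted per-GPU throughput capacity with a greedy assignment that opens new GPUs only when needed. The paper does not publish its algorithm here, so the following is a hypothetical first-fit sketch under assumed inputs: each adapter is a `(name, demand, memory)` triple, and `predict_capacity` stands in for the learned performance model that estimates a GPU's maximum feasible throughput.

```python
# Hypothetical sketch of greedy adapter placement: assign each adapter to the
# first GPU whose predicted throughput capacity and memory can absorb it,
# opening a new GPU otherwise. Names and signatures are illustrative, not
# taken from the paper.
from dataclasses import dataclass, field

@dataclass
class Gpu:
    capacity: float                        # predicted max throughput (req/s)
    memory: float                          # adapter memory budget (GB)
    adapters: list = field(default_factory=list)
    load: float = 0.0                      # assigned throughput demand
    used_mem: float = 0.0                  # assigned adapter memory

def greedy_placement(adapters, predict_capacity, gpu_memory):
    """adapters: iterable of (name, demand_req_s, mem_gb).
    Places largest demands first and returns the list of opened GPUs."""
    gpus = []
    for name, demand, mem in sorted(adapters, key=lambda a: -a[1]):
        for g in gpus:
            # first-fit: reuse an open GPU if load and memory both still fit
            if g.load + demand <= g.capacity and g.used_mem + mem <= g.memory:
                break
        else:
            # no open GPU fits; capacity comes from the ML performance model
            g = Gpu(capacity=predict_capacity(), memory=gpu_memory)
            gpus.append(g)
        g.adapters.append(name)
        g.load += demand
        g.used_mem += mem
    return gpus

# Toy usage: a constant predicted capacity of 100 req/s and 40 GB per GPU.
placement = greedy_placement(
    [("a1", 60, 10), ("a2", 50, 10), ("a3", 40, 10)],
    predict_capacity=lambda: 100.0,
    gpu_memory=40.0,
)
```

In this toy run, `a1` (60 req/s) and `a3` (40 req/s) share one GPU at exactly the predicted 100 req/s ceiling, while `a2` gets a second GPU, so the workload is served with two GPUs rather than three. Sorting by descending demand is a standard bin-packing heuristic (first-fit decreasing) and an assumption here, not a detail confirmed by the abstract.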
Problem

Research questions and friction points this paper is trying to address.

GPU efficiency
LLM adapter serving
distributed serving
resource optimization
throughput maximization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Digital Twin
LLM Adapter Serving
GPU Efficiency
Data-Driven Optimization
Throughput Maximization
Ferran Agullo
Barcelona Supercomputing Center (BSC), Barcelona, Spain; Universitat Politècnica de Catalunya - BarcelonaTech (UPC), Barcelona, Spain
Joan Oliveras
Barcelona Supercomputing Center (BSC), Barcelona, Spain; Universitat Politècnica de Catalunya - BarcelonaTech (UPC), Barcelona, Spain
Chen Wang
IBM Research
Container Cloud · Cloud Management · Management of large-scale VoD · Network Management and Control
Alberto Gutierrez-Torre
Universitat Politècnica de Catalunya (UPC) and Barcelona Supercomputing Center (BSC)
Machine Learning · Federated Learning · Data Analytics · IoT · Data Streaming
Olivier Tardieu
IBM Research, New York, USA
Alaa Youssef
Research Manager, IBM T.J. Watson Research Center
Cloud computing · Distributed systems
Jordi Torres
UPC Barcelona Tech - Barcelona Supercomputing Center
Supercomputing for Artificial Intelligence · Artificial Intelligence · Deep Learning · Reinforcement Learning · Supercomputing
Josep Ll. Berral
Barcelona Supercomputing Center (BSC), Barcelona, Spain; Universitat Politècnica de Catalunya - BarcelonaTech (UPC), Barcelona, Spain