Optimizing LLM Inference Throughput via Memory-aware and SLA-constrained Dynamic Batching

📅 2025-03-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the challenge of balancing throughput and latency in large language model (LLM) inference under GPU memory constraints, this paper proposes a memory-aware and SLA-driven dynamic batching method. The approach continuously monitors GPU memory utilization and per-request latency feedback, enabling runtime adaptation of batch sizes via fine-grained resource modeling and elastic scheduling, thereby strictly satisfying service-level agreement (SLA) constraints. Compared to static batching, the method achieves 8–28% higher inference throughput while maintaining low latency, increases service capacity by 22%, and remains fully compatible with mainstream inference frameworks. Its core innovation lies in unifying memory-state awareness, latency-feedback control, and hard SLA constraint modeling within a single dynamic batching framework, yielding an efficient, robust, and production-deployable online optimization solution.
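
Neither the summary nor the abstract includes pseudocode, but the control loop they describe (monitor GPU memory and per-request latency, then adapt the batch size under a hard SLA bound) can be illustrated with a minimal Python sketch. Every name and threshold below (BatchSizeController, sla_latency_ms, the 0.70/0.90 utilization bounds) is an assumption for illustration, not the authors' implementation; see the linked repository for the real code.

```python
# Minimal sketch of an SLA-constrained, memory-aware batch-size controller.
# All names and thresholds are illustrative assumptions, not the paper's API.


class BatchSizeController:
    def __init__(self, min_batch=1, max_batch=64,
                 sla_latency_ms=200.0, mem_high=0.90, mem_low=0.70):
        self.batch_size = min_batch
        self.min_batch = min_batch
        self.max_batch = max_batch
        self.sla_latency_ms = sla_latency_ms   # hard per-request SLA target
        self.mem_high = mem_high               # back off above this utilization
        self.mem_low = mem_low                 # room to grow below this utilization

    def update(self, mem_utilization: float, p99_latency_ms: float) -> int:
        """Adjust the batch size from runtime feedback.

        mem_utilization: current GPU memory utilization in [0, 1]
        p99_latency_ms:  observed tail latency of recently completed requests
        """
        if mem_utilization > self.mem_high or p99_latency_ms > self.sla_latency_ms:
            # Memory pressure or SLA violation: shrink the batch aggressively.
            self.batch_size = max(self.min_batch, self.batch_size // 2)
        elif mem_utilization < self.mem_low and p99_latency_ms < 0.8 * self.sla_latency_ms:
            # Headroom on both memory and latency: grow additively.
            self.batch_size = min(self.max_batch, self.batch_size + 1)
        return self.batch_size


# Example: one control step with simulated feedback.
controller = BatchSizeController()
print(controller.update(mem_utilization=0.55, p99_latency_ms=120.0))  # -> 2
```

The multiplicative-decrease / additive-increase shape is borrowed from classic congestion control; the paper's elastic scheduling and fine-grained resource model are more sophisticated, so treat this only as a picture of the feedback loop.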

📝 Abstract
The increasing adoption of large language models (LLMs) necessitates inference serving systems that can deliver both high throughput and low latency. Deploying LLMs with hundreds of billions of parameters on memory-constrained GPUs exposes significant limitations in static batching methods. Current inference serving systems often treat batch sizes as fixed hyper-parameters, hindering real-time adaptation to varying system conditions. In this paper, we propose a dynamic batching method that continuously monitors memory utilization and adheres to service-level agreements (SLAs) to enable real-time batch size configuration adjustment. The method comprises two core components: a memory-aware batch scheduler that dynamically allocates GPU resources and a latency feedback mechanism that optimizes decoding processes under SLA constraints. The numerical experiments demonstrate throughput gains of 8% to 28% and capacity improvements of 22% compared to traditional static batching methods, while maintaining full compatibility with existing inference infrastructure. These results highlight the effectiveness of dynamic batching in balancing computational efficiency and quality-of-service requirements for contemporary LLM deployment scenarios. The source code of this work is publicly available at https://github.com/KevinLee1110/dynamic-batching.
Problem

Research questions and friction points this paper is trying to address.

How to maximize LLM inference throughput under GPU memory constraints.
How to adjust batch sizes in real time instead of fixing them as static hyper-parameters.
How to balance computational efficiency against SLA latency requirements.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic batching adjusts batch sizes in real-time.
Memory-aware scheduler optimizes GPU resource allocation (a sketch follows this list).
Latency feedback ensures SLA compliance during decoding.
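
A memory-aware scheduler of the kind described above needs a model of how much GPU memory each in-flight request will consume, which for LLM decoding is dominated by the KV cache. The sketch below shows one hypothetical admission check built on that idea; the layer/head dimensions, safety margin, and function names are assumptions, not values or code from the paper.

```python
# Hypothetical sketch of a KV-cache-based admission check for a memory-aware
# batch scheduler. Model dimensions and the admission rule are assumptions.

def kv_cache_bytes(seq_len: int, num_layers: int = 32, num_heads: int = 32,
                   head_dim: int = 128, dtype_bytes: int = 2) -> int:
    """Approximate KV-cache footprint of one request at a given sequence length.

    Two tensors (K and V) per layer, each of shape [seq_len, num_heads, head_dim].
    """
    return 2 * num_layers * seq_len * num_heads * head_dim * dtype_bytes


def can_admit(pending_seq_len: int, running_seq_lens: list[int],
              gpu_mem_bytes: int, weights_bytes: int,
              safety_margin: float = 0.10) -> bool:
    """Admit a new request only if its projected KV cache still fits in memory."""
    used = weights_bytes + sum(kv_cache_bytes(s) for s in running_seq_lens)
    budget = int(gpu_mem_bytes * (1.0 - safety_margin))
    return used + kv_cache_bytes(pending_seq_len) <= budget


# Example: an 80 GB GPU serving a ~14 GB (fp16, 7B-class) set of weights.
running = [1024, 2048, 512]
print(can_admit(4096, running, gpu_mem_bytes=80 * 1024**3,
                weights_bytes=14 * 1024**3))  # -> True
```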