Threshold-Based Exclusive Batching for LLM Inference

📅 2026-05-30

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the loss of inference efficiency that mixed batching (MB) suffers on bandwidth-constrained GPUs, where interference between the prefill and decode phases raises per-step cost, particularly under dynamic workloads. It derives a closed-form condition for the performance crossover between exclusive batching (EB) and MB, along with asymptotically optimal phase-switching thresholds and memory-safe batch sizing for EB. The crossover depends on GPU memory bandwidth, model size, and workload composition. Building on this condition, it introduces EB+, a hybrid scheduler that switches between EB and MB online without manual intervention. Experiments show that optimized EB improves throughput by up to 41.9% on bandwidth-constrained GPUs, and that EB+ attains the highest or near-highest throughput under non-stationary traffic, outperforming MB by up to 36.4%.

📝 Abstract

Mixed batching (MB), which interleaves prefill and decode in a single batch, has become the standard scheduling strategy for large language model (LLM) inference because it keeps both compute and memory highly utilized. However, through controlled experiments, we find that prefill-decode interference inflates MB's per-step marginal cost above that of pure decode. On the high-bandwidth H200 (4.8 TB/s), this occurs only when decode tokens exceed 80% of the batch; on the bandwidth-constrained RTX PRO 6000 (1.792 TB/s), the threshold drops to just 20%. Consequently, the optimal choice between MB and exclusive batching (EB) depends on GPU memory bandwidth, model size, and workload composition. We derive a closed-form condition for this EB-MB performance crossover, along with asymptotically optimal phase-switching thresholds and memory-safe batch sizing for EB. Optimized EB achieves up to 41.9% higher throughput on bandwidth-constrained GPUs, while MB retains its advantage on high-bandwidth hardware with larger models. Our hybrid scheduler EB+ applies this condition online to switch dynamically between EB and MB without manual intervention. Under non-stationary traffic with distribution or concurrency shifts, EB+ attains the highest or near-highest throughput in every setting, outperforming MB by up to 36.4%.
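
To make the decode-share thresholds quoted above concrete, here is a minimal Python sketch of a per-batch mode check. It is not the paper's closed-form condition, which the abstract does not state. The function names, the lookup table, and the rule "prefer EB once the decode share exceeds the threshold" are all illustrative assumptions. Only the two threshold values (80% on H200, 20% on RTX PRO 6000) come from the abstract.

```python
# Illustrative sketch only. The thresholds are the two measured points quoted
# in the abstract; everything else (names, decision rule) is hypothetical and
# stands in for the paper's closed-form condition, which is not given here.
DECODE_SHARE_THRESHOLD = {
    "H200": 0.80,          # 4.8 TB/s memory bandwidth
    "RTX PRO 6000": 0.20,  # 1.792 TB/s memory bandwidth
}


def decode_share(decode_tokens: int, prefill_tokens: int) -> float:
    """Fraction of the batch's tokens that belong to decode requests."""
    total = decode_tokens + prefill_tokens
    if total == 0:
        raise ValueError("batch is empty")
    return decode_tokens / total


def interference_dominates(gpu: str, decode_tokens: int, prefill_tokens: int) -> bool:
    """True when the decode share exceeds the GPU's quoted threshold, i.e. the
    regime where MB's per-step marginal cost is reported to exceed pure decode."""
    return decode_share(decode_tokens, prefill_tokens) > DECODE_SHARE_THRESHOLD[gpu]


def pick_mode(gpu: str, decode_tokens: int, prefill_tokens: int) -> str:
    """Assumed rule: use exclusive batching (EB) when interference dominates,
    otherwise keep mixed batching (MB)."""
    return "EB" if interference_dominates(gpu, decode_tokens, prefill_tokens) else "MB"


if __name__ == "__main__":
    # Same batch composition (half decode, half prefill) on both GPUs.
    for gpu in DECODE_SHARE_THRESHOLD:
        print(gpu, pick_mode(gpu, decode_tokens=512, prefill_tokens=512))
```

With a batch that is half decode tokens, this sketch keeps MB on the H200 and switches to EB on the RTX PRO 6000, which mirrors the bandwidth dependence the abstract describes. A real scheduler would also need to account for model size, which this sketch ignores.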

Problem

Research questions and friction points this paper is trying to address.

LLM inference

mixed batching

exclusive batching

memory bandwidth

scheduling strategy

Innovation

Methods, ideas, or system contributions that make the work stand out.

Exclusive Batching

Mixed Batching

LLM Inference Scheduling