Predictable LLM Serving on GPU Clusters

📅 2025-08-27
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address high tail-latency variability and frequent SLO violations in LLM inference on shared A100 clusters—caused by PCIe interference from noisy neighbors—this paper proposes a lightweight, architecture-agnostic host-level controller. The method integrates dynamic MIG reconfiguration, PCIe topology-aware scheduling, MPS resource quotas, and cgroup-based I/O isolation, augmented by a feedback control loop with built-in cool-down gating to ensure end-to-end SLO compliance without modifying models or inference frameworks. Evaluated on single-node and 16-GPU cluster deployments, the controller reduces SLO violation rates by 32%, improves p99 latency by 15%, and incurs ≤5% throughput overhead. In vLLM + OLMo 7B workloads, it achieves a 10–15% improvement in p99 time-to-first-token (TTFT). The approach delivers robust, low-overhead SLO guarantees for multi-tenant GPU inference under PCIe contention.
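The summary's feedback loop with cool-down gating can be sketched as follows. This is an illustrative toy, not the paper's implementation; the class and parameter names are hypothetical:

```python
class CooldownController:
    """Toy feedback loop: react to p99 SLO misses, but gate
    reconfiguration actions with a cool-down window so the
    controller does not thrash (names are illustrative)."""

    def __init__(self, slo_ms: float, cooldown_s: float):
        self.slo_ms = slo_ms            # per-tenant p99 latency target
        self.cooldown_s = cooldown_s    # minimum time between actions
        self.last_action = -float("inf")
        self.actions = 0

    def step(self, p99_ms: float, now: float) -> bool:
        """Sample the current tail latency; return True if a
        reconfiguration (e.g. MIG re-partition or placement change)
        should fire at time `now`."""
        if p99_ms > self.slo_ms and now - self.last_action >= self.cooldown_s:
            self.last_action = now
            self.actions += 1
            return True
        return False
```

A controller like this would be driven by periodic sampling of per-tenant tail latencies, with the cool-down preventing back-to-back MIG reconfigurations whose transition cost would itself inflate the tail.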

📝 Abstract
Latency-sensitive inference on shared A100 clusters often suffers noisy-neighbor interference on the PCIe fabric, inflating tail latency and SLO violations. We present a fabric-agnostic, VM-deployable host-level controller that combines dynamic Multi-Instance GPU (MIG) reconfiguration, PCIe-aware placement, and lightweight guardrails (MPS quotas, cgroup I/O). It samples per-tenant tails and system signals, uses topology hints to avoid PCIe hot spots, and gates actions with dwell/cool-down to avoid thrash. On a single host and a 2-node (16-GPU) cluster, SLO miss-rate is reduced by ≈32% (≈1.5×) and p99 latency improves ≈15% with ≤5% throughput cost versus static MIG and naive placement; ablations show MIG and placement contribute comparably. We also evaluate LLM serving with vLLM on OLMo 2 7B Instruct: TTFT p99 improves ≈10–15% at ≤5% cost without changing the controller.
Problem

Research questions and friction points this paper is trying to address.

Reducing noisy-neighbor interference on PCIe fabric
Improving tail latency and SLO violations in LLM serving
Optimizing GPU resource allocation through dynamic reconfiguration
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic MIG reconfiguration for GPU clusters
PCIe-aware placement to avoid interference
Lightweight guardrails with quotas and cgroups
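The "lightweight guardrails" above can be illustrated with the standard host-level knobs they name. The sketch below is an assumption about how such quotas might be expressed, not the paper's code; the function names are hypothetical, while `CUDA_MPS_ACTIVE_THREAD_PERCENTAGE` and the cgroup v2 `io.max` format are real interfaces:

```python
import os

def mps_quota_env(active_thread_pct: int) -> dict:
    """Build an environment for a tenant process whose SM share is
    capped under CUDA MPS via CUDA_MPS_ACTIVE_THREAD_PERCENTAGE
    (an NVIDIA-documented variable; the cap is per MPS client)."""
    assert 0 < active_thread_pct <= 100
    env = dict(os.environ)
    env["CUDA_MPS_ACTIVE_THREAD_PERCENTAGE"] = str(active_thread_pct)
    return env

def cgroup_io_max_line(major: int, minor: int, rbps: int, wbps: int) -> str:
    """Format a cgroup v2 io.max entry throttling read/write bandwidth
    for one block device; a privileged agent would write this line to
    /sys/fs/cgroup/<group>/io.max."""
    return f"{major}:{minor} rbps={rbps} wbps={wbps}"
```

For example, launching a tenant with `mps_quota_env(40)` would cap it at roughly 40% of the GPU's SMs, while an `io.max` line limits its host I/O bandwidth without touching the model or serving framework.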
Erfan Darzi
Harvard University, MIT
Shreeanant Bharadwaj
Northeastern University
Sree Bhargavi Balija
UC San Diego
Conformal Predictions · Interpretability · Federated learning