🤖 AI Summary
Addressing the dual challenges of stringent SLO guarantees and high throughput in LLM serving across the prefill and decode phases, this work identifies a fundamental trade-off: out-of-place compute partitioning meets per-phase SLOs but forfeits KV cache reuse, while in-place designs preserve cache sharing yet suffer low resource utilization and high scheduling overhead due to tight phase coupling. To resolve this, the authors propose Yoda, an LLM serving framework built on PD (phase-decoupled) multiplexing, which enables spatially adaptive, per-phase compute partitioning and in-place KV cache sharing on shared GPUs. Yoda incorporates adaptive gang scheduling, contention-free performance modeling, and SLO-aware request dispatching. Experiments demonstrate an average 5.1× throughput improvement over state-of-the-art baselines, with peak gains up to 17.5×, while consistently meeting SLO targets under complex, multi-turn workloads.
📝 Abstract
Modern LLM services demand high throughput and stringent SLO guarantees across two distinct inference phases (prefill and decode) and complex multi-turn workflows. However, current systems face a fundamental tradeoff: out-of-place compute partitioning enables per-phase SLO attainment, while in-place memory sharing maximizes throughput via KV cache reuse. Moreover, existing in-place compute partitioning suffers from low utilization and high overhead due to its phase-coupled design. We present Yoda, a new LLM serving framework that resolves this tension via PD multiplexing, enabling in-place yet phase-decoupled compute partitioning. Yoda leverages low-level GPU partitioning techniques to multiplex the prefill and decode phases spatially and adaptively on shared GPUs, while preserving in-place memory sharing. To fully exploit this multiplexing capability, Yoda introduces an adaptive gang-scheduling mechanism, a contention-free performance-modeling method, and an SLO-aware dispatching policy. Evaluation shows that Yoda achieves an average $5.1\times$ throughput improvement (up to $17.5\times$) over state-of-the-art baselines, while consistently meeting SLO targets under complex LLM workloads.
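To make the SLO-aware dispatching idea concrete, here is a minimal, hypothetical sketch: given GPU partitions with separate prefill and decode compute shares, route each request to a partition whose predicted time-to-first-token (TTFT) and time-per-output-token (TPOT) both meet the SLOs. The `Partition` fields, the linear latency models, and all constants below are illustrative assumptions, not the paper's actual design.

```python
# Toy illustration of SLO-aware dispatching across phase-decoupled GPU
# partitions. All names and models are hypothetical, for exposition only.
from dataclasses import dataclass

@dataclass
class Partition:
    name: str
    prefill_sms: int    # SMs allotted to the prefill phase (assumed)
    decode_sms: int     # SMs allotted to the decode phase (assumed)
    queued_tokens: int  # current prefill backlog, in tokens

def predict_ttft(p: Partition, prompt_tokens: int) -> float:
    # Toy contention-free model: TTFT grows with backlog plus prompt
    # length and shrinks with the prefill compute share.
    return (p.queued_tokens + prompt_tokens) / (50.0 * p.prefill_sms)

def predict_tpot(p: Partition) -> float:
    # Toy model: TPOT depends only on the decode compute share.
    return 1.0 / (2.0 * p.decode_sms)

def dispatch(partitions, prompt_tokens, ttft_slo, tpot_slo):
    # Among partitions whose predictions satisfy both SLOs, pick the
    # one with the lowest predicted TTFT; None means no partition can
    # currently meet the targets (the request should wait or be shed).
    feasible = [p for p in partitions
                if predict_ttft(p, prompt_tokens) <= ttft_slo
                and predict_tpot(p) <= tpot_slo]
    if not feasible:
        return None
    return min(feasible, key=lambda p: predict_ttft(p, prompt_tokens))

parts = [Partition("gpu0-a", prefill_sms=60, decode_sms=48, queued_tokens=4000),
         Partition("gpu0-b", prefill_sms=30, decode_sms=78, queued_tokens=500)]
choice = dispatch(parts, prompt_tokens=1000, ttft_slo=1.0, tpot_slo=0.05)
# The backlogged partition misses the TTFT target, so "gpu0-b" is chosen.
```

A real system would feed such a dispatcher with calibrated per-phase performance models rather than these fixed coefficients, and re-partition compute adaptively as load shifts between phases.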