Memory Offloading for Large Language Model Inference with Latency SLO Guarantees

📅 2025-02-12
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Current host-memory offloading mechanisms in LLM inference lack Service-Level Objective (SLO) guarantees, leading to either SLO violations or suboptimal memory utilization. To address this, we propose the first dynamic memory offloading framework explicitly designed for latency-bound SLO constraints. Leveraging the deterministic computation time of decoder layers, our approach introduces a tunable "offloading interval" mechanism. It employs a two-stage strategy: offline range-based generation followed by online, iteration-level adaptive tuning, enabling SLO-driven memory scheduling. Evaluated under strict zero-violation SLO requirements, our method improves inference throughput by 1.85× over state-of-the-art baselines while maximizing host memory utilization, a combination previously unattained under SLO constraints.

📝 Abstract
Offloading large language model (LLM) state to host memory during inference promises to reduce operational costs by supporting larger models, longer inputs, and larger batch sizes. However, existing memory offloading mechanisms are not designed with latency service-level objectives (SLOs) in mind. As a result, they either cause frequent SLO violations or underutilize host memory, incurring economic loss and thus defeating the purpose of memory offloading. This paper presents Select-N, a latency-SLO-aware memory offloading system for LLM serving. A key challenge in designing Select-N is to reconcile the tension between meeting SLOs and maximizing host memory usage. Select-N overcomes it by exploiting a unique characteristic of modern LLMs: during serving, the computation time of each decoder layer is deterministic. Leveraging this, Select-N introduces the offloading interval, an internal tunable knob that captures the tradeoff between SLOs and host memory usage, thereby reducing the aforementioned challenge to picking an optimal offloading interval. Select-N then uses a two-stage approach to pick the offloading interval automatically: an offline stage generates the range of feasible offloading intervals, and an online stage adjusts the offloading interval at the granularity of an inference iteration based on runtime hardware status. Our evaluation shows that Select-N consistently meets SLOs and improves serving throughput over existing mechanisms by 1.85× by maximizing the use of host memory.
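The two-stage approach from the abstract can be sketched in a few lines. This is a hypothetical reconstruction from the abstract alone, not the paper's actual algorithm; all function names, the latency model, and the numbers are illustrative assumptions.

```python
# Hypothetical sketch of Select-N's two-stage offloading-interval selection,
# reconstructed from the abstract. All names and the latency model are
# illustrative assumptions, not the paper's implementation.

def offline_interval_range(layer_time_ms, transfer_time_ms, num_layers, slo_ms):
    """Stage 1 (offline): using the deterministic per-layer compute time,
    compute the range of offloading intervals that can meet the latency SLO.

    An interval of k means state is offloaded once every k decoder layers;
    smaller k offloads more (better host memory use) but risks the SLO.
    """
    feasible = []
    for k in range(1, num_layers + 1):
        offloads = -(-num_layers // k)  # ceil division: offloads per iteration
        # Transfers overlap with the next k-1 layers of compute; only the
        # non-overlapped portion adds to iteration latency.
        exposed = max(0.0, transfer_time_ms - layer_time_ms * (k - 1)) * offloads
        if num_layers * layer_time_ms + exposed <= slo_ms:
            feasible.append(k)
    return (min(feasible), max(feasible)) if feasible else None

def online_tune(interval, lo, hi, last_iter_ms, slo_ms, margin=0.9):
    """Stage 2 (online): per-iteration adjustment within the offline range.
    Offload more aggressively (smaller interval) when there is slack;
    back off when a measured iteration approaches the SLO."""
    if last_iter_ms > slo_ms * margin:
        return min(hi, interval + 1)  # less offloading, protect the SLO
    return max(lo, interval - 1)      # more offloading, use host memory
```

For example, with 32 decoder layers at 1 ms each, 3 ms per transfer, and a 40 ms SLO, the offline stage would report that intervals of 4 and above are feasible, and the online stage would then walk the interval within that range iteration by iteration.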
Problem

Research questions and friction points this paper is trying to address.

Memory offloading for LLM inference
Latency SLO guarantees in offloading
Optimizing host memory usage efficiently
Innovation

Methods, ideas, or system contributions that make the work stand out.

Latency-SLO-aware memory offloading
Deterministic decoder layer computation
Two-stage offloading interval optimization
Chenxiang Ma
Peking University
Zhisheng Ye
PhD @ School of Computer Science, Peking University
Distributed Systems, Resource Management, Large Language Models
Hanyu Zhao
Alibaba Group
Distributed Systems, Systems for AI
Zehua Yang
Peking University
Tianhao Fu
Peking University
Jiaxun Han
Peking University
Jie Zhang
Peking University
Yingwei Luo
Peking University
Xiaolin Wang
Professor of Computer Science, Peking University
Computer Architecture, Operating System, Memory System
Zhenlin Wang
Michigan Tech
Yong Li
Alibaba Cloud Computing
Diyu Zhou
Peking University