Breaking the Boundaries of Long-Context LLM Inference: Adaptive KV Management on a Single Commodity GPU

📅 2025-06-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the KV cache memory constraints and the high cross-tier offloading overhead of long-context LLM inference on a single consumer-grade GPU, where limited disk bandwidth and expensive token-importance estimation are the main bottlenecks, this paper proposes LeoAM, an adaptive three-tier GPU-CPU-disk KV cache management system. Its core innovations are: (1) a variable-length KV chunking strategy driven by the skewed distribution of attention weights; (2) a lightweight per-chunk KV abstract (summarization) mechanism; and (3) importance-aware dynamic compression coupled with pipelined offloading. Together, these techniques reduce I/O and computational redundancy while preserving output quality. Experiments show that LeoAM achieves an average inference latency speedup of 3.46× over state-of-the-art methods, and up to 5.47× in large-batch scenarios, enabling efficient, privacy-preserving inference over contexts exceeding 100K tokens on a single consumer-grade GPU.
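The variable-length chunking idea can be illustrated with a short sketch: partition the token sequence so that each chunk accumulates roughly equal attention mass, which makes chunks small where attention is concentrated and large over long low-attention stretches. This is a minimal sketch under that assumption; the function name `partition_by_attention_mass` and the thresholds are hypothetical and do not reproduce LeoAM's actual chunking rule.

```python
import numpy as np

def partition_by_attention_mass(attn_weights, target_mass=0.05,
                                min_len=16, max_len=256):
    """Split a token sequence into variable-length KV chunks.

    attn_weights: 1-D array of per-token attention mass for one layer,
                  assumed to be normalized to sum to 1.
    Returns a list of (start, end) index pairs covering the sequence.
    """
    chunks, start, mass = [], 0, 0.0
    for i, w in enumerate(attn_weights):
        mass += w
        length = i - start + 1
        # Close the chunk once it holds enough attention mass (but not
        # before min_len) or once it grows past max_len.
        if (mass >= target_mass and length >= min_len) or length >= max_len:
            chunks.append((start, i + 1))
            start, mass = i + 1, 0.0
    if start < len(attn_weights):
        chunks.append((start, len(attn_weights)))
    return chunks

# Example: skewed attention -> small chunks near the hot region.
rng = np.random.default_rng(0)
weights = rng.random(1024)
weights[:64] *= 20.0          # pretend the first tokens attract most attention
weights /= weights.sum()
print(partition_by_attention_mass(weights)[:5])
```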

📝 Abstract
Advanced Large Language Models (LLMs) have achieved impressive performance across a wide range of complex and long-context natural language tasks. However, performing long-context LLM inference locally on a commodity GPU (e.g., in a PC) under privacy constraints remains challenging due to the increasing memory demands of the key-value (KV) cache. Existing systems typically identify important tokens and selectively offload their KV data to GPU and CPU memory. On a commodity GPU with limited memory, KV data must also be offloaded to disk, but this process is bottlenecked by token importance evaluation overhead and the disk's low bandwidth. In this paper, we present LeoAM, the first efficient importance-aware long-context LLM inference system for a single commodity GPU with adaptive hierarchical GPU-CPU-Disk KV management. Our system employs an adaptive KV management strategy that partitions KV data into variable-sized chunks based on the skewed distribution of attention weights across different layers, reducing computational and additional transmission overheads. Moreover, we propose a lightweight KV abstract method, which minimizes transmission latency by storing and extracting the KV abstract of each chunk on disk instead of the full KV data. LeoAM also leverages dynamic compression and pipelining techniques to further accelerate inference. Experimental results demonstrate that LeoAM achieves an average inference latency speedup of 3.46x, while maintaining comparable LLM response quality. In scenarios with larger batch sizes, it achieves up to a 5.47x speedup.
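To make the "KV abstract" idea concrete, here is a minimal sketch assuming the abstract is a per-channel min/max summary of each chunk's keys, a common way to upper-bound query-key scores in selective-attention systems. The abstract of the paper only says that a compact per-chunk abstract is stored on disk in place of the full KV data, so the functions below (`build_kv_abstract`, `estimate_chunk_score`) are illustrative assumptions, not LeoAM's exact format.

```python
import numpy as np

def build_kv_abstract(keys):
    """Build a tiny per-chunk abstract: per-channel min/max of the keys.

    keys: (chunk_len, head_dim) array for one chunk of one attention head.
    """
    return keys.min(axis=0), keys.max(axis=0)

def estimate_chunk_score(query, abstract):
    """Upper-bound the attention logit of any key in the chunk.

    For each channel, pick whichever of (min, max) maximizes q_d * k_d;
    the sum gives an optimistic score used to rank chunks before
    deciding which ones to fetch from disk.
    """
    k_min, k_max = abstract
    return np.sum(np.maximum(query * k_min, query * k_max))

# Rank chunks using only their abstracts; fetch full KV for the top ones.
query = np.random.randn(128)
abstracts = [build_kv_abstract(np.random.randn(64, 128)) for _ in range(32)]
scores = [estimate_chunk_score(query, a) for a in abstracts]
top_chunks = np.argsort(scores)[-4:]   # only these chunks are loaded from disk
print(top_chunks)
```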
Problem

Research questions and friction points this paper is trying to address.

Efficient long-context LLM inference on a single commodity GPU
Reducing KV cache memory demands for local, privacy-preserving processing
Overcoming disk-bandwidth and token-importance-evaluation bottlenecks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adaptive hierarchical GPU-CPU-Disk KV management
Lightweight KV abstract method for disk-resident chunks
Dynamic compression and pipelining to further accelerate inference (see the sketch after this list)
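The sketch below shows one way compression and pipelining could fit together: KV chunks are kept compressed on disk, a background thread prefetches the next chunk while the current one is processed, and per-chunk partial results are merged with a streaming softmax. The file paths, the zlib+pickle on-disk format, and the single-thread prefetcher are assumptions for illustration, not LeoAM's actual implementation.

```python
import concurrent.futures as cf
import pickle, zlib
import numpy as np

def load_chunk_from_disk(path):
    """Read and decompress one (keys, values) chunk; paths/format are hypothetical."""
    with open(path, "rb") as f:
        return pickle.loads(zlib.decompress(f.read()))

def attend_over_offloaded_chunks(query, chunk_paths):
    """Stream attention over disk-resident KV chunks with prefetching.

    While the current chunk is being processed, a background thread loads
    the next one, so disk latency overlaps with compute. Partial results
    are merged with an online (streaming) softmax.
    """
    acc, denom, running_max = 0.0, 0.0, -np.inf
    with cf.ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(load_chunk_from_disk, chunk_paths[0])
        for i in range(len(chunk_paths)):
            keys, values = future.result()
            if i + 1 < len(chunk_paths):              # pipeline: prefetch next chunk
                future = pool.submit(load_chunk_from_disk, chunk_paths[i + 1])
            logits = keys @ query                      # (chunk_len,)
            new_max = max(running_max, logits.max())
            scale = np.exp(running_max - new_max)      # rescale previous partials
            probs = np.exp(logits - new_max)
            acc = acc * scale + probs @ values
            denom = denom * scale + probs.sum()
            running_max = new_max
    return acc / denom
```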
He Sun
Department of Computer Science and Technology & Suzhou Institute for Advanced Research, University of Science and Technology of China, Hefei, China
Li Li
IOTSC, University of Macau, Macau, China
Mingjun Xiao
University of Science and Technology of China
Research interests: Mobile Computing, Crowdsensing, Mobile Social Networks, Vehicular Networks
Chengzhong Xu
IOTSC, University of Macau, Macau, China