🤖 AI Summary
Long-context LLM inference on a single consumer-grade GPU is constrained by KV cache memory and by high cross-tier offloading overhead, caused in particular by limited disk bandwidth and expensive importance estimation. To address this, the paper proposes LeoAM, an adaptive GPU-CPU-disk three-tier KV cache management system. Its core innovations are: (1) a variable-length KV chunking strategy based on the attention weight distribution; (2) a lightweight KV summarization (KV abstract) mechanism; and (3) importance-aware dynamic compression coupled with pipelined offloading. Together, these techniques reduce I/O and redundant computation while preserving output quality. Experiments show that LeoAM achieves an average inference latency speedup of 3.46× over state-of-the-art methods, and up to 5.47× in large-batch scenarios. Notably, it enables efficient and privacy-preserving inference on context lengths exceeding 100K tokens using only a single consumer-grade GPU.
📝 Abstract
Advanced Large Language Models (LLMs) have achieved impressive performance across a wide range of complex and long-context natural language tasks. However, running long-context LLM inference locally on a PC with a single commodity GPU, which is often desirable for privacy reasons, remains challenging due to the growing memory demands of the key-value (KV) cache. Existing systems typically identify important tokens and selectively place their KV data in GPU and CPU memory. Because memory on a commodity GPU is limited, KV data must also be offloaded to disk, but this process is bottlenecked by the overhead of token importance evaluation and by the disk's low bandwidth. In this paper, we present LeoAM, the first efficient importance-aware long-context LLM inference system for a single commodity GPU with adaptive hierarchical GPU-CPU-Disk KV management. Our system employs an adaptive KV management strategy that partitions KV data into variable-sized chunks based on the skewed distribution of attention weights across different layers, reducing both computational overhead and additional transmission overhead. Moreover, we propose a lightweight KV abstract method, which minimizes transmission latency by storing and extracting the KV abstract of each chunk on disk instead of the full KV data. LeoAM also leverages dynamic compression and pipelining techniques to further accelerate inference. Experimental results demonstrate that LeoAM achieves an average inference latency speedup of 3.46x while maintaining comparable LLM response quality. In scenarios with larger batch sizes, it achieves up to a 5.47x speedup.
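The abstract does not say what a chunk's KV abstract contains or how it is scored, so the following is only an illustrative sketch of the general idea: keep a small per-chunk summary, rank chunks from the summaries alone, and fetch full KV data only for the top-ranked chunks. It assumes the summary is the elementwise minimum and maximum of a chunk's key vectors, a bounding technique used in prior query-aware KV selection work, which may differ from LeoAM's actual method. The names `chunk_abstract`, `score_upper_bound`, and `select_chunks` are hypothetical.

```python
from typing import List, Tuple

Vector = List[float]
Abstract = Tuple[Vector, Vector]  # (elementwise min, elementwise max) of a chunk's keys


def chunk_abstract(keys: List[Vector]) -> Abstract:
    """Summarize a chunk of key vectors by per-dimension min and max.

    Assumption: this min/max summary stands in for the paper's KV abstract,
    whose real form is not given in the abstract.
    """
    dims = range(len(keys[0]))
    lo = [min(k[d] for k in keys) for d in dims]
    hi = [max(k[d] for k in keys) for d in dims]
    return lo, hi


def score_upper_bound(query: Vector, abstract: Abstract) -> float:
    """Upper bound on the dot product q.k for any key k inside the chunk."""
    lo, hi = abstract
    # Per dimension, the largest contribution is reached at either the min or the max.
    return sum(max(q * l, q * h) for q, l, h in zip(query, lo, hi))


def select_chunks(query: Vector, abstracts: List[Abstract], budget: int) -> List[int]:
    """Return indices of the `budget` chunks with the highest bound, in index order."""
    ranked = sorted(
        range(len(abstracts)),
        key=lambda i: score_upper_bound(query, abstracts[i]),
        reverse=True,
    )
    return sorted(ranked[:budget])


# Toy data: three chunks of 2-dimensional keys and one query vector.
chunks = [
    [[1.0, 0.0], [0.5, 0.2]],
    [[-1.0, 0.0], [-0.5, -0.1]],
    [[0.0, 2.0], [0.1, 1.0]],
]
query = [1.0, 0.0]
abstracts = [chunk_abstract(c) for c in chunks]
selected = select_chunks(query, abstracts, budget=2)
```

In this sketch only the small `abstracts` list would need to be read to decide which chunks matter; the full keys and values of the unselected chunks never leave slower storage, which is the transmission saving the abstract describes.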