🤖 AI Summary
This work addresses the memory and computational bottlenecks of long-context large language model (LLM) inference on edge devices, which arise from the linear growth of the key-value (KV) cache with context length. To overcome these limitations, the authors propose HillInfer, a framework that leverages a SmartSSD to establish a hierarchical KV cache architecture spanning CPU memory and storage. HillInfer incorporates an in-storage importance-aware mechanism that selectively evicts less critical KV entries, alongside an adaptive prefetching pipeline and a coordinated scheduling strategy across GPU, CPU, and SmartSSD that overlaps computation with data movement. Experimental results on a commercial PC platform demonstrate that HillInfer achieves up to 8.56× end-to-end inference speedup while preserving model accuracy.
📝 Abstract
Deploying Large Language Models (LLMs) on edge devices such as PCs enables low-latency inference with strong privacy guarantees, but long-context inference is fundamentally constrained by limited memory and compute resources. Beyond model parameters, the KV cache becomes the dominant bottleneck due to its linear growth with context length. Although prior work exploits contextual sparsity to evict unimportant KV data, these approaches are largely designed for memory-rich platforms and incur prohibitive data transfer overhead when applied to resource-constrained edge devices with external storage. In this paper, we propose HillInfer, an importance-aware long-context LLM inference framework on the edge that leverages SmartSSD-assisted hierarchical KV cache management. HillInfer jointly manages KV cache pools across the CPU and SmartSSD, and performs in-storage importance evaluation to reduce unnecessary data movement. Furthermore, we design an adaptive, prefetch-based pipeline that overlaps computation and KV data transfer across GPU, CPU, and SmartSSD, minimizing end-to-end inference latency without sacrificing accuracy. We implement HillInfer on a PC with a commodity GPU, and experiments across multiple models and benchmarks demonstrate up to 8.56× speedup over baselines while preserving model accuracy.
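The importance-aware KV eviction described above can be illustrated with a minimal sketch. This is a generic example, not HillInfer's actual in-storage policy: the function name, the attention-mass scoring rule, and the `keep_ratio` parameter are all assumptions chosen for illustration.

```python
import numpy as np

def evict_kv_by_importance(keys, values, attn_scores, keep_ratio=0.5):
    """Retain only the most-attended KV entries.

    Illustrative sketch: scores each cached position by the total
    attention mass it received from recent decode steps, then keeps
    the top fraction. HillInfer's in-storage evaluation may differ.

    keys, values: (seq_len, head_dim) cached tensors for one head
    attn_scores:  (num_queries, seq_len) attention weights
    """
    # Importance of a cached token = attention mass it received.
    importance = attn_scores.sum(axis=0)          # (seq_len,)
    k = max(1, int(len(importance) * keep_ratio))
    # Top-k positions, restored to their original sequence order.
    keep = np.sort(np.argsort(importance)[-k:])
    return keys[keep], values[keep], keep

# Toy example: 8 cached tokens, 4-dim head, 2 recent query steps.
rng = np.random.default_rng(0)
K = rng.standard_normal((8, 4))
V = rng.standard_normal((8, 4))
scores = rng.random((2, 8))
K2, V2, kept = evict_kv_by_importance(K, V, scores, keep_ratio=0.5)
print(K2.shape, kept)
```

Evicting by accumulated attention mass is one common instantiation of contextual sparsity; the point of performing this scoring *in storage* is that unimportant entries can be dropped before they ever cross the SSD-to-CPU link.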