AttnCache: Accelerating Self-Attention Inference for LLM Prefill via Attention Cache

📅 2025-10-29
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Large language models (LLMs) suffer from quadratic computational complexity—O(L²)—in self-attention during prefill-only inference tasks (e.g., classification, QA, recommendation, text embedding), creating a critical performance bottleneck. Method: The authors first observe significant structural similarity in multi-layer, multi-head attention maps across semantically distinct sentences; leveraging this insight, they propose a memory-database–backed framework that caches and reuses attention maps across requests. The approach integrates efficient similarity retrieval, lightweight cache management, and dynamic update strategies, and supports both CPU and GPU deployment. Contribution/Results: Experiments show negligible accuracy degradation (<0.3%), with end-to-end speedups of 1.2× on CPU (2.0× in attention computation) and 1.6× on GPU (3.0× in attention computation), substantially improving throughput for prefill-dominated LLM inference.

📝 Abstract
Large Language Models (LLMs) are widely used in generative applications such as chatting, code generation, and reasoning. However, many real-world workloads such as classification, question answering, recommendation, and text embedding rely solely on the prefill stage of inference, where the model encodes input sequences without performing autoregressive decoding. In these prefill-only scenarios, the self-attention computation becomes the primary performance bottleneck due to its quadratic complexity with respect to sequence length. In this paper, we observe that semantically different sentences often produce similar attention maps across layers and heads. Building on this insight, we propose AttnCache, a framework that accelerates the prefill stage of LLM inference by retrieving and reusing similar attention maps. Based on an attention map memorization database, AttnCache employs efficient caching and similarity search techniques to identify and reuse pre-cached attention maps during inference, thereby reducing the computational overhead of self-attention. Experimental results show that AttnCache achieves an average of 1.2x end-to-end and 2x attention speedup on CPU, and 1.6x end-to-end and 3x attention speedup on GPU, with negligible accuracy degradation.
Problem

Research questions and friction points this paper is trying to address.

Accelerating self-attention computation in LLM prefill stage
Reducing quadratic complexity bottleneck in non-autoregressive inference
Reusing cached attention maps to minimize computational overhead
Innovation

Methods, ideas, or system contributions that make the work stand out.

Reuses similar attention maps via caching
Accelerates prefill stage with similarity search
Cuts self-attention overhead (up to 3× attention speedup) with negligible accuracy loss
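To make the cache-and-reuse idea above concrete, here is a minimal sketch of cross-request attention-map caching. It is not the paper's implementation: the feature extractor (mean-pooled hidden states), the cosine-similarity threshold, and all names (`AttnMapCache`, `attention`) are illustrative assumptions. The key point it demonstrates is that on a cache hit the quadratic QKᵀ-softmax step is skipped entirely, and only the value projection is computed for the new request.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class AttnMapCache:
    """Hypothetical sketch: cache attention maps keyed by an
    input-summary vector, retrieved by cosine similarity."""

    def __init__(self, threshold=0.95):
        self.keys = []          # unit feature vectors summarizing inputs
        self.maps = []          # cached (L, L) attention maps
        self.threshold = threshold

    def _feature(self, h):
        # Mean-pool hidden states (L, d) into one unit vector (assumption).
        v = h.mean(axis=0)
        return v / np.linalg.norm(v)

    def lookup(self, h):
        if not self.keys:
            return None
        q = self._feature(h)
        sims = np.stack(self.keys) @ q          # cosine similarity to all keys
        i = int(sims.argmax())
        return self.maps[i] if sims[i] >= self.threshold else None

    def insert(self, h, attn_map):
        self.keys.append(self._feature(h))
        self.maps.append(attn_map)

def attention(h, Wq, Wk, Wv, cache):
    """Single-head self-attention that reuses a cached map when one
    is similar enough; otherwise computes and caches a fresh map."""
    A = cache.lookup(h)
    if A is None:                               # cache miss: full O(L^2) path
        q, k = h @ Wq, h @ Wk
        A = softmax(q @ k.T / np.sqrt(k.shape[-1]))
        cache.insert(h, A)
    return A @ (h @ Wv)                         # values still computed per request
```

In the real system this retrieval runs per layer and head against a memory database, with lightweight cache management and dynamic updates; the sketch collapses all of that into a single in-memory list to keep the control flow visible.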