KVLink: Accelerating Large Language Models via Efficient KV Cache Reuse

📅 2025-02-21

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

To remove the redundant computation that arises when shared documents are re-encoded for every query in RAG and similar LLM settings, this paper proposes an efficient KV cache reuse method: the KV cache of each document is precomputed independently, and the caches of retrieved documents are concatenated at inference. The method has three key components: (1) positional embeddings of the cached KVs are adjusted at inference to match their global positions after concatenation; (2) trainable special tokens restore self-attention across independently encoded documents; and (3) mixed-data fine-tuning improves performance while preserving the model's original capabilities. It is presented as the first approach enabling high-accuracy, low-overhead cross-query KV cache reuse. Evaluated on 7 question answering datasets, it improves accuracy by an average of 4% over state-of-the-art methods and reduces time-to-first-token by up to 90% compared to standard LLM inference.

📝 Abstract

We describe KVLink, an approach for efficient key-value (KV) cache reuse in large language models (LLMs). In many LLM applications, different inputs can share overlapping context, such as the same retrieved document appearing in multiple queries. However, the LLMs still need to encode the entire context for each query, leading to redundant computation. In this paper, we propose a new strategy to eliminate such inefficiency, where the KV cache of each document is precomputed independently. During inference, the KV caches of retrieved documents are concatenated, allowing the model to reuse cached representations instead of recomputing them. To mitigate the performance degradation of LLMs when using KV caches computed independently for each document, KVLink introduces three key components: adjusting positional embeddings of the KV cache at inference to match the global position after concatenation, using trainable special tokens to restore self-attention across independently encoded documents, and applying mixed-data fine-tuning to enhance performance while preserving the model's original capabilities. Experiments across 7 datasets demonstrate that KVLink improves question answering accuracy by an average of 4% over state-of-the-art methods. Furthermore, by leveraging precomputed KV caches, our approach reduces time-to-first-token by up to 90% compared to standard LLM inference, making it a scalable and efficient solution for context reuse.
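The position adjustment described in the abstract can be illustrated with a minimal sketch. It assumes rotary position embeddings (RoPE), which the abstract does not specify, and it is not the authors' implementation. The function names `rope_rotate`, `shift_cached_key`, and `concat_key_caches` are hypothetical. The sketch relies on the fact that a RoPE rotation angle is linear in position, so a key cached at local position p can be moved to global position p + offset by rotating it a further `offset` steps.

```python
import math


def rope_rotate(vec, position, base=10000.0):
    """Apply a rotary position embedding to an even-length vector."""
    dim = len(vec)
    out = []
    for i in range(0, dim, 2):
        # Each pair (i, i + 1) is rotated by an angle proportional to position.
        theta = position * base ** (-i / dim)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


def shift_cached_key(rotated_key, offset, base=10000.0):
    """Move a key cached at a local position `offset` steps further along."""
    # Rotations compose additively, so this equals encoding at local + offset.
    return rope_rotate(rotated_key, offset, base)


def concat_key_caches(doc_key_caches):
    """Concatenate per-document key caches, shifting each to global positions."""
    merged, offset = [], 0
    for keys in doc_key_caches:
        merged.extend(shift_cached_key(k, offset) for k in keys)
        offset += len(keys)
    return merged
```

Only keys are shifted here because, under RoPE, values carry no positional rotation. A real system would apply the same idea per attention head and per layer on tensors rather than on Python lists.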

Problem

Research questions and friction points this paper is trying to address.

Efficient KV cache reuse in LLMs

Reduce redundant computation in LLMs

Improve question answering accuracy and time-to-first-token when caches are reused

Innovation

Methods, ideas, or system contributions that make the work stand out.

Precomputes independent KV caches

Concatenates KV caches for reuse

Adjusts positional embeddings of cached KVs to match global positions after concatenation
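The precompute-then-concatenate flow listed above can be sketched as a small per-document store. Everything here is an illustrative assumption rather than the paper's code: `DocumentKVStore`, `assemble`, and `toy_encode` are hypothetical names, and `toy_encode` is a placeholder for a real model forward pass that would produce key and value tensors.

```python
import hashlib


class DocumentKVStore:
    """Hypothetical store mapping document text to a precomputed cache."""

    def __init__(self, encode_fn):
        self._encode_fn = encode_fn  # stand-in for a model forward pass
        self._cache = {}
        self.encode_calls = 0

    def get(self, document):
        """Return the cached entries for a document, encoding it only once."""
        key = hashlib.sha256(document.encode("utf-8")).hexdigest()
        if key not in self._cache:
            self._cache[key] = self._encode_fn(document)
            self.encode_calls += 1
        return self._cache[key]

    def assemble(self, documents):
        """Concatenate cached entries and record each document's global start."""
        merged, offsets, start = [], [], 0
        for doc in documents:
            entries = self.get(doc)
            offsets.append(start)
            merged.extend(entries)
            start += len(entries)
        return merged, offsets


def toy_encode(document):
    # Placeholder: one "KV entry" per whitespace token, not a real model call.
    return [("kv", tok) for tok in document.split()]
```

The recorded start offsets are the quantities a positional adjustment would need, since the same document can land at a different global position in each query. Keying the store on a content hash lets a document retrieved by several queries be encoded once.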
