🤖 AI Summary
In multi-user LLM/MLLM inference, coarse-grained prefix caching restricts reuse to exact text prefixes, while semantic caching sacrifices response diversity. To address this, we propose a semantic-aware multi-user KV cache sharing mechanism. Our method introduces: (1) a fine-grained KV reuse paradigm grounded in semantic alignment and response-preserving differential editing; and (2) dynamic KV block alignment coupled with multi-user collaborative scheduling. Evaluated on real-world dialogue datasets, our approach achieves over a 60% improvement in KV cache hit rate, with no statistically significant degradation in BLEU or ROUGE-L scores. Moreover, GPU resource consumption is substantially reduced. To the best of our knowledge, this is the first work to enable high-accuracy, high-efficiency, cross-user, semantic-level KV caching while rigorously preserving generative diversity.
📝 Abstract
This paper presents KVShare, a multi-user Key-Value (KV) cache sharing mechanism based on semantic similarity, designed to enhance the inference efficiency of Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs). Addressing the limitations of existing prefix caching (strict text prefix matching) and semantic caching (loss of response diversity), KVShare achieves fine-grained KV cache reuse through semantic alignment algorithms and differential editing operations. Experiments on real-world user conversation datasets demonstrate that KVShare improves KV cache hit rates by over 60% while maintaining output quality comparable to full computation (no significant degradation in BLEU and ROUGE-L metrics). This approach effectively reduces GPU resource consumption and is applicable to scenarios with repetitive queries, such as healthcare and education.
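To make the core idea concrete, the sketch below illustrates the general pattern of semantic-similarity KV cache lookup the abstract describes: instead of requiring an exact prefix match, a new query retrieves the cached KV blocks of the most semantically similar prior query, which the serving engine would then patch via differential editing. This is a minimal illustration, not the paper's implementation; the class and method names are hypothetical, and the bag-of-words embedding is a stand-in for a real sentence encoder.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words embedding; a real system would use a sentence encoder."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(v * b.get(t, 0) for t, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


class SemanticKVCache:
    """Hypothetical semantic-level KV cache: reuse across similar (not identical) queries."""

    def __init__(self, threshold=0.8):
        self.threshold = threshold
        self.entries = []  # list of (embedding, query_text, kv_blocks)

    def insert(self, query, kv_blocks):
        self.entries.append((embed(query), query, kv_blocks))

    def lookup(self, query):
        """Return (matched_query, kv_blocks) of the most similar cached query,
        or None if nothing clears the similarity threshold. The caller would
        then apply differential edits to the KV blocks for the mismatched spans."""
        q = embed(query)
        best, best_sim = None, 0.0
        for emb, text, kv in self.entries:
            sim = cosine(q, emb)
            if sim > best_sim:
                best, best_sim = (text, kv), sim
        return best if best is not None and best_sim >= self.threshold else None
```

Usage: after `cache.insert("what are symptoms of flu", kv)`, a paraphrase such as "what are the symptoms of flu" scores well above a 0.5 threshold and yields a cache hit, whereas an unrelated query returns `None` and falls back to full prefill.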