🤖 AI Summary
In multi-user LLM/MLLM inference, coarse-grained prefix caching restricts reuse to exact text prefixes, while semantic caching sacrifices response diversity. To address this, we propose a semantic-aware multi-user KV cache sharing mechanism. Our method introduces: (1) a fine-grained KV reuse paradigm grounded in semantic alignment and response-preserving differential editing; and (2) dynamic KV block alignment coupled with multi-user collaborative scheduling. Evaluated on real-world dialogue datasets, our approach achieves over a 60% improvement in KV cache hit rate, with no statistically significant degradation in BLEU or ROUGE-L scores. Moreover, GPU resource consumption is substantially reduced. To the best of our knowledge, this is the first work to enable high-accuracy, high-efficiency, cross-user, semantic-level KV caching while rigorously preserving generative diversity.
📝 Abstract
This paper presents KVShare, a multi-user Key-Value (KV) cache sharing mechanism based on semantic similarity, designed to enhance the inference efficiency of Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs). Addressing the limitations of existing prefix caching (strict text prefix matching) and semantic caching (loss of response diversity), KVShare achieves fine-grained KV cache reuse through semantic alignment algorithms and differential editing operations. Experiments on real-world user conversation datasets demonstrate that KVShare improves KV cache hit rates by over 60% while maintaining output quality comparable to full computation (no significant degradation in BLEU and ROUGE-L metrics). This approach effectively reduces GPU resource consumption and is applicable to scenarios with repetitive queries, such as healthcare and education.
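To make the core idea concrete, the sketch below illustrates the general pattern of semantic-similarity KV cache lookup the abstract describes: instead of requiring an exact prefix match, a new query retrieves the cached KV blocks of the most semantically similar prior query, which the serving engine would then patch via differential editing. This is a minimal illustration, not the paper's implementation; the class and method names are hypothetical, and the bag-of-words embedding is a stand-in for a real sentence encoder.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words embedding; a real system would use a sentence encoder."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(v * b.get(t, 0) for t, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


class SemanticKVCache:
    """Hypothetical semantic-level KV cache: reuse across similar (not identical) queries."""

    def __init__(self, threshold=0.8):
        self.threshold = threshold
        self.entries = []  # list of (embedding, query_text, kv_blocks)

    def insert(self, query, kv_blocks):
        self.entries.append((embed(query), query, kv_blocks))

    def lookup(self, query):
        """Return (matched_query, kv_blocks) of the most similar cached query,
        or None if nothing clears the similarity threshold. The caller would
        then apply differential edits to the KV blocks for the mismatched spans."""
        q = embed(query)
        best, best_sim = None, 0.0
        for emb, text, kv in self.entries:
            sim = cosine(q, emb)
            if sim > best_sim:
                best, best_sim = (text, kv), sim
        return best if best is not None and best_sim >= self.threshold else None
```

Usage: after `cache.insert("what are symptoms of flu", kv)`, a paraphrase such as "what are the symptoms of flu" scores well above a 0.5 threshold and yields a cache hit, whereas an unrelated query returns `None` and falls back to full prefill.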