CacheFocus: Dynamic Cache Re-Positioning for Efficient Retrieval-Augmented Generation

📅 2025-02-16
🤖 AI Summary
To address input-length limitations, high inference latency, and inefficient KV cache reuse in long-context processing with large language models (LLMs), this paper proposes CacheFocus, a training-free cache optimization framework built on query-independent, offline caching. The method comprises: (1) dynamic re-positioning of cached keys, which enables offline KV cache reuse; (2) layer-adaptive cache pruning, which discards low-relevance caches during pre-filling; and (3) adaptive positional allocation, which reassigns cache positions to make full use of the available positional encoding range. Evaluated on Natural Questions and TriviaQA, the approach outperforms alternative methods even when inputs exceed the 4K limit of LLaMA-2, and it maintains consistent performance with Qwen2 as the number of retrieved documents increases, while reducing inference latency.

📝 Abstract
Large Language Models (LLMs) excel across a variety of language tasks yet are constrained by limited input lengths and high computational costs. Existing approaches, such as relative positional encodings (e.g., RoPE, ALiBi) and sliding window mechanisms, partially alleviate these issues but often require additional training or suffer from performance degradation with longer inputs. In this paper, we introduce CacheFocus, a method that enhances length normalization and reduces inference latency without any further training. Our approach leverages query-independent, offline caching to efficiently reuse a Context KV Cache Store. We address the problem of amplified abnormal token distributions by re-positioning cached keys and by introducing Layer-Adaptive Cache Pruning, which discards low-relevance caches during pre-filling. Additionally, our Adaptive Positional Allocation Strategy dynamically reassigns cache positions to maximize the use of the available positional encoding range. Experiments on the Natural Questions and TriviaQA datasets demonstrate that CacheFocus outperforms alternative methods even when inputs exceed the 4K limit of the LLaMA-2 model, emphasizing its practical effectiveness for long-context LLMs. Moreover, with the larger maximum input length of Qwen2, CacheFocus maintains consistent performance as the number of documents increases, effectively managing long-text generation without degradation.
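The re-positioning of cached keys mentioned in the abstract relies on a general property of RoPE: rotations compose additively, so a key cached at one position can be moved to another by rotating it with the position offset, without recomputing it from the raw projection. The sketch below illustrates only that property. It is not the paper's implementation; the function names `rope_rotate` and `reposition_key`, the interleaved pairing of dimensions, and the base of 10000 are assumptions made for illustration.

```python
import math

def rope_rotate(vec, position, base=10000.0):
    """Apply a RoPE rotation for `position` to an even-length vector.

    Assumes the interleaved pairing (dims 0-1, 2-3, ...); some model
    implementations pair dimensions differently.
    """
    dim = len(vec)
    out = []
    for i in range(0, dim, 2):
        theta = position * base ** (-i / dim)  # per-pair rotation angle
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out

def reposition_key(cached_key, old_pos, new_pos, base=10000.0):
    """Hypothetical helper: move a cached key from old_pos to new_pos
    by rotating it with the offset (new_pos - old_pos)."""
    return rope_rotate(cached_key, new_pos - old_pos, base)

raw_key = [0.3, -1.2, 0.7, 0.5]          # toy key before positional encoding
cached = rope_rotate(raw_key, 5)         # stored offline at position 5
moved = reposition_key(cached, 5, 42)    # re-positioned at query time
direct = rope_rotate(raw_key, 42)        # fresh encoding at position 42

# The two results agree up to floating-point error.
max_diff = max(abs(a - b) for a, b in zip(moved, direct))
```

Because the offset rotation costs one pass over the cached keys, a document's cache can be computed once at an arbitrary position and placed wherever the query-time layout requires.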
Problem

Research questions and friction points this paper is trying to address.

Enhance length normalization
Reduce inference latency
Manage long-text generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic cache re-positioning enhances efficiency
Layer-adaptive cache pruning discards low-relevance caches
Adaptive positional allocation maximizes encoding range
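The pruning idea in the list above can be sketched as follows, under loudly stated assumptions: the page does not say how relevance is scored or at which layers pruning happens, so the per-layer scores, the `prune_layers` set, and the `keep_ratio` parameter here are all hypothetical. The sketch only shows the general pattern of dropping the lowest-scoring document caches at selected layers so that later layers process fewer caches.

```python
import math

def prune_caches(doc_ids, scores_by_layer, prune_layers, keep_ratio=0.5):
    """Hypothetical layer-wise pruning of cached documents.

    scores_by_layer[l][doc] is an assumed relevance score for `doc` at
    layer l (for example, attention mass from the query to that document).
    At each layer in `prune_layers`, only the top `keep_ratio` fraction
    of the still-active documents survives.
    """
    active = list(doc_ids)
    for layer, scores in enumerate(scores_by_layer):
        if layer in prune_layers and len(active) > 1:
            keep = max(1, math.ceil(len(active) * keep_ratio))
            ranked = sorted(active, key=lambda d: scores[d], reverse=True)
            kept = set(ranked[:keep])
            # Preserve the original document order among survivors.
            active = [d for d in active if d in kept]
    return active

docs = ["a", "b", "c", "d"]
scores = [
    {"a": 0.25, "b": 0.25, "c": 0.25, "d": 0.25},  # layer 0: no pruning here
    {"a": 0.1, "b": 0.4, "c": 0.3, "d": 0.2},      # layer 1: keep top 2
    {"a": 0.0, "b": 0.2, "c": 0.6, "d": 0.0},      # layer 2: keep top 1
]
survivors = prune_caches(docs, scores, prune_layers={1, 2})
```

With these toy scores, documents `b` and `c` survive layer 1 and only `c` survives layer 2.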
Kun-Hui Lee
Jeonbuk National University
Eunhwan Park
Jeonbuk National University
Donghoon Han
Seoul National University
Seung-Hoon Na
UNIST
Natural Language Processing, Information Retrieval