Rethinking LoRA Memory Through the Lens of KV Cache Compression

📅 2026-06-04

🤖 AI Summary

This study examines how parametric memory, such as document-specific Low-Rank Adaptation (LoRA) adapters, interacts with contextual memory stored in the key-value (KV) cache in document-level question answering, where context is often compressed. The authors progressively evict document key-value states and measure when a document LoRA contributes beyond the retained context. Document LoRA adds little when the KV cache is largely intact but becomes increasingly useful under aggressive compression: with all document context removed from the KV cache, it recovers 13-21 ROUGE-L points. The gain is largest when the base model encodes the document and the adapter is applied only during answer generation, which suggests that document LoRA acts as decoding-time parametric memory rather than as a document encoder. Adapters trained with QA-style supervision are substantially stronger than those trained with raw-context next-token prediction.

📝 Abstract

Parametric retrieval augmentation encodes document information into lightweight, document-specific modules such as LoRA adapters, reducing the need to include all evidence as input context. However, it remains unclear how this parameter-side memory interacts with context-side memory stored in the KV cache. We study this interaction in document-level question answering by progressively evicting document key-value states and measuring when a document LoRA contributes beyond the retained context. We find that document LoRA adds little when the KV cache is largely intact, but becomes increasingly useful under aggressive compression, recovering 13-21 ROUGE-L points when no document context remains. The gain is largest when the base model encodes the document and the adapter is applied only during answer generation, suggesting that document LoRA is better understood as decoding-time parametric memory than as a document encoder. Finally, QA-style supervision produces substantially stronger adapters than raw-context next-token prediction. These results position document LoRA as a complementary memory channel whose value emerges precisely when context-side evidence is scarce.
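The two mechanisms the abstract combines, evicting document key-value states and switching a LoRA adapter on only for answer generation, can be illustrated with a toy sketch. This is not the paper's implementation: the function names (`evict_document_kv`, `linear_with_lora`), the "keep the most recent positions" eviction rule, and the `adapter_on` flag are all assumptions made for illustration, since the abstract does not specify the eviction policy or the adapter configuration.

```python
from typing import List

Matrix = List[List[float]]
Vector = List[float]


def evict_document_kv(doc_len: int, keep_ratio: float) -> List[int]:
    """Return the document positions whose KV states stay in the cache.

    Assumption: the most recent positions are kept. The paper's actual
    eviction policy is not stated in the abstract.
    """
    if not 0.0 <= keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in [0, 1]")
    n_keep = int(doc_len * keep_ratio)  # floor; 0 means no document context remains
    return list(range(doc_len - n_keep, doc_len))


def matvec(m: Matrix, x: Vector) -> Vector:
    """Plain matrix-vector product, kept dependency-free."""
    return [sum(a * b for a, b in zip(row, x)) for row in m]


def linear_with_lora(x: Vector, w: Matrix, a: Matrix, b: Matrix,
                     scale: float, adapter_on: bool) -> Vector:
    """Compute W x, plus the low-rank update scale * B (A x) if enabled.

    Passing adapter_on=False while the document is encoded and
    adapter_on=True while the answer is generated mimics the
    "base model encodes, adapter decodes" setting (hypothetical flag).
    """
    base = matvec(w, x)
    if not adapter_on:
        return base
    delta = matvec(b, matvec(a, x))
    return [y + scale * d for y, d in zip(base, delta)]


# Sweep the retention ratio from an intact cache down to full eviction.
for ratio in (1.0, 0.5, 0.0):
    print(ratio, evict_document_kv(8, ratio))
```

With `keep_ratio=0.0` the retained list is empty, which corresponds to the setting where no document context remains and any document information must come from the adapter weights.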

Problem

Research questions and friction points this paper is trying to address.

LoRA

KV cache

parametric memory

document-level question answering

memory interaction

Innovation

Methods, ideas, or system contributions that make the work stand out.

LoRA

KV cache compression

parametric memory