MiniPIC: Flexible Position-Independent Caching in <100LOC

📅 2026-06-11

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing prefix caching can reuse KV caches for structured inputs only when request prefixes are identical, while current Position-Independent Caching (PIC) approaches either require extensive server-side modifications or keep KV state outside the server and incur substantial host-to-device transfer overhead. This work proposes MiniPIC, a lightweight PIC design for vLLM that combines a positional-encoding-free KV cache with three user-controllable primitives: block-aligned padding, span separator (SSep), and prompt dependency (PDep). With fewer than 100 lines of core-engine changes plus a custom attention backend, it supports Block-Attention, EPIC, and Prompt Cache within the same vLLM instance and integrates natively with KV cache CPU offloading. On 2WikiMultihopQA, MiniPIC with interleaved scheduling improves prefill throughput by 49% over baseline vLLM, reduces time-to-first-token for cached spans by up to two orders of magnitude, preserves linear prefill scaling for uncached spans, and incurs a worst-case overhead of only 5.7%.
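The prefix-identity limitation described above can be illustrated with a toy model. The sketch below is not vLLM's or MiniPIC's code: `BLOCK`, `PAD`, `pad_to_block`, `block_hashes`, and `span_hashes` are hypothetical names, and the assumption is that block hashes are chained through a parent hash, so a block's hash depends on everything before it. Padding each span to a block boundary and starting a fresh hash chain per span is one way the same document could hit the cache after different prefixes.

```python
import hashlib

BLOCK = 4   # tokens per KV block (hypothetical; real engines use larger blocks)
PAD = -1    # hypothetical padding token id


def pad_to_block(tokens):
    """Pad a span so that it ends exactly on a block boundary."""
    return tokens + [PAD] * (-len(tokens) % BLOCK)


def block_hashes(tokens, parent=""):
    """Hash every full block; each hash also covers all earlier blocks via `parent`."""
    hashes = []
    full = len(tokens) - len(tokens) % BLOCK  # ignore a trailing partial block
    for i in range(0, full, BLOCK):
        payload = parent + "|" + ",".join(map(str, tokens[i:i + BLOCK]))
        parent = hashlib.sha256(payload.encode()).hexdigest()
        hashes.append(parent)
    return hashes


def span_hashes(spans):
    """Hash each padded span with a fresh chain, so earlier spans do not affect it."""
    return [block_hashes(pad_to_block(span)) for span in spans]


doc = [7, 8, 9, 10, 11, 12]          # the recurring span, e.g. a retrieved document
request_1 = [[1, 2, 3], doc]         # same document, different leading text
request_2 = [[4, 5], doc]

# Plain prefix hashing over the flattened token stream: no block is shared.
flat_1 = block_hashes(request_1[0] + doc)
flat_2 = block_hashes(request_2[0] + doc)
prefix_hits = set(flat_1) & set(flat_2)

# Span-local hashing with block-aligned padding: the document's blocks match.
pic_1 = span_hashes(request_1)
pic_2 = span_hashes(request_2)
doc_blocks_match = pic_1[1] == pic_2[1]
```

In this toy setup the two requests share no block hash under prefix chaining, while the padded, span-local hashes of the document are identical in both requests.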

📝 Abstract

Retrieval-augmented and agentic workloads repeatedly prefill recurring, predictable, structured inputs (which we call "spans") such as documents and code files. Yet prefix caching in engines such as vLLM cannot reuse their KV entries unless they share identical prefixes with another request, while Position-Independent Caching (PIC) implementations within production-grade inference servers typically either require substantial server code changes or keep KV state outside the server, incurring host-to-device transfer overhead. We present Minimalistic PIC (MiniPIC): a minimal, flexible, and fast vLLM design built from two ingredients: a positional-encoding-free KV cache and user-controlled cache-reuse primitives. MiniPIC stores unrotated K vectors in the KV cache, applies RoPE to K tiles inside attention using per-request logical positions, and exposes three user-facing, token-level primitives, namely block-aligned padding, span separator (SSep), and prompt depend (PDep), which modify hashing behavior and the effective block-level causal attention structure. With fewer than 100 lines of core-engine changes plus a custom attention backend, these primitives are sufficient to realize multiple PIC methods, including Block-Attention, EPIC, and Prompt Cache, within the same running vLLM instance, while natively integrating with KV cache CPU offload implementations. On 2WikiMultihopQA, MiniPIC with interleaved scheduling improves prefill throughput by 49% over baseline vLLM, reduces cached-span time-to-first-token by up to two orders of magnitude, preserves the linear prefill scaling of uncached spans, and incurs only 5.7% worst-case overhead.
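The idea of caching unrotated K vectors and applying RoPE at attention time can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the MiniPIC attention backend: `rope`, `kv_cache`, and `scores` are hypothetical names, the rotation is the standard pairwise RoPE formulation, and real kernels operate on tiles of K on the GPU rather than on Python lists.

```python
import math


def rope(vec, pos, base=10000.0):
    """Standard RoPE: rotate each consecutive pair of `vec` by a position-dependent angle."""
    d = len(vec)
    out = [0.0] * d
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)  # angle for the pair (i, i + 1)
        c, s = math.cos(theta), math.sin(theta)
        out[i] = vec[i] * c - vec[i + 1] * s
        out[i + 1] = vec[i] * s + vec[i + 1] * c
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


# Hypothetical cache: span id -> UNROTATED key vectors, one per token of the span.
kv_cache = {"doc_A": [[0.5, -1.0, 2.0, 0.25], [1.5, 0.0, -0.5, 1.0]]}


def scores(query, query_pos, span_id, span_start):
    """Attention logits of one query against a cached span placed at `span_start`.

    Keys are rotated here, at read time, using the logical positions this
    request assigns to the span, so the cache entry carries no position.
    """
    q = rope(query, query_pos)
    keys = kv_cache[span_id]
    return [dot(q, rope(k, span_start + j)) for j, k in enumerate(keys)]


q = [1.0, 0.5, -0.5, 2.0]
at_front = scores(q, query_pos=10, span_id="doc_A", span_start=0)
shifted = scores(q, query_pos=110, span_id="doc_A", span_start=100)

# What a cache of keys pre-rotated at positions 0 and 1 would yield after the shift.
stale = [dot(rope(q, 110), rope(k, j)) for j, k in enumerate(kv_cache["doc_A"])]
```

Because RoPE logits depend only on the relative offset between query and key positions, `at_front` and `shifted` agree up to floating-point error even though the span sits at different absolute positions. The `stale` logits, computed from keys rotated at their original positions, generally do not match, which is why a cache of already-rotated keys cannot be reused at a new position without correction.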

Problem

Research questions and friction points this paper is trying to address.

Position-Independent Caching

KV cache reuse

prefix caching

retrieval-augmented generation

inference optimization

Innovation

Methods, ideas, or system contributions that make the work stand out.

Position-Independent Caching

KV Cache

RoPE