Using Span Queries to Optimize for Cache and Attention Locality

📅 2025-11-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the challenge of adapting inference servers to diverse emerging workloads—such as retrieval-augmented generation (RAG), inference-time scaling, and agent-based reasoning—this paper introduces the *span query* framework. It unifies tasks including chat, retrieval augmentation, and deep reasoning into expression trees subject to commutativity constraints, and shows, for the first time, how input-order sensitivity shapes both KV cache efficiency and attention locality. Methodologically, the paper formalizes span query syntax and semantics, jointly optimizing KV cache hit rates and attention locality with minimal system modification: only 492 lines changed in vLLM. Experiments demonstrate a 10–20× reduction in time-to-first-token (TTFT) across two non-chat scenarios. Moreover, an attention-optimized span query enables a 2B-parameter model to surpass the accuracy of a stock inference server using an 8B-parameter model, highlighting substantial gains in both latency and model efficiency.

📝 Abstract
Clients are evolving beyond chat completion, and now include a variety of innovative inference-time scaling and deep reasoning techniques. At the same time, inference servers remain heavily optimized for chat completion. Prior work has shown that large improvements to KV cache hit rate are possible if inference servers evolve towards these non-chat use cases. However, that work offers solutions that are optimized for a single use case, RAG. In this paper, we introduce the span query to generalize the interface to the inference server. We demonstrate that chat, RAG, inference-time scaling, and agentic workloads can all be expressed as span queries. We show that the critical distinction assumed by prior work lies in whether the order of the inputs matters -- do they commute? In chat, they do not. In RAG, they often do. This paper introduces span queries, which are expression trees of inference calls, linked together with commutativity constraints. We describe span query syntax and semantics. We show how they can be automatically optimized to improve KV cache locality. We show how a small change to vLLM (affecting only 492 lines) can enable high-performance execution of span queries. Using this stack, we demonstrate that span queries can achieve 10-20x reductions in TTFT for two distinct non-chat use cases. Finally, we show that span queries can also be optimized to improve attention locality, so as to avoid the so-called lost-in-the-middle problem. We demonstrate that an attention-optimized span query on a 2B-parameter model vastly outperforms the accuracy of a stock inference server using an 8B model.
Problem

Research questions and friction points this paper is trying to address.

- Optimizing inference servers for non-chat workloads such as RAG and agentic tasks
- Improving KV cache locality through a generalized interface called span queries
- Solving attention locality problems such as the lost-in-the-middle phenomenon
Innovation

Methods, ideas, or system contributions that make the work stand out.

- Introducing span queries as expression trees of inference calls with commutativity constraints
- Automatically optimizing span queries to improve KV cache locality
- Modifying vLLM (492 lines) for high-performance execution of span queries
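The core idea above — an expression tree of inference inputs in which some groups of children are marked as commutative, so an optimizer may reorder them to match already-cached KV prefixes — can be illustrated with a minimal sketch. Note that the class and function names here (`Span`, `SpanQuery`, `optimize_for_cache`) are hypothetical, not the paper's actual API or vLLM's; this only shows the flavor of the optimization.

```python
from dataclasses import dataclass


@dataclass
class Span:
    """A leaf: one contiguous block of input text."""
    text: str


@dataclass
class SpanQuery:
    """An interior node: a sequence of spans or sub-queries.

    If `commutative` is True, the children's order does not affect
    the result, so the server is free to reorder them.
    """
    children: list
    commutative: bool = False


def optimize_for_cache(node, cached_prefixes):
    """Reorder commutative children so that spans whose KV state is
    already cached come first, improving the cache hit rate."""
    if isinstance(node, Span):
        return node
    children = [optimize_for_cache(c, cached_prefixes) for c in node.children]
    if node.commutative:
        # Stable sort: cached spans move to the front, everything
        # else keeps its relative order.
        children.sort(
            key=lambda c: not (isinstance(c, Span) and c.text in cached_prefixes)
        )
    return SpanQuery(children, node.commutative)


# Example: a RAG-style query. The retrieved documents commute with
# one another, but the user question must remain last overall.
docs = SpanQuery([Span("doc A"), Span("doc B"), Span("doc C")], commutative=True)
query = SpanQuery([docs, Span("user question")], commutative=False)

optimized = optimize_for_cache(query, cached_prefixes={"doc B"})
# The cached "doc B" is now the first document, so its KV prefix
# can be reused; the question still comes after all documents.
```

This captures why commutativity is the critical distinction the abstract highlights: in chat the turns form a non-commutative sequence and no such reordering is legal, while in RAG the retrieved documents often commute, opening the door to cache-locality (and, analogously, attention-locality) optimization.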
Paul Castro
Senior Research Manager and Scientist, IBM Research
Cloud Computing · Mobile Computing
Nick Mitchell
IBM Research, New York, USA
Nathan Ordonez
IBM Research, Zurich, Switzerland
Thomas Parnell
Principal Research Scientist, IBM Research
Machine Learning and Systems
M. Srivatsa
IBM Research, New York, USA
Antoni Viros i Martin
IBM Research, Massachusetts, USA