Using Span Queries to Optimize for Cache and Attention Locality

📅 2025-11-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the challenge of adapting inference servers to diverse emerging workloads—such as retrieval-augmented generation (RAG), inference-time scaling, and agent-based reasoning—this paper introduces the *span query* framework. It unifies tasks including chat, retrieval augmentation, and deep reasoning into expression trees subject to commutativity constraints, and shows, for the first time, how input-order sensitivity shapes both KV cache efficiency and attention locality. Methodologically, the paper formalizes span query syntax and semantics, jointly optimizing KV cache hit rates and attention locality with minimal system modification: only 492 lines changed in vLLM. Experiments demonstrate a 10–20× reduction in time-to-first-token (TTFT) across two non-chat scenarios. Moreover, an attention-optimized span query enables a 2B-parameter model to surpass the accuracy of a stock inference server using an 8B-parameter model, highlighting substantial gains in both latency and model efficiency.

📝 Abstract
Clients are evolving beyond chat completion, and now include a variety of innovative inference-time scaling and deep reasoning techniques. At the same time, inference servers remain heavily optimized for chat completion. Prior work has shown that large improvements to KV cache hit rate are possible if inference servers evolve towards these non-chat use cases. However, that work offers solutions that are optimized for a single use case, RAG. In this paper, we introduce the span query to generalize the interface to the inference server. We demonstrate that chat, RAG, inference-time scaling, and agentic workloads can all be expressed as span queries. We show that the critical distinction assumed by prior work lies in whether the order of the inputs matters -- do they commute? In chat, they do not. In RAG, they often do. This paper introduces span queries, which are expression trees of inference calls, linked together with commutativity constraints. We describe span query syntax and semantics. We show how they can be automatically optimized to improve KV cache locality. We show how a small change to vLLM (affecting only 492 lines) can enable high-performance execution of span queries. Using this stack, we demonstrate that span queries can achieve 10-20x reductions in TTFT for two distinct non-chat use cases. Finally, we show that span queries can also be optimized to improve attention locality, so as to avoid the so-called lost-in-the-middle problem. We demonstrate that an attention-optimized span query on a 2B-parameter model vastly outperforms the accuracy of a stock inference server using an 8B model.
Problem

Research questions and friction points this paper is trying to address.

- Optimizing inference servers for non-chat workloads such as RAG and agentic tasks
- Improving KV cache locality through a generalized interface called span queries
- Solving attention locality problems such as the lost-in-the-middle phenomenon
Innovation

Methods, ideas, or system contributions that make the work stand out.

- Introducing span queries as expression trees of inference calls with commutativity constraints
- Automatically optimizing span queries to improve KV cache locality
- Modifying vLLM (492 lines) for high-performance execution of span queries
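The core idea above — an expression tree of inference inputs in which some groups of children are marked as commutative, so an optimizer may reorder them to match already-cached KV prefixes — can be illustrated with a minimal sketch. Note that the class and function names here (`Span`, `SpanQuery`, `optimize_for_cache`) are hypothetical, not the paper's actual API or vLLM's; this only shows the flavor of the optimization.

```python
from dataclasses import dataclass


@dataclass
class Span:
    """A leaf: one contiguous block of input text."""
    text: str


@dataclass
class SpanQuery:
    """An interior node: a sequence of spans or sub-queries.

    If `commutative` is True, the children's order does not affect
    the result, so the server is free to reorder them.
    """
    children: list
    commutative: bool = False


def optimize_for_cache(node, cached_prefixes):
    """Reorder commutative children so that spans whose KV state is
    already cached come first, improving the cache hit rate."""
    if isinstance(node, Span):
        return node
    children = [optimize_for_cache(c, cached_prefixes) for c in node.children]
    if node.commutative:
        # Stable sort: cached spans move to the front, everything
        # else keeps its relative order.
        children.sort(
            key=lambda c: not (isinstance(c, Span) and c.text in cached_prefixes)
        )
    return SpanQuery(children, node.commutative)


# Example: a RAG-style query. The retrieved documents commute with
# one another, but the user question must remain last overall.
docs = SpanQuery([Span("doc A"), Span("doc B"), Span("doc C")], commutative=True)
query = SpanQuery([docs, Span("user question")], commutative=False)

optimized = optimize_for_cache(query, cached_prefixes={"doc B"})
# The cached "doc B" is now the first document, so its KV prefix
# can be reused; the question still comes after all documents.
```

This captures why commutativity is the critical distinction the abstract highlights: in chat the turns form a non-commutative sequence and no such reordering is legal, while in RAG the retrieved documents often commute, opening the door to cache-locality (and, analogously, attention-locality) optimization.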
Paul Castro
Senior Research Manager and Scientist, IBM Research
Cloud Computing · Mobile Computing
Nick Mitchell
IBM Research, New York, USA
Nathan Ordonez
IBM Research, Zurich, Switzerland
Thomas Parnell
Principal Research Scientist, IBM Research
Machine Learning and Systems
M. Srivatsa
IBM Research, New York, USA
Antoni Viros i Martin
IBM Research, Massachusetts, USA