QCFuse: Query-Aware Cache Fusion via Compressed View for Efficient RAG Serving

📅 2026-06-04

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the high cost of the prefill phase in Retrieval-Augmented Generation (RAG) serving, where existing cache fusion methods struggle to balance generation quality and inference efficiency. The authors propose QCFuse, a query-aware cache fusion selector that works from a compressed view: chunk-anchor query probing conditions user-query states on compact per-chunk anchors, and critical-layer profiling identifies the tokens that need recomputation without inspecting every layer. This removes the dependency of earlier query-aware selectors on full-context and full-layer visibility. Implemented in SGLang, the system reaches full-prefill-level generation quality on four open-weight large language models across six datasets and, at matched quality, achieves an average prefill-time speedup of 1.7× over full prefill and 1.5× over ProphetKV.

📝 Abstract

Retrieval-augmented generation (RAG) improves large language model (LLM) answer quality by grounding generation in external evidence, but processing retrieved contexts makes the prefill stage a dominant serving cost. RAG cache fusion reduces this cost by reusing precomputed key-value (KV) caches for retrieved chunks and selectively recomputing tokens under the current prompt. Existing selectors, however, face a dilemma between quality and efficiency: fast query-agnostic or final-layer query-to-context selectors can miss request-relevant evidence, whereas full-view query-aware selectors require broad context and layer visibility before recomputation and therefore stall the layer-wise cache-fusion pipeline. We present QCFuse, a compressed-view query-aware selector for RAG cache fusion. QCFuse uses chunk-anchor query probing to condition user-query states on compact per-chunk anchors and critical-layer profiling to identify recomputation tokens without all-layer inspection. We implement QCFuse in SGLang and evaluate it on four open-weight LLMs across six datasets. QCFuse reaches full-prefill-level quality. At matched quality, QCFuse achieves an average prefill-time speedup of 1.7x over full prefill and 1.5x over ProphetKV, the strongest quality-preserving baseline.
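The general idea of cache fusion described above (reuse cached KV entries, recompute only a selected subset of tokens) can be illustrated with a toy sketch. This is not the QCFuse selector: the dot-product scoring rule, the function names `select_tokens_to_recompute` and `fuse_cache`, the `budget` parameter, and the string stand-ins for KV entries are all assumptions made for illustration only.

```python
from typing import Callable, List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def select_tokens_to_recompute(
    query_vec: Sequence[float],
    token_keys: Sequence[Sequence[float]],
    budget: int,
) -> List[int]:
    """Pick the `budget` token positions that score highest against the query.

    Assumption: a dot product between a hypothetical query vector and each
    cached key stands in for whatever relevance signal a real selector uses.
    """
    scores = [dot(query_vec, key) for key in token_keys]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:budget])


def fuse_cache(
    cached_kv: Sequence[str],
    recompute_fn: Callable[[int], str],
    selected: Sequence[int],
) -> List[str]:
    """Reuse cached entries and overwrite only the selected positions."""
    fused = list(cached_kv)
    for i in selected:
        fused[i] = recompute_fn(i)
    return fused


# Toy usage: four cached tokens, recompute the two most query-relevant ones.
query = [1.0, 0.0]
keys = [[0.1, 0.9], [0.8, 0.1], [0.5, 0.5], [0.9, 0.0]]
cached = ["c0", "c1", "c2", "c3"]

selected = select_tokens_to_recompute(query, keys, budget=2)
fused = fuse_cache(cached, lambda i: f"r{i}", selected)
print(selected)  # [1, 3]
print(fused)     # ['c0', 'r1', 'c2', 'r3']
```

In this sketch the recomputation cost scales with `budget` rather than with the full context length, which is the source of the prefill savings that cache fusion targets. How the selection signal is obtained cheaply is the part the paper addresses.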

Problem

Research questions and friction points this paper is trying to address.

RAG

cache fusion

query-aware selection

prefill cost

KV cache reuse

Innovation

Methods, ideas, or system contributions that make the work stand out.

query-aware caching

compressed view

RAG cache fusion