SubGCache: Accelerating Graph-based RAG with Subgraph-level KV Cache

📅 2025-05-16
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
In graph-structured retrieval-augmented generation (RAG), repetitive large language model (LLM) inference triggered by semantically similar subgraph prompts leads to high time-to-first-token (TTFT). To address this, we propose a subgraph-level KV cache reuse framework. Our approach introduces three key contributions: (1) the first formal definition of a subgraph-granular KV cache unit, enabling structured prompt reuse across queries; (2) a subgraph embedding learning and hierarchical clustering method to automatically identify representative subgraphs and precompute their KV states; and (3) a dynamic cache scheduling mechanism jointly optimizing graph retrieval and text generation. Evaluated on multiple LLMs and graph RAG benchmarks, our method reduces TTFT by up to 6.68× while preserving or improving generation quality.

📝 Abstract
Graph-based retrieval-augmented generation (RAG) enables large language models (LLMs) to incorporate structured knowledge via graph retrieval as contextual input, enabling more accurate and context-aware reasoning. We observe that different queries can retrieve similar subgraphs as prompts, and thus we propose SubGCache, which aims to reduce inference latency by reusing computation across queries with similar structural prompts (i.e., subgraphs). Specifically, SubGCache clusters queries based on subgraph embeddings, constructs a representative subgraph for each cluster, and pre-computes the key-value (KV) cache of the representative subgraph. For each query whose retrieved subgraph falls within a cluster, it reuses the pre-computed KV cache of that cluster's representative subgraph without recomputing the KV tensors, thereby saving computation. Experiments on two new datasets across multiple LLM backbones and graph-based RAG frameworks demonstrate that SubGCache consistently reduces inference latency with comparable and even improved generation quality, achieving up to 6.68× reduction in time-to-first-token (TTFT).
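The core efficiency argument is that the KV tensors of a shared prompt prefix (here, the representative subgraph) can be computed once and concatenated with query-specific KV tensors, so only the short suffix needs prefill at query time. A minimal toy sketch of this prefix-KV reuse, using a single linear key/value projection in NumPy (all shapes and weights are illustrative, not the paper's implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                                  # toy head dimension
Wk = rng.normal(size=(d, d))           # key projection
Wv = rng.normal(size=(d, d))           # value projection

def kv(tokens):
    # Project token embeddings into key/value tensors (one toy head).
    return tokens @ Wk, tokens @ Wv

prefix = rng.normal(size=(5, d))       # shared representative-subgraph prompt
suffix = rng.normal(size=(3, d))       # query-specific tokens

# Precompute the prefix KV once (done offline per representative subgraph).
K_pre, V_pre = kv(prefix)

# At query time, only the suffix is projected; the cached KV is concatenated.
K_suf, V_suf = kv(suffix)
K = np.concatenate([K_pre, K_suf])
V = np.concatenate([V_pre, V_suf])

# Identical to recomputing KV over the full prompt, but with ~5/8 of the
# prefill work already done before the query arrived.
K_full, V_full = kv(np.concatenate([prefix, suffix]))
```

Because the KV projections are per-token, the concatenated cache matches a full recomputation exactly; the TTFT saving comes from moving the prefix portion of prefill offline.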
Problem

Research questions and friction points this paper is trying to address.

Reducing inference latency in graph-based RAG
Reusing computation for queries with similar subgraphs
Improving efficiency without sacrificing generation quality
Innovation

Methods, ideas, or system contributions that make the work stand out.

Clusters queries via subgraph embeddings
Pre-computes KV cache for representative subgraphs
Reuses KV cache to reduce inference latency
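The pipeline in the three points above can be sketched end to end: cluster subgraph embeddings, pick a representative per cluster, precompute its KV cache, and route new queries to the nearest representative. A minimal sketch with scikit-learn hierarchical clustering (the embedding values, the `precompute_kv` placeholder, and the nearest-representative routing are illustrative assumptions, not the paper's exact mechanism):

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Toy embeddings of 6 retrieved subgraphs (two visually separable groups).
embeddings = np.array([
    [0.90, 0.10], [0.80, 0.20], [0.85, 0.15],
    [0.10, 0.90], [0.20, 0.80], [0.15, 0.85],
])

# Hierarchical clustering groups structurally similar subgraphs.
labels = AgglomerativeClustering(n_clusters=2).fit_predict(embeddings)

# Representative per cluster: the member closest to the cluster centroid.
representatives = {}
for c in set(labels):
    members = np.where(labels == c)[0]
    centroid = embeddings[members].mean(axis=0)
    dists = np.linalg.norm(embeddings[members] - centroid, axis=1)
    representatives[c] = int(members[np.argmin(dists)])

def precompute_kv(subgraph_id):
    # Placeholder: a real system would run LLM prefill on the serialized
    # representative subgraph and store the resulting KV tensors.
    return f"kv_cache_for_subgraph_{subgraph_id}"

kv_store = {c: precompute_kv(rep) for c, rep in representatives.items()}

def lookup(query_embedding):
    # Route a new query's retrieved subgraph to the nearest representative
    # and reuse that cluster's precomputed KV cache.
    clusters = list(representatives)
    reps = np.array([embeddings[representatives[c]] for c in clusters])
    nearest = int(np.argmin(np.linalg.norm(reps - query_embedding, axis=1)))
    return kv_store[clusters[nearest]]
```

A query whose subgraph embeds near `[0.88, 0.12]` would be served the cached KV of the first group's representative, skipping prefill over the subgraph prompt entirely.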
Qiuyu Zhu
Nanyang Technological University
Liang Zhang
Hong Kong University of Science and Technology (Guangzhou)
Qianxiong Xu
Nanyang Technological University
Cheng Long
Nanyang Technological University
databases, machine learning, data mining
Jie Zhang
Nanyang Technological University