HedraRAG: Coordinating LLM Generation and Database Retrieval in Heterogeneous RAG Serving

📅 2025-07-12
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address low execution efficiency in heterogeneous RAG services caused by complex multi-stage workflows and diverse request patterns, this paper proposes a graph-abstraction-based runtime system. It models RAG workflows as dynamic directed graphs and applies node splitting, topological reordering, edge reconstruction, and dependency optimization to jointly exploit stage-level parallelism, intra-request similarity, and inter-request load skew. Coupled with a hybrid CPU-GPU pipelining architecture and adaptive scheduling, the system achieves efficient resource utilization. Experiments demonstrate 1.5–5× speedup over state-of-the-art frameworks across diverse RAG workflows, significantly reducing end-to-end latency while improving throughput and GPU utilization. The key contribution is the first introduction of dynamic graph transformation into RAG runtimes, enabling cross-request subgraph wavefront co-optimization.
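The summary's core idea, modeling each RAG request as a directed graph of retrieval and generation stages and rewriting that graph at runtime, can be illustrated with a minimal sketch. The `WorkflowGraph` class, its stage names, and the `split_node` transformation below are illustrative assumptions, not the paper's actual API; splitting a retrieval node into parallel shard lookups is one simplified instance of the node-splitting transformation the summary mentions.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    kind: str  # illustrative stage types: "retrieve" or "generate"

@dataclass
class WorkflowGraph:
    nodes: dict = field(default_factory=dict)
    edges: list = field(default_factory=list)  # (src, dst) dependency pairs

    def add_node(self, node):
        self.nodes[node.name] = node

    def add_edge(self, src, dst):
        self.edges.append((src, dst))

    def split_node(self, name, parts):
        """Replace one node with `parts` parallel sub-nodes, rewiring every
        incoming and outgoing edge to each sub-node (a simplified stand-in
        for the node-splitting transformation described in the paper)."""
        old = self.nodes.pop(name)
        subs = [Node(f"{name}.{i}", old.kind) for i in range(parts)]
        for s in subs:
            self.nodes[s.name] = s
        new_edges = []
        for src, dst in self.edges:
            if src == name:
                new_edges.extend((s.name, dst) for s in subs)
            elif dst == name:
                new_edges.extend((src, s.name) for s in subs)
            else:
                new_edges.append((src, dst))
        self.edges = new_edges
        return subs

# A two-stage RAG request: retrieve -> generate
g = WorkflowGraph()
g.add_node(Node("retrieve", "retrieve"))
g.add_node(Node("generate", "generate"))
g.add_edge("retrieve", "generate")

# Split the retrieval stage into 3 parallel shard lookups,
# exposing stage-level parallelism to the scheduler.
g.split_node("retrieve", 3)
```

After the split, the three shard lookups have no edges among themselves, so a scheduler is free to run them concurrently; the other transformations the paper names (reordering, edge addition, dependency rewiring) would be further rewrites over the same edge list.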

📝 Abstract
This paper addresses emerging system-level challenges in heterogeneous retrieval-augmented generation (RAG) serving, where complex multi-stage workflows and diverse request patterns complicate efficient execution. We present HedraRAG, a runtime system built on a graph-based abstraction that exposes optimization opportunities across stage-level parallelism, intra-request similarity, and inter-request skewness. These opportunities are realized through dynamic graph transformations, such as node splitting, reordering, edge addition, and dependency rewiring, applied to wavefronts of subgraphs spanning concurrent requests. The resulting execution plans are mapped onto hybrid CPU-GPU pipelines to improve resource utilization and reduce latency. Evaluations across a wide range of RAG workflows demonstrate speedups exceeding 1.5x and reaching up to 5x over existing frameworks, showcasing the effectiveness of coordinated generation and retrieval in serving environments.
Problem

Research questions and friction points this paper is trying to address.

Optimizing heterogeneous RAG workflows for efficiency
Managing diverse request patterns in RAG systems
Coordinating LLM generation with database retrieval
Innovation

Methods, ideas, or system contributions that make the work stand out.

Graph-based abstraction for heterogeneous RAG optimization
Dynamic graph transformations to enhance execution efficiency
Hybrid CPU-GPU pipelines for improved resource utilization
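The hybrid CPU-GPU pipelining idea above can be made concrete with a toy latency model. This is an idealized two-stage pipeline, assumed for illustration and not the paper's actual scheduler: CPU retrieval for request i+1 overlaps GPU generation for request i, so the GPU stalls only when its input is not yet ready.

```python
def pipelined_makespan(retrieve_ms, generate_ms):
    """Total latency of a two-stage CPU/GPU pipeline (idealized model).

    retrieve_ms[i] is the CPU retrieval time and generate_ms[i] the GPU
    generation time for request i; retrieval and generation of different
    requests may overlap.
    """
    t_retrieve = 0.0  # finish time of the CPU retrieval stage
    t_generate = 0.0  # finish time of the GPU generation stage
    for r, g in zip(retrieve_ms, generate_ms):
        t_retrieve += r                               # retrievals run serially on CPU
        t_generate = max(t_generate, t_retrieve) + g  # generation waits for its input
    return t_generate

# Four requests, 20 ms retrieval + 30 ms generation each
sequential = sum(20 + 30 for _ in range(4))     # 200 ms with no overlap
pipelined = pipelined_makespan([20] * 4, [30] * 4)  # 20 + 4 * 30 = 140 ms
```

In this model only the first retrieval sits on the critical path; every later retrieval hides behind a generation step, which is the utilization gain the paper attributes to its hybrid pipeline architecture.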
Zhengding Hu
University of California San Diego
Vibha Murthy
University of California San Diego
Zaifeng Pan
University of California San Diego
Wanlu Li
University of California San Diego
Xiaoyi Fang
RegAilator Inc
Yufei Ding
University of California San Diego
Yuke Wang
Rice University