Efficient Distributed Retrieval-Augmented Generation for Enhancing Language Model Performance

📅 2025-04-15

🤖 AI Summary

Small language models (SLMs) on edge devices suffer from limited capacity, and cloud-based retrieval-augmented generation (RAG) risks exposing users’ private documents. To address both problems, this paper proposes DRAGON, a distributed RAG framework that combines a cloud-side general knowledge base with device-side private documents without uploading the raw private data. Its core contributions are: (1) a distributed multi-document RAG architecture that runs parallel token generation processes on the cloud and the device; (2) Speculative Aggregation, a dual-side speculative algorithm that avoids frequent output synchronization between the cloud and the device; and (3) a scheduling algorithm that selects the aggregation side according to real-time network conditions. Evaluation on a real-world hardware testbed shows that DRAGON delivers up to 1.9× greater gains over a standalone SLM than centralized RAG does, along with a substantial reduction in per-token latency and negligible time-to-first-token (TTFT) overhead.

📝 Abstract

Small language models (SLMs) support efficient deployments on resource-constrained edge devices, but their limited capacity compromises inference performance. Retrieval-augmented generation (RAG) is a promising solution to enhance model performance by integrating external databases, without requiring intensive on-device model retraining. However, large-scale public databases and user-specific private contextual documents typically reside separately on the cloud and the device, while existing RAG implementations are primarily centralized. To bridge this gap, we propose DRAGON, a distributed RAG framework that enhances on-device SLMs with both general and personal knowledge without the risk of leaking document privacy. Specifically, DRAGON decomposes multi-document RAG into multiple parallel token generation processes performed independently and locally on the cloud and the device, and employs Speculative Aggregation, a newly designed dual-side speculative algorithm, to avoid frequent output synchronization between the cloud and the device. A new scheduling algorithm is further introduced to identify the optimal aggregation side based on real-time network conditions. Evaluations on a real-world hardware testbed demonstrate a significant performance improvement of DRAGON: up to 1.9x greater gains over a standalone SLM compared to centralized RAG, a substantial reduction in per-token latency, and negligible Time to First Token (TTFT) overhead.
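
The decomposition described in the abstract can be illustrated with a toy sketch. It is not the paper's algorithm: the mixture rule in `aggregate`, the equal weights, the greedy acceptance test in `verify_draft`, and the example distributions are all assumptions made for illustration. Each side computes a next-token distribution from its own documents, drafts a token locally, and keeps the draft only if it agrees with the token chosen from the aggregated distribution.

```python
from typing import Dict, List

# A next-token distribution: token -> probability (toy vocabulary).
Dist = Dict[str, float]


def aggregate(dists: List[Dist], weights: List[float]) -> Dist:
    """Weighted mixture of per-side distributions (hypothetical aggregation rule)."""
    total = sum(weights)
    mixed: Dist = {}
    for dist, weight in zip(dists, weights):
        for token, prob in dist.items():
            mixed[token] = mixed.get(token, 0.0) + (weight / total) * prob
    return mixed


def pick_token(dist: Dist) -> str:
    """Greedy decoding: return the most probable token."""
    return max(dist, key=dist.get)


def verify_draft(draft: str, aggregated: Dist) -> bool:
    """Accept a locally drafted token only if it matches the aggregated choice
    (a simplified, assumed acceptance rule)."""
    return draft == pick_token(aggregated)


# Toy distributions: the cloud conditions on public documents,
# the device on private ones. Only distributions are exchanged here,
# never the documents themselves.
cloud_dist = {"Paris": 0.6, "Lyon": 0.3, "Nice": 0.1}
device_dist = {"Paris": 0.2, "Lyon": 0.7, "Nice": 0.1}

mixed = aggregate([cloud_dist, device_dist], [1.0, 1.0])
cloud_draft = pick_token(cloud_dist)
device_draft = pick_token(device_dist)

cloud_accepted = verify_draft(cloud_draft, mixed)
device_accepted = verify_draft(device_draft, mixed)
```

In this toy run the device's draft agrees with the aggregated choice and the cloud's does not, so only the cloud side would need to discard its speculative work and resynchronize.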

Problem

Research questions and friction points this paper is trying to address.

Enhancing small language models (SLMs) with distributed retrieval-augmented generation (RAG).

Integrating cloud-based public and on-device private knowledge without privacy leaks.

Optimizing multi-document RAG via parallel token generation and speculative aggregation.

Innovation

Methods, ideas, or system contributions that make the work stand out.

Distributed RAG framework for on-device SLMs

Speculative Aggregation for reduced synchronization

Dynamic scheduling based on network conditions
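
As a rough illustration of network-aware scheduling, the sketch below picks the aggregation side with the lower estimated per-token time. The cost model is a deliberately simple assumption and not the paper's scheduler: the `LinkStats` type, the function names, and all timings are hypothetical, and each side is assumed to wait for the slower of its own decoding step and the remote step plus the one-way transfer delay.

```python
from dataclasses import dataclass


@dataclass
class LinkStats:
    """Measured one-way delays for one token's payload (hypothetical inputs)."""
    uplink_ms: float    # device -> cloud
    downlink_ms: float  # cloud -> device


def per_token_ms(local_ms: float, remote_ms: float, transfer_ms: float) -> float:
    """Assumed cost model: the aggregator waits for the slower of its own
    decoding step and the remote step plus the transfer delay."""
    return max(local_ms, remote_ms + transfer_ms)


def choose_aggregation_side(device_ms: float, cloud_ms: float, link: LinkStats) -> str:
    """Return 'device' or 'cloud', whichever has the lower estimated per-token time."""
    on_device = per_token_ms(device_ms, cloud_ms, link.downlink_ms)
    on_cloud = per_token_ms(cloud_ms, device_ms, link.uplink_ms)
    return "device" if on_device <= on_cloud else "cloud"


# Slow device, fast cloud, moderate link: aggregating on the device is cheaper.
side_a = choose_aggregation_side(40.0, 10.0, LinkStats(uplink_ms=30.0, downlink_ms=15.0))

# Slow downlink, fast uplink: aggregating on the cloud is cheaper.
side_b = choose_aggregation_side(20.0, 60.0, LinkStats(uplink_ms=5.0, downlink_ms=50.0))
```

Re-running the choice whenever fresh delay measurements arrive lets the aggregation side follow changing network conditions, which is the behavior the summary attributes to the scheduler.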
