🤖 AI Summary
Existing parametric retrieval-augmented generation (PRAG) methods employ a one-to-one document-to-LoRA mapping, resulting in severe data sparsity per adapter and computationally expensive weight merging during inference. To address these limitations, we propose Poly-PRAG: instead of fixed mappings, it introduces a latent-variable routing mechanism that dynamically allocates knowledge from large-scale documents to a shared pool of LoRA experts within a multi-task learning framework, enabling sparse activation and efficient parametric encoding. Poly-PRAG unifies low-rank adaptation, offline indexing, and online retrieval to jointly mitigate data sparsity and inference bottlenecks. Evaluated on four knowledge-intensive NLP tasks, it achieves state-of-the-art performance while reducing training data requirements by 37% and inference latency by 52%.
📝 Abstract
Parametric Retrieval-Augmented Generation (PRAG) is a novel RAG paradigm that integrates external knowledge directly into a Large Language Model (LLM) by parameterizing documents as LoRA adapters, reducing inference costs compared to traditional RAG approaches. However, current PRAG approaches adopt a **one-to-one** document encoding scheme, training a dedicated LoRA adapter for each individual document. This scheme introduces two major limitations. First, it leads to data scarcity, as the training data available for each individual LoRA adapter is limited. Second, it incurs high overhead during inference, requiring the LLM weights to be merged with a new LoRA adapter for every candidate passage, which is computationally inefficient. To overcome these challenges, we propose a novel paradigm for encoding passages in PRAG that relies on a latent routing encoding process (Poly-PRAG). During offline encoding, we treat the encoding of a set of documents as a multi-task learning process, where each passage is assigned a unique task identifier. Through a routing function, a small set of latent LoRA adapters encodes the entire passage space. During online inference, this routing function selectively activates a subset of latent experts based on the input query. We conduct comprehensive evaluations of Poly-PRAG across multiple knowledge-intensive NLP tasks. Our extensive experiments demonstrate the effectiveness of the proposed method, achieving state-of-the-art results on four distinct datasets.
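The latent routing idea described above, a shared pool of low-rank experts mixed per task identifier by a learned router, can be sketched in a few lines of PyTorch. This is a minimal illustrative sketch, not the paper's implementation: the class name `LatentLoRAPool`, the top-k sparse activation, and all hyperparameters (`rank`, `n_experts`, `top_k`) are assumptions for exposition.

```python
# Hypothetical sketch of latent-expert LoRA routing in the spirit of Poly-PRAG.
# All names and hyperparameters here are illustrative, not from the paper.
import torch
import torch.nn as nn


class LatentLoRAPool(nn.Module):
    """A shared pool of low-rank adapters combined per task via learned routing."""

    def __init__(self, d_in, d_out, rank=8, n_experts=4, n_tasks=100, top_k=2):
        super().__init__()
        # Latent experts: each expert is a low-rank (A, B) factor pair.
        self.A = nn.Parameter(torch.randn(n_experts, d_in, rank) * 0.01)
        self.B = nn.Parameter(torch.zeros(n_experts, rank, d_out))  # zero-init, as in standard LoRA
        # Routing logits: one row per task (i.e., per passage) identifier.
        self.router = nn.Parameter(torch.zeros(n_tasks, n_experts))
        self.top_k = top_k

    def forward(self, x, task_id):
        # Sparsely activate the top-k experts for this task identifier.
        logits = self.router[task_id]
        top_vals, top_idx = logits.topk(self.top_k)
        weights = torch.softmax(top_vals, dim=-1)
        # Mix the selected experts' low-rank factors into a single adapter.
        A = (weights[:, None, None] * self.A[top_idx]).sum(0)  # (d_in, rank)
        B = (weights[:, None, None] * self.B[top_idx]).sum(0)  # (rank, d_out)
        # Low-rank delta to be added to the frozen base layer's output.
        return x @ A @ B
```

Because every passage is encoded by a weighted combination of the same small expert pool rather than its own adapter, the experts are trained on data pooled across documents (mitigating per-adapter data scarcity), and online inference only needs to mix a few cached low-rank factors instead of merging a fresh adapter per candidate passage.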