🤖 AI Summary
Retrieval-augmented generation (RAG) models exhibit a nonlinear relationship between query–context overlap and training efficiency: test perplexity drops significantly and convergence accelerates only once overlap exceeds a critical threshold.
Method: We propose a synthetic context generation framework based on query rewriting to precisely control query–context overlap, coupled with a perplexity-driven overlap optimization strategy.
Contribution/Results: Empirical evaluation demonstrates that our approach reduces pretraining time by approximately 40% while substantially lowering perplexity, without sacrificing downstream QA performance. This work provides the first systematic empirical validation of the threshold effect of overlap on RAG pretraining dynamics, and it establishes a reproducible, deployable optimization paradigm for efficient RAG pretraining, bridging theoretical insight with practical implementation.
📝 Abstract
Retrieval-augmented language models have demonstrated performance comparable to that of much larger models while requiring fewer computational resources. The effectiveness of these models depends crucially on the overlap between the query and the retrieved context, but the optimal degree of this overlap remains unexplored. In this paper, we systematically investigate how varying levels of query–context overlap affect model performance during both training and inference. Our experiments reveal that increased overlap initially has minimal effect but, above a critical threshold, substantially improves test-time perplexity and accelerates model learning. Building on these findings, we demonstrate that deliberately increasing overlap through synthetic context, generated by paraphrasing queries, can enhance data efficiency and reduce training time by approximately 40% without compromising performance. We validate our perplexity-based findings on question-answering tasks, confirming that the benefits of retrieval-augmented language modeling extend to practical applications. Our results provide empirical evidence of significant optimization potential for retrieval mechanisms in language model pretraining.
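The abstract does not specify how query–context overlap is measured. As a minimal illustration of the kind of metric involved, the sketch below computes token-level Jaccard overlap between a query and a retrieved (or synthetically paraphrased) context; the function name and the metric choice are assumptions for illustration, not the paper's definition.

```python
def token_overlap(query: str, context: str) -> float:
    """Jaccard overlap between the token sets of a query and a context.

    Returns a value in [0, 1]: 0 means no shared tokens, 1 means
    identical token sets. This is an illustrative stand-in for the
    (unspecified) overlap measure discussed in the paper.
    """
    q = set(query.lower().split())
    c = set(context.lower().split())
    if not q or not c:
        return 0.0
    return len(q & c) / len(q | c)


# A paraphrase-style context shares many tokens with the query,
# so its overlap score is high relative to an unrelated passage.
query = "what causes auroras in the night sky"
paraphrased = "auroras in the night sky are caused by charged solar particles"
unrelated = "the recipe requires flour sugar and two eggs"

print(token_overlap(query, paraphrased))  # higher overlap
print(token_overlap(query, unrelated))    # near-zero overlap
```

Under a framework like the one described, such a score could be used to filter or rank synthetic contexts so that training examples land above the critical overlap threshold.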