🤖 AI Summary
Retrieval-augmented generation (RAG) models exhibit a nonlinear relationship between query–context overlap and training efficiency: test perplexity drops significantly and convergence accelerates only once overlap exceeds a critical threshold.
Method: We propose a synthetic context generation framework based on query rewriting to precisely control query–context overlap, coupled with a perplexity-driven overlap optimization strategy.
Contribution/Results: Empirical evaluation demonstrates that our approach reduces pretraining time by approximately 40% while substantially lowering perplexity, without sacrificing downstream QA performance. This work provides the first systematic empirical validation of the threshold effect of overlap on RAG pretraining dynamics, and it establishes a reproducible, deployable optimization paradigm for efficient RAG pretraining, bridging theoretical insight with practical implementation.
📝 Abstract
Retrieval-augmented language models have demonstrated performance comparable to that of much larger models while requiring fewer computational resources. The effectiveness of these models depends crucially on the overlap between the query and the retrieved context, but the optimal degree of this overlap remains unexplored. In this paper, we systematically investigate how varying levels of query–context overlap affect model performance during both training and inference. Our experiments reveal that increased overlap initially has minimal effect but, above a critical threshold, substantially improves test-time perplexity and accelerates model learning. Building on these findings, we demonstrate that deliberately increasing overlap through synthetic context, generated by paraphrasing queries, can enhance data efficiency and reduce training time by approximately 40% without compromising performance. We validate our perplexity-based findings on question-answering tasks, confirming that the benefits of retrieval-augmented language modeling extend to practical applications. Our results provide empirical evidence of significant optimization potential for retrieval mechanisms in language model pretraining.
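The abstract does not specify how query–context overlap is measured. As a minimal illustration of the kind of metric involved, the sketch below computes token-level Jaccard overlap between a query and a retrieved (or synthetically paraphrased) context; the function name and the metric choice are assumptions for illustration, not the paper's definition.

```python
def token_overlap(query: str, context: str) -> float:
    """Jaccard overlap between the token sets of a query and a context.

    Returns a value in [0, 1]: 0 means no shared tokens, 1 means
    identical token sets. This is an illustrative stand-in for the
    (unspecified) overlap measure discussed in the paper.
    """
    q = set(query.lower().split())
    c = set(context.lower().split())
    if not q or not c:
        return 0.0
    return len(q & c) / len(q | c)


# A paraphrase-style context shares many tokens with the query,
# so its overlap score is high relative to an unrelated passage.
query = "what causes auroras in the night sky"
paraphrased = "auroras in the night sky are caused by charged solar particles"
unrelated = "the recipe requires flour sugar and two eggs"

print(token_overlap(query, paraphrased))  # higher overlap
print(token_overlap(query, unrelated))    # near-zero overlap
```

Under a framework like the one described, such a score could be used to filter or rank synthetic contexts so that training examples land above the critical overlap threshold.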