CL-RAG: Bridging the Gap in Retrieval-Augmented Generation with Curriculum Learning

📅 2025-05-15
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing RAG systems suffer from unstable end-to-end training and limited generalization due to high variance in retrieved document quality. Method: This paper introduces curriculum learning to RAG joint training for the first time, proposing a staged, progressive framework that constructs a multi-level difficulty evolution mechanism based on relevance and semantic complexity, enabling collaborative optimization of the retriever and generator. The approach requires no additional annotations, is compatible with mainstream RAG architectures, and incorporates open-domain question answering data augmentation. Contribution/Results: Evaluated on four standard open-domain QA benchmarks, our method consistently improves accuracy by 2–4%, significantly outperforming state-of-the-art RAG methods. These results empirically validate the effectiveness and generalizability of curriculum learning for joint retrieval-generation optimization.

📝 Abstract
Retrieval-Augmented Generation (RAG) is an effective method to enhance the capabilities of large language models (LLMs). Existing methods focus on optimizing the retriever or generator in the RAG system by directly utilizing the top-k retrieved documents. However, the effectiveness of these documents varies significantly across user queries, i.e., some documents provide valuable knowledge while others lack critical information entirely. This variance hinders the adaptation of the retriever and generator during training. Inspired by human cognitive learning, curriculum learning trains models using samples progressing from easy to difficult, thus enhancing their generalization ability, and we integrate this effective paradigm into the training of the RAG system. In this paper, we propose a multi-stage Curriculum Learning based RAG system training framework, named CL-RAG. We first construct training data with multiple difficulty levels for the retriever and generator separately through sample evolution. Then, we train the model in stages based on the curriculum learning approach, thereby optimizing the overall performance and generalization of the RAG system more effectively. Our CL-RAG framework demonstrates consistent effectiveness across four open-domain QA datasets, achieving performance gains of 2% to 4% over multiple advanced methods.
Problem

Research questions and friction points this paper is trying to address.

Improving RAG system performance with curriculum learning
Addressing varying document effectiveness in retrieval-augmented generation
Enhancing retriever and generator adaptation through staged training
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-stage curriculum learning for RAG training
Sample evolution for difficulty-level data construction
Separate retriever and generator optimization stages
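The staged, easy-to-hard training idea above can be sketched in a few lines. This is a minimal illustrative sketch, not the paper's implementation: the `difficulty` scores, the `train_step` callback, and the stage count are all assumptions standing in for the paper's sample-evolution mechanism.

```python
def curriculum_stages(samples, num_stages=3):
    """Bucket samples into difficulty-ordered stages (easy -> hard)."""
    ordered = sorted(samples, key=lambda s: s["difficulty"])
    size = -(-len(ordered) // num_stages)  # ceiling division
    return [ordered[i:i + size] for i in range(0, len(ordered), size)]

def train_curriculum(model_state, samples, train_step, num_stages=3):
    """Train stage by stage, presenting harder samples only in later stages."""
    for stage in curriculum_stages(samples, num_stages):
        for sample in stage:
            model_state = train_step(model_state, sample)
    return model_state
```

In CL-RAG this staging would be applied separately to the retriever and the generator, each with its own difficulty-leveled data; the sketch only shows the shared progressive-ordering skeleton.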
Shaohan Wang
University of Science and Technology of China, Hefei, China
L. Zhang
University of Science and Technology of China, Hefei, China
Zheren Fu
University of Science and Technology of China
Multi-modal Learning, Vision-Language Model, AI Security
Zhendong Mao
University of Science and Technology of China
CV, NLP