CL-RAG: Bridging the Gap in Retrieval-Augmented Generation with Curriculum Learning

📅 2025-05-15
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing RAG systems suffer from unstable end-to-end training and limited generalization due to high variance in retrieved document quality. Method: This paper introduces curriculum learning to RAG joint training for the first time, proposing a staged, progressive framework that constructs a multi-level difficulty evolution mechanism based on relevance and semantic complexity, enabling collaborative optimization of the retriever and generator. The approach requires no additional annotations, is compatible with mainstream RAG architectures, and incorporates open-domain question answering data augmentation. Contribution/Results: Evaluated on four standard open-domain QA benchmarks, our method consistently improves accuracy by 2–4%, significantly outperforming state-of-the-art RAG methods. These results empirically validate the effectiveness and generalizability of curriculum learning for joint retrieval-generation optimization.

📝 Abstract
Retrieval-Augmented Generation (RAG) is an effective method to enhance the capabilities of large language models (LLMs). Existing methods focus on optimizing the retriever or generator in the RAG system by directly utilizing the top-k retrieved documents. However, the effectiveness of these documents varies significantly across user queries, i.e., some documents provide valuable knowledge while others lack critical information entirely. This variance hinders the adaptation of the retriever and generator during training. Inspired by human cognitive learning, curriculum learning trains models using samples progressing from easy to difficult, thus enhancing their generalization ability, and we integrate this effective paradigm into the training of the RAG system. In this paper, we propose a multi-stage Curriculum Learning based RAG system training framework, named CL-RAG. We first construct training data with multiple difficulty levels for the retriever and generator separately through sample evolution. Then, we train the model in stages based on the curriculum learning approach, thereby optimizing the overall performance and generalization of the RAG system more effectively. Our CL-RAG framework demonstrates consistent effectiveness across four open-domain QA datasets, achieving performance gains of 2% to 4% over multiple advanced methods.
Problem

Research questions and friction points this paper is trying to address.

Improving RAG system performance with curriculum learning
Addressing varying document effectiveness in retrieval-augmented generation
Enhancing retriever and generator adaptation through staged training
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-stage curriculum learning for RAG training
Sample evolution for difficulty-level data construction
Separate retriever and generator optimization stages
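The staged, easy-to-hard training idea above can be sketched in a few lines. This is a minimal illustrative sketch, not the paper's implementation: the `difficulty` scores, the `train_step` callback, and the stage count are all assumptions standing in for the paper's sample-evolution mechanism.

```python
def curriculum_stages(samples, num_stages=3):
    """Bucket samples into difficulty-ordered stages (easy -> hard)."""
    ordered = sorted(samples, key=lambda s: s["difficulty"])
    size = -(-len(ordered) // num_stages)  # ceiling division
    return [ordered[i:i + size] for i in range(0, len(ordered), size)]

def train_curriculum(model_state, samples, train_step, num_stages=3):
    """Train stage by stage, presenting harder samples only in later stages."""
    for stage in curriculum_stages(samples, num_stages):
        for sample in stage:
            model_state = train_step(model_state, sample)
    return model_state
```

In CL-RAG this staging would be applied separately to the retriever and the generator, each with its own difficulty-leveled data; the sketch only shows the shared progressive-ordering skeleton.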
Shaohan Wang
University of Science and Technology of China, Hefei, China
L. Zhang
University of Science and Technology of China, Hefei, China
Zheren Fu
University of Science and Technology of China
Multi-modal Learning, Vision-Language Model, AI Security
Zhendong Mao
University of Science and Technology of China
CV, NLP