🤖 AI Summary
Existing scientific QA datasets predominantly focus on simple, isolated questions and fail to reflect the complex, multifaceted information needs of real-world researchers.
Method: We propose ScIRGen, a novel framework comprising (1) a paper-driven question generation model grounded in cognitive taxonomy; (2) an LLM perplexity-shift mechanism for automated selection of high-difficulty, high-fidelity answers; and (3) a hybrid pipeline integrating academic information extraction with RAG-based data synthesis.
Contribution/Results: We introduce ScIRGen-Geo, the first high-quality, domain-specific scientific QA dataset for geoscience (61K samples). Empirical evaluation reveals critical limitations of state-of-the-art models in scientific multi-hop reasoning and cross-document integration, thereby exposing key bottlenecks in current RAG systems. ScIRGen-Geo establishes a more realistic, rigorous benchmark and methodological paradigm for scientific RAG research.
📝 Abstract
Scientific researchers need intensive information about datasets to effectively evaluate and develop theories and methodologies. The information needs regarding datasets are implicitly embedded in particular research tasks, rather than explicitly expressed in search queries. However, existing scientific retrieval and question-answering (QA) datasets typically address straightforward questions, which do not align with the distribution of real-world research inquiries. To bridge this gap, we developed ScIRGen, a dataset generation framework for scientific QA & retrieval that more accurately reflects the information needs of professional science researchers, and used it to create a large-scale scientific retrieval-augmented generation (RAG) dataset with realistic queries, datasets, and papers. Technically, we designed a dataset-oriented information extraction method that leverages academic papers to augment the dataset representation. We then proposed a question generation framework that employs cognitive taxonomy to ensure the quality of synthesized questions. We also designed a method to automatically filter synthetic answers based on the perplexity shift of LLMs, which is highly aligned with human judgments of answer validity. Collectively, these methodologies culminated in the creation of the 61k-sample QA dataset, ScIRGen-Geo. We benchmarked representative methods on ScIRGen-Geo for their question-answering and retrieval capabilities, finding that current methods still struggle to reason over complex questions. This work advances the development of more sophisticated tools to support the intricate information needs of the scientific community.
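The perplexity-shift filtering idea can be sketched in a few lines. A minimal, illustrative version (the function names, the relative-shift formula, and the `threshold` value are assumptions for exposition, not the paper's exact method): compute the perplexity of a candidate answer's tokens with and without the supporting context in the prompt; if conditioning on the context lowers perplexity by a large enough margin, the answer is likely grounded in that context and is kept. In practice the per-token log-probabilities would come from an LLM; here they are plain inputs.

```python
import math


def perplexity(token_logprobs):
    """Perplexity from per-token natural-log probabilities: exp(mean NLL)."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def perplexity_shift(lp_without_ctx, lp_with_ctx):
    """Relative drop in answer perplexity once the source context is supplied.

    A large positive shift suggests the answer depends on (is grounded in)
    the context rather than being generic LLM output.
    """
    p0 = perplexity(lp_without_ctx)  # answer scored without the context
    p1 = perplexity(lp_with_ctx)     # same answer scored with the context
    return (p0 - p1) / p0


def keep_answer(lp_without_ctx, lp_with_ctx, threshold=0.3):
    """Filter rule: keep the synthetic answer if the shift clears a threshold.

    The 0.3 threshold is a placeholder; a real pipeline would calibrate it
    against human validity judgments.
    """
    return perplexity_shift(lp_without_ctx, lp_with_ctx) >= threshold
```

For example, an answer whose tokens average log-prob -2.0 without context but -0.5 with context shows a large shift and is kept, while an answer whose scores barely move is discarded as ungrounded.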