ScIRGen: Synthesize Realistic and Large-Scale RAG Dataset for Scientific Research

2025-06-09
Citations: 0
Influential: 0
AI Summary
Existing scientific QA datasets predominantly focus on simple, isolated questions and fail to reflect the complex, multifaceted information needs of real-world researchers. Method: We propose ScIRGen, a novel framework comprising (1) a paper-driven question generation model grounded in cognitive taxonomy; (2) an LLM perplexity-shift mechanism for automated selection of high-difficulty, high-fidelity answers; and (3) a hybrid pipeline integrating academic information extraction with RAG-based data synthesis. Contribution/Results: We introduce ScIRGen-Geo, the first high-quality, domain-specific scientific QA dataset for geoscience (61K samples). Empirical evaluation reveals critical limitations of state-of-the-art models in scientific multi-hop reasoning and cross-document integration, thereby exposing key bottlenecks in current RAG systems. ScIRGen-Geo establishes a more realistic, rigorous benchmark and methodological paradigm for scientific RAG research.

๐Ÿ“ Abstract
Scientific researchers need detailed information about datasets to effectively evaluate and develop theories and methodologies. These information needs are implicitly embedded in particular research tasks rather than explicitly expressed in search queries. However, existing scientific retrieval and question-answering (QA) datasets typically address straightforward questions, which do not align with the distribution of real-world research inquiries. To bridge this gap, we developed ScIRGen, a dataset generation framework for scientific QA and retrieval that more accurately reflects the information needs of professional science researchers, and used it to create a large-scale scientific retrieval-augmented generation (RAG) dataset with realistic queries, datasets, and papers. Technically, we designed a dataset-oriented information extraction method that leverages academic papers to augment the dataset representation. We then proposed a question generation framework that employs cognitive taxonomy to ensure the quality of synthesized questions. We also designed a method to automatically filter synthetic answers based on the perplexity shift of LLMs, which is highly aligned with human judgment of answers' validity. Collectively, these methodologies culminated in the creation of the 61k QA dataset, ScIRGen-Geo. We benchmarked representative methods on the ScIRGen-Geo dataset for their question-answering and retrieval capabilities, finding that current methods still struggle to reason over complex questions. This work advances the development of more sophisticated tools to support the intricate information needs of the scientific community.
Problem

Research questions and friction points this paper is trying to address.

Generates realistic scientific QA datasets for research needs
Improves dataset representation using academic paper extraction
Enhances QA quality via cognitive taxonomy and LLM filtering
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dataset-oriented extraction from academic papers
Cognitive taxonomy for question generation
Perplexity shift filters synthetic answers
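
The summary does not spell out how the perplexity shift is computed, so the following is only a minimal sketch of one plausible reading: score a synthetic answer's tokens with an LLM twice, once without and once with the source passage in the prompt, and keep the answer when the passage makes it clearly more predictable. The function names, the direction of the comparison, and the min_shift threshold are all assumptions made for illustration, not the authors' method. The token log-probabilities are taken as plain inputs so the sketch runs without any model.

```python
import math
from typing import Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Perplexity of a token sequence: exp of the negative mean log-probability."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def perplexity_shift(
    logprobs_without_context: Sequence[float],
    logprobs_with_context: Sequence[float],
) -> float:
    """Drop in answer perplexity once the source passage is shown to the model.

    Assumption: a positive value means the passage supports the answer.
    """
    return perplexity(logprobs_without_context) - perplexity(logprobs_with_context)


def keep_answer(
    logprobs_without_context: Sequence[float],
    logprobs_with_context: Sequence[float],
    min_shift: float = 1.0,  # hypothetical threshold, not taken from the paper
) -> bool:
    """Keep a synthetic answer only if the shift reaches the threshold."""
    shift = perplexity_shift(logprobs_without_context, logprobs_with_context)
    return shift >= min_shift


if __name__ == "__main__":
    # Toy log-probabilities for the same two-token answer under both prompts.
    grounded = keep_answer([-2.0, -2.0], [-0.5, -0.5])
    ungrounded = keep_answer([-1.0, -1.0], [-1.0, -1.0])
    print(grounded, ungrounded)
```

In practice the log-probabilities would come from whichever LLM is used for scoring, and the threshold would need to be calibrated against human judgments of answer validity, which is the alignment the abstract reports.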
Similar Papers
Junyong Lin
The Hong Kong University of Science and Technology (Guangzhou), Guangzhou, Guangdong, China
Lu Dai
Hong Kong University of Science and Technology
Ruiqian Han
The Hong Kong University of Science and Technology (Guangzhou), Guangzhou, Guangdong, China
Yijie Sui
Institute of Tibetan Plateau Research, Chinese Academy of Sciences, Beijing, China
Ruilin Wang
Lanzhou University, Lanzhou, Gansu, China
Xingliang Sun
Lanzhou University, Lanzhou, Gansu, China
Qinglin Wu
Institute of Tibetan Plateau Research, Chinese Academy of Sciences, Beijing, China
Min Feng
Institute of Tibetan Plateau Research, Chinese Academy of Sciences, Beijing, China; College of Resources and Environment, University of Chinese Academy of Sciences, Beijing, China
Hao Liu
The Hong Kong University of Science and Technology (Guangzhou), Guangzhou, Guangdong, China; The Hong Kong University of Science and Technology, Hong Kong SAR, Hong Kong
Hui Xiong
Senior Scientist, Candela Corporation
Ultrafast dynamics, atomic molecular physics, free electron laser