🤖 AI Summary
Existing scientific QA datasets predominantly focus on simple, isolated questions and fail to reflect the complex, multifaceted information needs of real-world researchers.
Method: We propose ScIRGen, a novel framework comprising (1) a paper-driven question generation model grounded in cognitive taxonomy; (2) an LLM perplexity-shift mechanism for automated selection of high-difficulty, high-fidelity answers; and (3) a hybrid pipeline integrating academic information extraction with RAG-based data synthesis.
Contribution/Results: We introduce ScIRGen-Geo, the first high-quality, domain-specific scientific QA dataset for geoscience (61K samples). Empirical evaluation reveals critical limitations of state-of-the-art models in scientific multi-hop reasoning and cross-document integration, thereby exposing key bottlenecks in current RAG systems. ScIRGen-Geo establishes a more realistic, rigorous benchmark and methodological paradigm for scientific RAG research.
📝 Abstract
Scientific researchers need intensive information about datasets to effectively evaluate and develop theories and methodologies. The information needs regarding datasets are implicitly embedded in particular research tasks, rather than explicitly expressed in search queries. However, existing scientific retrieval and question-answering (QA) datasets typically address straightforward questions, which do not align with the distribution of real-world research inquiries. To bridge this gap, we developed ScIRGen, a dataset generation framework for scientific QA & retrieval that more accurately reflects the information needs of professional science researchers, and used it to create a large-scale scientific retrieval-augmented generation (RAG) dataset with realistic queries, datasets, and papers. Technically, we designed a dataset-oriented information extraction method that leverages academic papers to augment the dataset representation. We then proposed a question generation framework that employs cognitive taxonomy to ensure the quality of synthesized questions. We also designed a method to automatically filter synthetic answers based on the perplexity shift of LLMs, which is highly aligned with human judgments of answer validity. Collectively, these methodologies culminated in the creation of the 61k-sample QA dataset, ScIRGen-Geo. We benchmarked representative methods on ScIRGen-Geo for their question-answering and retrieval capabilities, finding that current methods still struggle to reason over complex questions. This work advances the development of more sophisticated tools to support the intricate information needs of the scientific community.
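The perplexity-shift filtering idea can be sketched in a few lines. A minimal, illustrative version (the function names, the relative-shift formula, and the `threshold` value are assumptions for exposition, not the paper's exact method): compute the perplexity of a candidate answer's tokens with and without the supporting context in the prompt; if conditioning on the context lowers perplexity by a large enough margin, the answer is likely grounded in that context and is kept. In practice the per-token log-probabilities would come from an LLM; here they are plain inputs.

```python
import math


def perplexity(token_logprobs):
    """Perplexity from per-token natural-log probabilities: exp(mean NLL)."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def perplexity_shift(lp_without_ctx, lp_with_ctx):
    """Relative drop in answer perplexity once the source context is supplied.

    A large positive shift suggests the answer depends on (is grounded in)
    the context rather than being generic LLM output.
    """
    p0 = perplexity(lp_without_ctx)  # answer scored without the context
    p1 = perplexity(lp_with_ctx)     # same answer scored with the context
    return (p0 - p1) / p0


def keep_answer(lp_without_ctx, lp_with_ctx, threshold=0.3):
    """Filter rule: keep the synthetic answer if the shift clears a threshold.

    The 0.3 threshold is a placeholder; a real pipeline would calibrate it
    against human validity judgments.
    """
    return perplexity_shift(lp_without_ctx, lp_with_ctx) >= threshold
```

For example, an answer whose tokens average log-prob -2.0 without context but -0.5 with context shows a large shift and is kept, while an answer whose scores barely move is discarded as ungrounded.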