ArxEval: Evaluating Retrieval and Generation in Language Models for Scientific Literature

📅 2025-01-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses the pervasive hallucination problem, such as fabricated papers and spurious citations, in large language models (LLMs) processing scientific literature. We introduce ArxEval, the first fine-grained evaluation framework tailored to arXiv scientific publications. Methodologically, we propose a dual-task paradigm (Jumbled Titles and Mixed Titles) that integrates controlled title perturbation, chain-of-evidence fact-checking, arXiv metadata anchoring, and human-in-the-loop annotation to enable end-to-end automated assessment. We conduct the first reproducible, cross-model evaluation of 15 state-of-the-art LLMs, revealing an alarmingly high scientific citation hallucination rate of 68%. Key contributions include: (1) the first open-source, extensible benchmark for scientific credibility; (2) a systematic empirical characterization of hallucination patterns in scientific contexts; and (3) a standardized evaluation pipeline with rigorously defined quantitative metrics.

📝 Abstract
Language Models (LMs) now play an increasingly large role in information generation and synthesis, so the representation of scientific knowledge in these systems needs to be highly accurate. A prime challenge is hallucination: generating apparently plausible but false information, including invented citations and nonexistent research papers. This kind of inaccuracy is dangerous in domains that demand high factual correctness, such as academia and education. This work presents a pipeline for evaluating how frequently language models hallucinate when generating responses about the scientific literature. We propose ArxEval, an evaluation pipeline with two tasks using arXiv as a repository: Jumbled Titles and Mixed Titles. Our evaluation covers fifteen widely used language models and provides comparative insights into their reliability in handling scientific literature.
Problem

Research questions and friction points this paper is trying to address.

Language Models
Scientific Information Accuracy
ArXiv Database
Innovation

Methods, ideas, or system contributions that make the work stand out.

ArxEval
Error Rate Quantification
Language Model Reliability
Aarush Sinha
University of Copenhagen
Natural Language Processing, Information Retrieval, Machine Learning, Multimodality
Viraj Virk
School of Computer Science and Engineering, Vellore Institute of Technology, Chennai - 600127, India
Dipshikha Chakraborty
School of Computer Science and Engineering, Vellore Institute of Technology, Chennai - 600127, India
P. S. Sreeja
School of Computer Science and Engineering, Vellore Institute of Technology, Chennai - 600127, India