🤖 AI Summary
This study addresses the pervasive hallucination problem in large language models (LLMs) when they process scientific literature, such as fabricated papers and spurious citations. We introduce ArxEval, a fine-grained evaluation framework tailored to arXiv scientific publications. Methodologically, we propose a dual-task paradigm (Jumbled Titles and Mixed Titles) that integrates controlled title perturbation, chain-of-evidence fact-checking, arXiv metadata anchoring, and human-in-the-loop annotation to enable end-to-end automated assessment. We conduct a reproducible, cross-model evaluation of 15 state-of-the-art LLMs, revealing an alarmingly high scientific citation hallucination rate of 68%. Key contributions include: (1) an open-source, extensible benchmark for scientific credibility; (2) a systematic empirical characterization of hallucination patterns in scientific contexts; and (3) a standardized evaluation pipeline with rigorously defined quantitative metrics.
📝 Abstract
Language Models (LMs) now play an increasingly large role in information generation and synthesis, so the scientific knowledge these systems represent needs to be highly accurate. A central challenge is hallucination: the generation of apparently plausible but false information, including invented citations and nonexistent research papers. This kind of inaccuracy is dangerous in any domain that demands high factual correctness, such as academia and education. This work presents a pipeline for evaluating how frequently language models hallucinate when generating responses about scientific literature. We propose ArxEval, an evaluation pipeline with two tasks that use arXiv as a repository: Jumbled Titles and Mixed Titles. Our evaluation covers fifteen widely used language models and provides comparative insights into their reliability in handling scientific literature.
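To make the task setup concrete, the sketch below illustrates one plausible form of the title-perturbation idea behind the two tasks, together with a simple hallucination-rate metric. This is a minimal, hypothetical reconstruction: the exact perturbation rules, judging procedure, and metrics used by ArxEval are not specified in the text above, and the function names here (`jumble_title`, `hallucination_rate`) are illustrative, not from the paper.

```python
import random

def jumble_title(title: str, seed: int = 0) -> str:
    """Shuffle the word order of a paper title.

    Hypothetical illustration of a "Jumbled Titles"-style
    perturbation; ArxEval's actual construction may differ.
    """
    words = title.split()
    rng = random.Random(seed)
    rng.shuffle(words)
    return " ".join(words)

def hallucination_rate(judgements: list[bool]) -> float:
    """Fraction of responses judged to contain fabricated content
    (e.g., an invented paper or citation). `judgements` would come
    from metadata checks or human annotation."""
    if not judgements:
        return 0.0
    return sum(judgements) / len(judgements)

# Usage with a well-known (real) paper title as the input:
perturbed = jumble_title("Attention Is All You Need", seed=1)
# The model under test would then be prompted about `perturbed`,
# and each response judged against arXiv metadata.
```

In a Mixed Titles variant, one could analogously splice words from two different real titles, probing whether the model confidently "recognizes" a paper that does not exist.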