🤖 AI Summary
This study addresses the pervasive hallucination problem in large language models (LLMs) when they process scientific literature, such as fabricated papers and spurious citations. We introduce ArxEval, a fine-grained evaluation framework tailored to arXiv scientific publications. Methodologically, we propose a dual-task paradigm (Jumbled Titles and Mixed Titles) that integrates controlled title perturbation, chain-of-evidence fact-checking, arXiv metadata anchoring, and human-in-the-loop annotation to enable end-to-end automated assessment. We conduct a reproducible, cross-model evaluation of 15 state-of-the-art LLMs, revealing an alarmingly high scientific citation hallucination rate of 68%. Key contributions include: (1) an open-source, extensible benchmark for scientific credibility; (2) a systematic empirical characterization of hallucination patterns in scientific contexts; and (3) a standardized evaluation pipeline with rigorously defined quantitative metrics.
📝 Abstract
Language Models (LMs) now play an increasingly large role in information generation and synthesis, so the scientific knowledge these systems represent needs to be highly accurate. A central challenge is hallucination: the generation of apparently plausible but false information, including invented citations and nonexistent research papers. This kind of inaccuracy is dangerous in any domain that demands high factual correctness, such as academia and education. This work presents a pipeline for evaluating how frequently language models hallucinate when generating responses about scientific literature. We propose ArxEval, an evaluation pipeline with two tasks that use arXiv as a repository: Jumbled Titles and Mixed Titles. Our evaluation covers fifteen widely used language models and provides comparative insights into their reliability in handling scientific literature.
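To make the task setup concrete, the sketch below illustrates one plausible form of the title-perturbation idea behind the two tasks, together with a simple hallucination-rate metric. This is a minimal, hypothetical reconstruction: the exact perturbation rules, judging procedure, and metrics used by ArxEval are not specified in the text above, and the function names here (`jumble_title`, `hallucination_rate`) are illustrative, not from the paper.

```python
import random

def jumble_title(title: str, seed: int = 0) -> str:
    """Shuffle the word order of a paper title.

    Hypothetical illustration of a "Jumbled Titles"-style
    perturbation; ArxEval's actual construction may differ.
    """
    words = title.split()
    rng = random.Random(seed)
    rng.shuffle(words)
    return " ".join(words)

def hallucination_rate(judgements: list[bool]) -> float:
    """Fraction of responses judged to contain fabricated content
    (e.g., an invented paper or citation). `judgements` would come
    from metadata checks or human annotation."""
    if not judgements:
        return 0.0
    return sum(judgements) / len(judgements)

# Usage with a well-known (real) paper title as the input:
perturbed = jumble_title("Attention Is All You Need", seed=1)
# The model under test would then be prompted about `perturbed`,
# and each response judged against arXiv metadata.
```

In a Mixed Titles variant, one could analogously splice words from two different real titles, probing whether the model confidently "recognizes" a paper that does not exist.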