🤖 AI Summary
This study investigates how reliably large language models (LLMs) judge the novelty of scientific research questions (RQs). Because existing evaluations lack an objective reference point, the authors anchor assessments to research questions reconstructed from the cited background, gaps, and contributions of recent arXiv papers, and introduce RQ-Bench, a benchmark for systematic evaluation. Experiments combining standalone LLM judging, comparative LLM judging, and human expert evaluation reveal a pervasive “novelty mirage”: LLM judges consistently rate model-generated questions as highly novel, and this preference grows stronger in comparative evaluations, whereas domain experts prefer the author-anchored reference questions. The study also finds that many generated RQs are narrow or source-bound, a weakness that LLM judges often miss unless explicitly tested. These contradictory verdicts raise a serious concern about the reliability of LLMs as judges of scientific novelty at the level of research questions.
📝 Abstract
LLMs are increasingly used to generate and judge scientific ideas. This makes novelty evaluation a central problem. Full idea evaluation is difficult because it often requires judging a method, its feasibility, and its empirical promise. We therefore study a cleaner upstream object: the research question (RQ). RQ generation is a prerequisite for scientific ideation, and RQs can be compared against questions pursued in real papers. We introduce RQ-Bench, a benchmark built from recent arXiv papers. For each paper, we reconstruct author-anchored RQs from its cited background, gaps, and contributions. These RQs are not the only valid questions for the same background. They are author-anchored reference points for testing novelty judgments. We evaluate model-generated RQs with standalone LLM judging, comparative LLM judging, and human expert evaluation. LLM judges consistently rate model-generated RQs as highly novel, producing a novelty mirage; in comparative evaluations, this preference becomes even stronger. Domain experts, however, reach the opposite conclusion and prefer the author-anchored reference questions. We further find that many generated RQs are narrow or source-bound, a dimension that LLM judges often miss unless explicitly tested. Overall, the contradictory novelty evaluations between LLM judges and human experts raise a serious concern about the reliability of using LLMs to assess the scientific novelty of research questions.