🤖 AI Summary
This work proposes ForeSci, a temporally controlled benchmark for evaluating forward-looking AI research judgement. It comprises 500 tasks spanning four fast-moving AI domains and four decision families. Each task is paired with a knowledge base truncated at a cutoff date, so LLM agents must make their judgements from historical information only; post-cutoff papers are hidden during generation and used only for validation. Rather than posing arbitrary future-event predictions, the benchmark derives its tasks from pre-cutoff taxonomy branches and evidence signals. Experiments show that explicitly organizing evidence improves the traceability and factual support of judgements, though the gains depend strongly on the decision family. Diagnostics also reveal an evidence-decision decoupling: agents may cite relevant evidence while still forecasting the wrong research object. The evaluation covers native LLMs, Hybrid RAG, and three research-agent adaptations across four backbone models.
📝 Abstract
AI research often requires decisions before future evidence exists: which bottleneck to attack, which direction to pursue, or where a project should be positioned. We introduce ForeSci, a temporally controlled benchmark for evaluating whether LLM agents can make such forward-looking research judgements from historical evidence. ForeSci contains 500 tasks across four fast-moving AI domains and four decision families. Each task is paired with a cutoff-aligned offline knowledge base; post-cutoff papers are hidden during generation and used only for validation. To avoid random future-event prediction, tasks are derived from pre-cutoff taxonomy branches and evidence signals, and answer-generation backbones are selected to precede the task cutoffs. We evaluate native LLMs, Hybrid RAG, and three research-agent adaptations across four backbones. Results show that explicit evidence organization improves traceability and factual support, but gains depend strongly on the decision family. Diagnostics reveal a recurring evidence-decision decoupling: agents may cite relevant evidence while forecasting the wrong research object. ForeSci turns forward-looking AI research judgement into a controlled benchmark for evaluating research agents as decision-making systems.
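The temporal control described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Paper` record, the function names `split_by_cutoff` and `backbone_is_eligible`, and the example dates are all hypothetical. It also assumes that a backbone "precedes" a task cutoff when its training cutoff is strictly earlier than that cutoff.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Paper:
    # Hypothetical record; the benchmark's real schema is not given in the abstract.
    title: str
    published: date


def split_by_cutoff(papers, cutoff):
    """Split papers into (visible, hidden) relative to a task cutoff.

    visible: published on or before the cutoff, available to the agent.
    hidden:  published after the cutoff, held out for validation only.
    """
    visible = [p for p in papers if p.published <= cutoff]
    hidden = [p for p in papers if p.published > cutoff]
    return visible, hidden


def backbone_is_eligible(training_cutoff, task_cutoff):
    """Assumed rule: a backbone qualifies only if its training cutoff
    is strictly earlier than the task cutoff."""
    return training_cutoff < task_cutoff


# Illustrative data only; these are not papers or dates from the benchmark.
corpus = [
    Paper("Earlier method", date(2023, 3, 1)),
    Paper("Later method", date(2024, 9, 15)),
]
visible, hidden = split_by_cutoff(corpus, cutoff=date(2024, 1, 1))
```

Under these assumptions, the agent would only ever retrieve from `visible`, while `hidden` is consulted after generation to check whether the forecast matched what was later published.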