SciPredict: Can LLMs Predict the Outcomes of Scientific Experiments in Natural Sciences?

📅 2026-04-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study investigates whether large language models (LLMs) can accurately predict outcomes of natural science experiments and reliably support scientific decision-making. The authors introduce SciPredict, a novel benchmark comprising 405 recent empirical tasks spanning 33 subfields across physics, biology, and chemistry, to systematically evaluate LLMs’ predictive accuracy and calibration—complemented by a human expert control experiment. Results reveal that leading LLMs achieve only 14–26% accuracy; although some models slightly surpass the average human expert performance (~20%), they remain far from practically useful in research contexts. Moreover, LLMs consistently exhibit poor calibration, demonstrating markedly weaker self-assessment of prediction reliability compared to human experts. This work establishes the first dedicated benchmark and analytical framework for evaluating scientific reasoning capabilities in LLMs.

📝 Abstract
Accelerating scientific discovery requires identifying which experiments will yield the best outcomes before committing resources to costly physical validation. While existing benchmarks evaluate LLMs on scientific knowledge and reasoning, their ability to predict experimental outcomes - a task where AI could significantly exceed human capabilities - remains largely underexplored. We introduce SciPredict, a benchmark comprising 405 tasks derived from recent empirical studies in 33 specialized sub-fields of physics, biology, and chemistry. SciPredict addresses two critical questions: (a) can LLMs predict the outcomes of scientific experiments with sufficient accuracy, and (b) can such predictions be reliably used in the scientific research process? Evaluations reveal fundamental limitations on both fronts. Model accuracies are 14-26%, while human expert performance is ≈20%. Although some frontier models exceed human performance, model accuracy is still far below what would enable reliable experimental guidance. Even within this limited performance, models fail to distinguish reliable predictions from unreliable ones, achieving only ≈20% accuracy regardless of their confidence or whether they judge outcomes as predictable without physical experimentation. Human experts, in contrast, demonstrate strong calibration: their accuracy increases from ≈5% to ≈80% as they deem outcomes more predictable without conducting the experiment. SciPredict establishes a rigorous framework demonstrating that superhuman performance in experimental science requires not just better predictions, but better awareness of prediction reliability. For reproducibility, all our data and code are provided at https://github.com/scaleapi/scipredict.
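The calibration contrast in the abstract - human accuracy rising with self-rated predictability while LLM accuracy stays flat near 20% - boils down to measuring accuracy conditioned on confidence. A minimal sketch of that measurement, assuming hypothetical `(confidence, correct)` records rather than the paper's actual data format (which is in the linked repository):

```python
# Illustrative sketch (not from the paper): compute accuracy per
# confidence bin. For a well-calibrated predictor, accuracy should
# rise with confidence; the abstract reports LLM accuracy staying
# near 20% across bins, while human accuracy climbs from ~5% to ~80%.
from collections import defaultdict

def accuracy_by_confidence(records, n_bins=4):
    """records: iterable of (confidence in [0, 1], correct: bool).
    Returns {bin_index: accuracy} for bins that received predictions."""
    tally = defaultdict(lambda: [0, 0])  # bin index -> [n_correct, n_total]
    for confidence, correct in records:
        b = min(int(confidence * n_bins), n_bins - 1)  # clamp conf=1.0 into top bin
        tally[b][0] += int(correct)
        tally[b][1] += 1
    return {b: n_correct / n_total for b, (n_correct, n_total) in sorted(tally.items())}

# Hypothetical records: low-confidence predictions split 50/50,
# the one high-confidence prediction is correct.
records = [(0.1, True), (0.1, False), (0.9, True)]
print(accuracy_by_confidence(records))  # {0: 0.5, 3: 1.0}
```

A flat profile across bins (like the LLM behavior the abstract describes) means the model's stated confidence carries no information about which predictions to trust, which is what makes the predictions hard to use for experimental triage even when average accuracy is nonzero.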
Problem

Research questions and friction points this paper is trying to address.

scientific experiment prediction
large language models
outcome reliability
scientific discovery
experimental guidance
Innovation

Methods, ideas, or system contributions that make the work stand out.

SciPredict
experimental outcome prediction
large language models
prediction calibration
scientific discovery
Udari Madhushani Sehwag
Research Scientist, Scale AI
Agentic AI, Alignment, Scalable oversight, AI Safety, Multi-agent RL
Elaine Lau
McGill University, Mila, Scale AI
deep learning, reinforcement learning, natural language processing
Haniyeh Ehsani Oskouie
University of California, Los Angeles
Shayan Shabihi
University of Maryland
Erich Liang
Princeton
3D Computer Vision, Machine Learning, Mathematics
Andrea Toledo
Scale AI
Guillermo Mangialardi
Scale AI
Sergio Fonrouge
Scale AI
Ed-Yeremai Hernandez Cardona
Scale AI
Paula Vergara
Scale AI
Utkarsh Tyagi
University of Maryland, College Park
AI, Machine Learning, NLP, Multimodal
Chen Bo Calvin Zhang
Scale AI
Pavi Bhatter
Scale AI
Nicholas Johnson
Scale AI
Furong Huang
Associate Professor of Computer Science, University of Maryland
Trustworthy AI/ML, Reinforcement Learning, Generative AI
Ernesto Gabriel Hernandez Montoya
Scale AI
Bing Liu
Scale AI