🤖 AI Summary
Evaluating the scientific capabilities of LLMs is currently hindered by the high cost of wet-lab experimentation, which limits systematic assessment of their experimental design and result interpretation abilities in complex biological systems. To address this, we introduce SciGym, the first dry-lab benchmark for evaluating LLMs' scientific reasoning: SBML-encoded biological systems serve as executable dynamic models that support multi-turn, closed-loop experimental design and automated evaluation. Six frontier LLMs were evaluated on 137 small systems, and 350 systems were released in total. This enables the first quantitative assessment of LLMs' iterative scientific reasoning and reveals a critical bottleneck: while larger models generally perform better, all exhibit substantial weaknesses on high-complexity tasks, with performance degrading significantly as system complexity increases, underscoring fundamental limitations in current LLM scientific reasoning. This work establishes a scalable, low-cost, high-fidelity paradigm for evaluating AI scientific capability.
📝 Abstract
Designing experiments and interpreting results are core scientific competencies, particularly in biology, where researchers perturb complex systems to uncover the underlying mechanisms. Recent efforts to evaluate the scientific capabilities of large language models (LLMs) fail to test these competencies because wet-lab experimentation is prohibitively expensive in expertise, time, and equipment. We introduce SciGym, a first-in-class benchmark that assesses LLMs' iterative experiment design and analysis abilities in open-ended scientific discovery tasks. SciGym overcomes the challenge of wet-lab costs by running a dry lab of biological systems. These systems, encoded in the Systems Biology Markup Language (SBML), are efficient for generating simulated data, making them ideal testbeds for experimentation on realistically complex systems. We evaluated six frontier LLMs on 137 small systems and released a total of 350 systems. Our evaluation shows that while more capable models demonstrated superior performance, all models' performance declined significantly as system complexity increased, suggesting substantial room for improvement in the scientific capabilities of LLM agents.
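To make the dry-lab idea concrete, here is a minimal sketch of how an SBML-encoded system can generate simulated experimental data. It assumes the open-source libroadrunner simulator as one common choice for SBML integration; the file name `model.xml` and the parameter id `k1` are hypothetical placeholders. This illustrates the general workflow, not the benchmark's actual harness.

```python
# Minimal dry-lab sketch: load an SBML model, run a baseline time course,
# then "perform an experiment" by perturbing a parameter and re-simulating.
# Assumes libroadrunner (pip install libroadrunner); "model.xml" and "k1"
# are hypothetical placeholders, not taken from SciGym itself.
import roadrunner

rr = roadrunner.RoadRunner("model.xml")  # any SBML-encoded dynamic model

# Baseline: integrate the system's ODEs from t=0 to t=100 at 200 points.
baseline = rr.simulate(0, 100, 200)

# Perturbation experiment: reset the model, scale a rate constant, re-run.
rr.resetAll()
rr["k1"] = 10 * rr["k1"]  # hypothetical kinetic-parameter id
perturbed = rr.simulate(0, 100, 200)

print(baseline.colnames)  # time plus the observable species trajectories
```

In a closed evaluation loop of this kind, the agent would choose the perturbations, receive the simulated trajectories, and iteratively refine its hypothesis about the hidden system.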