SciR: A Controllable Benchmark for Scientific Reasoning in LLMs

📅 2026-06-11

🤖 AI Summary

Existing scientific-reasoning benchmarks either depend on costly human annotation and lack mechanistic ground truth, or rely on synthetic logical-reasoning data that does not resemble real scientific documents. This work proposes SciR, a controllable benchmark covering three reasoning paradigms: deduction, induction, and causal abduction. SciR generates each task from a formal object (a deduction tree, an inductive rule hypothesis, or a causal graph), which guarantees a verifiable answer, and then renders it into multi-document scientific discourse using per-track, domain-tuned genres. This construction allows two difficulty axes to be varied independently: how hard it is to extract the key information, and how hard the inference itself is. Across the six models tested, both axes hurt every model and their effects compound; the rendering even hurts neurosymbolic pipelines that hand inference to a verified solver. Reasoning models such as deepseek-r1 mostly surpass non-reasoning instruct models on the inference axis. According to the authors, SciR is the first multi-paradigm scientific-reasoning benchmark with parametric control over both extraction and inference difficulty.

📝 Abstract

Three paradigmatic forms of inference recur across scientific reasoning: deduction, induction, and causal abduction. Reliably evaluating LLMs on these in scientific settings is currently out of reach: scientific benchmarks built on human annotations are costly and lack mechanistic ground truth, while synthetic logical-reasoning benchmarks do not resemble real scientific documents. We introduce SciR, a benchmark that combines multi-paradigm reasoning with controllable scientific rendering, anchored on three paradigmatic scientific problems. Tasks are generated from formal objects (deduction tree, inductive rule hypothesis, causal graph) to guarantee verifiable answers, then rendered into multi-document scientific discourse via per-track domain-tuned genres. The construction lets us independently vary two difficulty axes: how hard it is to extract the key information needed for inference, and how hard the principled inference itself is. We test six models. Both axes hurt every model, and their effects compound. The rendering even hurts neurosymbolic pipelines, which hand inference to a verified solver. The two axes yield a per-model extraction-vs-inference profile: for instance, reasoning models like deepseek-r1 mostly surpass non-reasoning instruct models on the inference axis. To our knowledge, SciR is the first multi-paradigm scientific-reasoning benchmark with parametric control on both extraction and inference difficulty.
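
The idea of two independently controlled difficulty axes can be illustrated with a toy sketch. This is not SciR's generator: the function `make_deduction_task`, the `Task` type, the parameters `depth` and `n_distractors`, and the sentence templates are all hypothetical, and a simple implication chain stands in for the deduction tree. The sketch only shows how an answer can be computed from a formal object while the rendered text is made harder to read without touching the logic.

```python
from __future__ import annotations

import random
from dataclasses import dataclass


@dataclass
class Task:
    documents: list[str]  # rendered sentences, shuffled
    question: str
    answer: bool          # ground truth computed from the formal object


def make_deduction_task(depth: int, n_distractors: int, seed: int = 0) -> Task:
    """Toy generator (hypothetical, not SciR's implementation).

    depth         -> inference difficulty: number of chained rules
    n_distractors -> extraction difficulty: irrelevant sentences mixed in
    """
    rng = random.Random(seed)

    # Formal object: an implication chain P0 -> P1 -> ... -> P_depth.
    props = [f"P{i}" for i in range(depth + 1)]
    rules = [(props[i], props[i + 1]) for i in range(depth)]

    # Ground truth by forward chaining from the single known fact P0.
    derived = {props[0]}
    changed = True
    while changed:
        changed = False
        for premise, conclusion in rules:
            if premise in derived and conclusion not in derived:
                derived.add(conclusion)
                changed = True
    answer = props[-1] in derived

    # Rendering: the logic is fixed above; only the surface text changes here.
    relevant = [f"We observe that {props[0]} holds."]
    relevant += [f"Whenever {a} holds, {b} follows." for a, b in rules]
    distractors = [
        f"Unrelated measurement D{j} was recorded." for j in range(n_distractors)
    ]
    documents = relevant + distractors
    rng.shuffle(documents)

    return Task(documents, f"Does {props[-1]} hold?", answer)


if __name__ == "__main__":
    task = make_deduction_task(depth=3, n_distractors=5)
    print(len(task.documents), task.question, task.answer)
```

Raising `n_distractors` leaves the rule chain and the answer unchanged, and raising `depth` leaves the amount of noise unchanged, which is the sense in which the two axes are independent in this sketch.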

Problem

Research questions and friction points this paper is trying to address.

scientific reasoning

large language models

benchmark

deduction

induction

Innovation

Methods, ideas, or system contributions that make the work stand out.

scientific reasoning

controllable benchmark

multi-paradigm inference