SymPyBench: A Dynamic Benchmark for Scientific Reasoning with Executable Python Code

📅 2025-12-05
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Evaluating scientific reasoning in university-level physics remains challenging due to the lack of scalable, dynamic, and executable benchmarks. Method: We introduce SymPyBench, the first large-scale, dynamic, executable benchmark for physics reasoning, comprising 15,045 parameterized problems, each accompanied by a structured reasoning chain and runnable Python code. We propose three novel code-execution-based metrics (consistency score, failure rate, and confusion rate) to quantify model stability and uncertainty under parametric perturbations. Our approach integrates symbolic computation with executable code generation to enable dynamic problem expansion and automated evaluation. Contribution/Results: Experiments reveal that mainstream instruction-tuned models possess basic symbolic reasoning capacity but exhibit significant performance volatility. SymPyBench effectively uncovers critical model deficiencies, such as sensitivity to parameter variation and logical inconsistency, thereby providing a reproducible, diagnosable evaluation infrastructure for developing robust scientific reasoning systems.
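The summary describes problems whose ground truth is produced by runnable symbolic code for any parameter sample. As a minimal sketch of that idea (the specific problem, function name, and parameters are illustrative, not taken from the benchmark), a parameterized projectile-range question could compute its answer with SymPy for any sampled inputs:

```python
import sympy as sp

def ground_truth_range(v0: float, theta_deg: float, g: float = 9.8) -> float:
    """Illustrative parameterized problem: projectile range on flat ground.

    Each sample of (v0, theta_deg, g) yields a fresh problem variant whose
    ground-truth answer comes from the same symbolic expression.
    """
    v, th, grav = sp.symbols("v theta g", positive=True)
    range_expr = v**2 * sp.sin(2 * th) / grav  # R = v^2 * sin(2*theta) / g
    return float(range_expr.subs({v: v0, th: sp.rad(theta_deg), grav: g}))
```

Because the answer is computed rather than stored, perturbing the parameters expands one template into an effectively unbounded family of variants with exact ground truth.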

📝 Abstract
We introduce SymPyBench, a large-scale synthetic benchmark of 15,045 university-level physics problems (90%/10% train/test split). Each problem is fully parameterized, supporting an effectively infinite range of input configurations, and is accompanied by structured, step-by-step reasoning and executable Python code that produces the ground-truth solution for any parameter set. The benchmark contains three question types: MC-Symbolic (multiple-choice with symbolic options), MC-Numerical (multiple-choice with numerical options), and free-form (open-ended responses). These diverse formats test complementary reasoning skills. By leveraging the dynamic, code-driven nature of the benchmark, we introduce three novel evaluation metrics in addition to standard accuracy: Consistency Score, Failure Rate, and Confusion Rate, which quantify variability and uncertainty across problem variants. Experiments with state-of-the-art instruction-tuned language models reveal both strengths and limitations in scientific reasoning, positioning SymPyBench as a foundation for developing more robust and interpretable reasoning systems.
Problem

Research questions and friction points this paper is trying to address.

Benchmark tests scientific reasoning with executable Python code
Evaluates models on physics problems with dynamic parameterization
Introduces novel metrics to measure consistency and failure rates
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic benchmark with parameterized physics problems
Executable Python code for ground-truth solutions
Novel metrics: Consistency, Failure, Confusion rates
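The page does not give formal definitions for the three metrics, so the following is only a plausible sketch of how per-problem rates over K parameter variants might be computed (the function name and exact formulas are assumptions, not the paper's definitions):

```python
from typing import Optional, Sequence

def variant_metrics(answers: Sequence[Optional[bool]]) -> dict:
    """Sketch of per-problem metrics over K parameter variants.

    answers[i] is True/False for a correct/incorrect graded answer on
    variant i, or None when the model produced no parsable answer.
    These definitions are illustrative and may differ from the paper's.
    """
    k = len(answers)
    failures = sum(a is None for a in answers)      # unparsable outputs
    graded = [a for a in answers if a is not None]
    correct = sum(graded)
    return {
        "consistency_score": correct / k,                # stable correctness across variants
        "failure_rate": failures / k,                    # share of failed variants
        "confusion_rate": (len(graded) - correct) / k,   # confidently wrong answers
    }
```

Under this sketch the three rates partition the K variants, so they sum to 1 for every problem; aggregating them across problems would surface the volatility under parametric perturbation that the summary describes.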
Authors
Shima Imani, Meta Reality Lab
Seungwhan Moon, Facebook, Carnegie Mellon University (Dialog Systems, Transfer Learning, Multimodal Learning, Natural Language Processing)
Adel Ahmadyan, Meta Reality Lab
Lu Zhang, Meta Reality Lab
Kirmani Ahmed, Meta Reality Lab
Babak Damavandi, Meta Reality Lab