PHYBench: Holistic Evaluation of Physical Perception and Reasoning in Large Language Models

📅 2025-04-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current large language models exhibit significantly weaker causal and procedural reasoning capabilities in real-world physical scenarios compared to human experts. Method: We introduce PHYBench, the first comprehensive benchmark for physics-grounded reasoning, covering six domains—including mechanics and electromagnetism—with 500 hierarchically structured problems. We propose Expression Edit Distance (EED) as a fine-grained evaluation metric, enabling quantitative, step-by-step comparison between model-generated reasoning paths and expert solutions. PHYBench ensures validity and reliability through realistic scenario modeling, multi-level difficulty design, and expert-verified annotations. Contribution/Results: Experiments reveal that even state-of-the-art reasoning models substantially underperform human experts on PHYBench. The benchmark dataset and evaluation framework are publicly released to advance research in physics-aware reasoning and causal inference.

📝 Abstract
We introduce PHYBench, a novel, high-quality benchmark designed for evaluating reasoning capabilities of large language models (LLMs) in physical contexts. PHYBench consists of 500 meticulously curated physics problems based on real-world physical scenarios, designed to assess the ability of models to understand and reason about realistic physical processes. Covering mechanics, electromagnetism, thermodynamics, optics, modern physics, and advanced physics, the benchmark spans difficulty levels from high school exercises to undergraduate problems and Physics Olympiad challenges. Additionally, we propose the Expression Edit Distance (EED) Score, a novel evaluation metric based on the edit distance between mathematical expressions, which effectively captures differences in model reasoning processes and results beyond traditional binary scoring methods. We evaluate various LLMs on PHYBench and compare their performance with human experts. Our results reveal that even state-of-the-art reasoning models significantly lag behind human experts, highlighting their limitations and the need for improvement in complex physical reasoning scenarios. Our benchmark results and dataset are publicly available at https://phybench-official.github.io/phybench-demo/.
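The EED Score is described as an edit distance over mathematical expressions rather than a binary right/wrong judgment. As a rough illustration only (not the paper's published formula), the sketch below parses two answer expressions into syntax trees with Python's `ast` module and scores their similarity with a Levenshtein-style ratio over preorder node sequences; the names `eed_score` and `preorder` are hypothetical, and real EED evaluation would additionally canonicalize algebraically equivalent forms.

```python
import ast
from difflib import SequenceMatcher

def preorder(node):
    """Flatten an expression AST into a preorder list of node labels."""
    if isinstance(node, ast.Name):
        label = node.id                # variable, e.g. 'm'
    elif isinstance(node, ast.Constant):
        label = repr(node.value)       # literal, e.g. '2'
    else:
        label = type(node).__name__    # operator node, e.g. 'BinOp'
    children = [x for child in ast.iter_child_nodes(node) for x in preorder(child)]
    return [label] + children

def eed_score(pred, ref):
    """Similarity in [0, 1]: 1.0 for identical trees, lower as edits accumulate."""
    a = preorder(ast.parse(pred, mode="eval").body)
    b = preorder(ast.parse(ref, mode="eval").body)
    return SequenceMatcher(None, a, b).ratio()

print(eed_score("m*g*h", "m*g*h"))    # identical expressions -> 1.0
print(eed_score("m*g*h", "m*g*h/2"))  # one structural edit -> strictly between 0 and 1
```

The point of such a metric is partial credit: an answer that is one factor away from the reference scores close to 1, while binary exact-match scoring would give it 0.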
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLMs' physical reasoning in real-world scenarios
Assessing model capabilities across multiple physics disciplines
Measuring performance gaps between AI and human experts
Innovation

Methods, ideas, or system contributions that make the work stand out.

Novel benchmark for physical reasoning evaluation
Expression Edit Distance metric for mathematical accuracy
Comprehensive physics problem coverage across difficulty levels
👥 Authors
Shi Qiu
School of Physics, Peking University
Shaoyang Guo
Peking University
Physics, AI
Zhuo-Yang Song
Undergraduate Student of Physics, Peking University
hep-ph, cs.CL
Yunbo Sun
School of Physics, Peking University
Zeyu Cai
Institute of Heavy Ion Physics, Peking University
AI for Science, Plasma Physics, AI Agents, Number Theory
Jiashen Wei
School of Physics, Peking University
Tianyu Luo
School of Physics, Peking University
Yixuan Yin
School of Physics, Peking University
Haoxu Zhang
School of Physics, Peking University
Yi Hu
Institute for Artificial Intelligence, Peking University
Chenyang Wang
School of Physics, Peking University
Chencheng Tang
School of Physics, Peking University
Haoling Chang
School of Physics, Peking University
Qi Liu
School of Physics, Peking University
Ziheng Zhou
School of Physics, Peking University
Tianyu Zhang
School of Physics, Peking University
Jingtian Zhang
School of Physics, Peking University
Zhangyi Liu
School of Physics, Peking University
Minghao Li
Beihang University
Natural Language Processing
Yuku Zhang
School of Physics, Peking University
Boxuan Jing
School of Physics, Peking University
Xianqi Yin
School of Physics, Peking University
Yutong Ren
School of Physics, Peking University
Zizhuo Fu
Institute for Artificial Intelligence, Peking University
Weike Wang
School of Physics, Peking University
Xudong Tian
School of Physics, Peking University
Anqi Lv
School of Physics, Peking University
Laifu Man
School of Physics, Peking University
Jianxiang Li
School of Physics, Peking University
Feiyu Tao
School of Physics, Peking University
Qihua Sun
School of Physics, Peking University
Zhou Liang
School of Physics, Peking University
Yu-Song Mu
School of Physics, Peking University
Zhongxuan Li
School of Physics, Peking University
Jing-Jun Zhang
School of Physics, Peking University
Shutao Zhang
School of Physics, Peking University
Xiaotian Li
School of Physics, Peking University
Xingqi Xia
School of Physics, Peking University
Jiawei Lin
School of Physics, Peking University
Zheyu Shen
Graduate Student of Electronic and Computer Engineering, University of Maryland
Machine Learning System, Large Language Model
Jiahang Chen
School of Physics, Peking University
Qiuhao Xiong
School of Physics, Peking University
Binran Wang
School of Physics, Peking University
Fengyuan Wang
School of Physics, Peking University
Ziyang Ni
School of Physics, Peking University
Bohan Zhang
Yuanpei College, Peking University
Fan Cui
School of Integrated Circuits, Peking University
Changkun Shao
School of Physics, Peking University
Qing-Hong Cao
Peking University
High Energy Physics
Ming-xing Luo
Beijing Computational Science Research Center
Muhan Zhang
Peking University
Machine Learning, Graph Neural Network, Large Language Models
Hua Xing Zhu
Peking University
Quantum Field Theory, QCD, Effective Field Theory