PHYBench: Holistic Evaluation of Physical Perception and Reasoning in Large Language Models

📅 2025-04-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current large language models exhibit significantly weaker causal and procedural reasoning capabilities in real-world physical scenarios compared to human experts. Method: We introduce PHYBench, the first comprehensive benchmark for physics-grounded reasoning, covering six domains—including mechanics and electromagnetism—with 500 hierarchically structured problems. We propose Expression Edit Distance (EED) as a fine-grained evaluation metric, enabling quantitative, step-by-step comparison between model-generated reasoning paths and expert solutions. PHYBench ensures validity and reliability through realistic scenario modeling, multi-level difficulty design, and expert-verified annotations. Contribution/Results: Experiments reveal that even state-of-the-art reasoning models substantially underperform human experts on PHYBench. The benchmark dataset and evaluation framework are publicly released to advance research in physics-aware reasoning and causal inference.

📝 Abstract
We introduce PHYBench, a novel, high-quality benchmark designed for evaluating reasoning capabilities of large language models (LLMs) in physical contexts. PHYBench consists of 500 meticulously curated physics problems based on real-world physical scenarios, designed to assess the ability of models to understand and reason about realistic physical processes. Covering mechanics, electromagnetism, thermodynamics, optics, modern physics, and advanced physics, the benchmark spans difficulty levels from high school exercises to undergraduate problems and Physics Olympiad challenges. Additionally, we propose the Expression Edit Distance (EED) Score, a novel evaluation metric based on the edit distance between mathematical expressions, which effectively captures differences in model reasoning processes and results beyond traditional binary scoring methods. We evaluate various LLMs on PHYBench and compare their performance with human experts. Our results reveal that even state-of-the-art reasoning models significantly lag behind human experts, highlighting their limitations and the need for improvement in complex physical reasoning scenarios. Our benchmark results and dataset are publicly available at https://phybench-official.github.io/phybench-demo/.
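The EED Score is described as an edit distance over mathematical expressions rather than a binary right/wrong judgment. As a rough illustration only (not the paper's published formula), the sketch below parses two answer expressions into syntax trees with Python's `ast` module and scores their similarity with a Levenshtein-style ratio over preorder node sequences; the names `eed_score` and `preorder` are hypothetical, and real EED evaluation would additionally canonicalize algebraically equivalent forms.

```python
import ast
from difflib import SequenceMatcher

def preorder(node):
    """Flatten an expression AST into a preorder list of node labels."""
    if isinstance(node, ast.Name):
        label = node.id                # variable, e.g. 'm'
    elif isinstance(node, ast.Constant):
        label = repr(node.value)       # literal, e.g. '2'
    else:
        label = type(node).__name__    # operator node, e.g. 'BinOp'
    children = [x for child in ast.iter_child_nodes(node) for x in preorder(child)]
    return [label] + children

def eed_score(pred, ref):
    """Similarity in [0, 1]: 1.0 for identical trees, lower as edits accumulate."""
    a = preorder(ast.parse(pred, mode="eval").body)
    b = preorder(ast.parse(ref, mode="eval").body)
    return SequenceMatcher(None, a, b).ratio()

print(eed_score("m*g*h", "m*g*h"))    # identical expressions -> 1.0
print(eed_score("m*g*h", "m*g*h/2"))  # one structural edit -> strictly between 0 and 1
```

The point of such a metric is partial credit: an answer that is one factor away from the reference scores close to 1, while binary exact-match scoring would give it 0.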
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLMs' physical reasoning in real-world scenarios
Assessing model capabilities across multiple physics disciplines
Measuring performance gaps between AI and human experts
Innovation

Methods, ideas, or system contributions that make the work stand out.

Novel benchmark for physical reasoning evaluation
Expression Edit Distance metric for mathematical accuracy
Comprehensive physics problem coverage across difficulty levels
👥 Authors
Shi Qiu
School of Physics, Peking University
Shaoyang Guo
Peking University
Physics, AI
Zhuo-Yang Song
Undergraduate Student of Physics, Peking University
hep-ph, cs.CL
Yunbo Sun
School of Physics, Peking University
Zeyu Cai
Institute of Heavy Ion Physics, Peking University
AI for Science, Plasma Physics, AI Agents, Number Theory
Jiashen Wei
School of Physics, Peking University
Tianyu Luo
School of Physics, Peking University
Yixuan Yin
School of Physics, Peking University
Haoxu Zhang
School of Physics, Peking University
Yi Hu
Institute for Artificial Intelligence, Peking University
Chenyang Wang
School of Physics, Peking University
Chencheng Tang
School of Physics, Peking University
Haoling Chang
School of Physics, Peking University
Qi Liu
School of Physics, Peking University
Ziheng Zhou
School of Physics, Peking University
Tianyu Zhang
School of Physics, Peking University
Jingtian Zhang
School of Physics, Peking University
Zhangyi Liu
School of Physics, Peking University
Minghao Li
Beihang University
Natural Language Processing
Yuku Zhang
School of Physics, Peking University
Boxuan Jing
School of Physics, Peking University
Xianqi Yin
School of Physics, Peking University
Yutong Ren
School of Physics, Peking University
Zizhuo Fu
Institute for Artificial Intelligence, Peking University
Weike Wang
School of Physics, Peking University
Xudong Tian
School of Physics, Peking University
Anqi Lv
School of Physics, Peking University
Laifu Man
School of Physics, Peking University
Jianxiang Li
School of Physics, Peking University
Feiyu Tao
School of Physics, Peking University
Qihua Sun
School of Physics, Peking University
Zhou Liang
School of Physics, Peking University
Yu-Song Mu
School of Physics, Peking University
Zhongxuan Li
School of Physics, Peking University
Jing-Jun Zhang
School of Physics, Peking University
Shutao Zhang
School of Physics, Peking University
Xiaotian Li
School of Physics, Peking University
Xingqi Xia
School of Physics, Peking University
Jiawei Lin
School of Physics, Peking University
Zheyu Shen
Graduate Student of Electronic and Computer Engineering, University of Maryland
Machine Learning System, Large Language Model
Jiahang Chen
School of Physics, Peking University
Qiuhao Xiong
School of Physics, Peking University
Binran Wang
School of Physics, Peking University
Fengyuan Wang
School of Physics, Peking University
Ziyang Ni
School of Physics, Peking University
Bohan Zhang
Yuanpei College, Peking University
Fan Cui
School of Integrated Circuits, Peking University
Changkun Shao
School of Physics, Peking University
Qing-Hong Cao
Peking University
High Energy Physics
Ming-xing Luo
Beijing Computational Science Research Center
Muhan Zhang
Peking University
Machine Learning, Graph Neural Network, Large Language Models
Hua Xing Zhu
Peking University
Quantum Field Theory, QCD, Effective Field Theory