🤖 AI Summary
Problem: The absence of standardized benchmarks for evaluating AI reasoning in high-energy theory and cosmology hinders progress in applying AI to theoretical physics.
Method: We introduce TPBench, a domain-specific benchmark for theoretical physics comprising 57 original problems that range from undergraduate to research-level difficulty and do not come from public problem collections. The evaluation addresses the challenges of auto-verifiability and grading. We evaluate open and closed large language models (o3-mini, o1, DeepSeek-R1, GPT-4o, and versions of Llama and Qwen) and analyze their common failure modes.
Results: The most recent models show impressive progress, but the research-level problems remain mostly unsolved, so current state-of-the-art models are still of limited use to researchers. The public problems and solutions, results for various models, and updates to the dataset and score distribution are available at tpbench.org.
📝 Abstract
We introduce a benchmark to evaluate the capability of AI to solve problems in theoretical physics, focusing on high-energy theory and cosmology. The first iteration of our benchmark consists of 57 problems of varying difficulty, from undergraduate to research level. These problems are novel in the sense that they do not come from public problem collections. We evaluate various open and closed language models on our dataset, including o3-mini, o1, DeepSeek-R1, GPT-4o, and versions of Llama and Qwen. While we find impressive progress in model performance with the most recent models, our research-level problems remain mostly unsolved. We address challenges of auto-verifiability and grading, and discuss common failure modes. While current state-of-the-art models are still of limited use for researchers, our results show that AI-assisted theoretical physics research may become possible in the near future. We discuss the main obstacles to this goal and possible strategies to overcome them. The public problems and solutions, results for various models, and updates to the dataset and score distribution are available on the dataset website, tpbench.org.