FATE: A Formal Benchmark Series for Frontier Algebra of Multiple Difficulty Levels

📅 2025-11-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing mathematical benchmarks (e.g., IMO) inadequately assess large language models’ (LLMs’) capabilities in advanced abstract algebraic reasoning. Method: We introduce FATE, a formalized algebraic evaluation benchmark comprising two subsets—FATE-H and FATE-X—each containing 100 problems spanning undergraduate to post-PhD qualifying exam difficulty in abstract and commutative algebra. FATE is the first to systematically formalize high-order algebraic concepts absent from Mathlib. We propose a two-stage evaluation framework that disentangles natural-language reasoning from formalization translation, and conduct systematic evaluations of mainstream LLMs using interactive theorem provers, identifying recurrent formalization error patterns. Results: State-of-the-art models achieve only 3% pass@64 on FATE-H and 0% on FATE-X—substantially lower than their performance on competition mathematics—demonstrating FATE’s high difficulty and its critical value in advancing research on formal mathematical reasoning.

📝 Abstract
Recent advances in large language models (LLMs) have demonstrated impressive capabilities in formal theorem proving, particularly on contest-based mathematical benchmarks like the IMO. However, these contests do not reflect the depth, breadth, and abstraction of modern mathematical research. To bridge this gap, we introduce FATE (Formal Algebra Theorem Evaluation), a new benchmark series in formal algebra designed to chart a course toward advanced mathematical reasoning. We present two new components, FATE-H and FATE-X, each with 100 problems in abstract and commutative algebra. The FATE series spans a difficulty spectrum from undergraduate exercises to problems exceeding PhD qualifying exams. Notably, FATE-X is the first formal benchmark to surpass both PhD-level exam difficulty and the coverage of the Mathlib library. Our evaluations of state-of-the-art LLM provers on this new benchmark reveal a stark performance gap compared to contest math: the best model achieves only 3% (pass@64) accuracy on FATE-H and 0% on FATE-X. Our two-stage evaluation reveals that models' natural-language reasoning is notably more accurate than their ability to formalize this reasoning. We systematically classify the common errors that arise during this formalization process. Furthermore, a comparative study shows that a specialized prover can exhibit less effective reflection than general-purpose models, reducing its accuracy at the natural-language stage. We believe FATE provides a robust and challenging benchmark that establishes essential checkpoints on the path toward research-level formal mathematical reasoning.
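The abstract reports scores as pass@64. For reference, the widely used unbiased pass@k estimator (popularized by the Codex evaluation; the paper does not specify its exact estimator, so this is a sketch under that assumption) computes the probability that at least one of k samples, drawn from n total attempts of which c are correct, succeeds:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k
    samples drawn (without replacement) from n attempts, of which c
    are correct, succeeds."""
    if n - c < k:
        # Fewer than k incorrect attempts exist, so any k-sample
        # must contain a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# With k equal to the full budget (k = n = 64), a single verified
# proof among the attempts already yields pass@64 = 1 for that problem.
print(pass_at_k(64, 1, 64))
```

Note that when k = n, pass@k simply asks whether any attempt succeeded, which is why pass@64 with a 64-sample budget reduces to per-problem success or failure.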
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLMs' ability to formalize advanced algebra beyond contest mathematics
Bridging the gap between contest performance and research-level mathematical reasoning
Assessing formal theorem proving capabilities across undergraduate to PhD-level difficulty
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces FATE benchmark series for formal algebra
Creates FATE-H and FATE-X with 100 problems each
Evaluates LLM provers using two-stage assessment method
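To make "formal algebra" concrete, here is an illustrative undergraduate-level statement in Lean 4 / Mathlib style. This is not an actual FATE problem (the benchmark's problems are substantially harder, and FATE-X covers concepts absent from Mathlib); it is only a minimal sketch of the kind of formalized statement an LLM prover must produce and prove:

```lean
import Mathlib

-- Illustrative only: an undergraduate abstract-algebra fact stated
-- and closed with an existing Mathlib lemma. FATE problems range from
-- this level up to beyond PhD qualifying exams.
example (G : Type*) [Group G] (a b : G) : (a * b)⁻¹ = b⁻¹ * a⁻¹ :=
  mul_inv_rev a b
```

The two-stage evaluation described above separates producing the natural-language proof of such a statement from translating it into a machine-checkable Lean proof.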
👥 Authors
Jiedong Jiang (Westlake Institute for Advanced Study, Westlake University)
Wanyi He (Peking University)
Yuefeng Wang (Northeastern University)
Guoxiong Gao (Peking University)
Yongle Hu (Peking University)
Jingting Wang (Peking University)
Nailing Guan (Peking University)
Peihao Wu (Ubiquant)
Chunbo Dai (Ubiquant)
Liang Xiao (New Cornerstone Science Laboratory, School of Mathematical Sciences, Peking University)
Bin Dong (Beijing International Center for Mathematical Research and the New Cornerstone Science Laboratory, Peking University; Center for Machine Learning Research, Peking University; Center for Intelligent Computing, Great Bay Institute for Advanced Study, Great Bay University)