🤖 AI Summary
This work investigates the out-of-distribution generalization of large language models (LLMs) on arithmetic proof tasks, focusing on the challenges posed by increasing proof depth, proof width, and nonlinear topological structure. To this end, the authors propose MathGAP, the first synthetic benchmark framework to enable precise, independent control over proof-tree structure (depth, width, topology). MathGAP synthesizes problem statements and their chain-of-thought proofs from formal arithmetic inference rules, structured template sampling, and controlled perturbations (e.g., premise reordering), thereby avoiding training-data contamination and the bias toward trivial linear proof paths. Experiments reveal that mainstream LLMs exhibit sharp performance degradation as proof complexity rises, especially under nonlinear topologies, and that they are highly sensitive to minor permutations of premise order. Yet they retain partial capability on complex problems, indicating noisy, structurally fragile symbolic reasoning. This work establishes a novel paradigm and benchmark for rigorously evaluating and advancing deep symbolic reasoning in LLMs.
📝 Abstract
Large language models (LLMs) can solve arithmetic word problems with high accuracy, but little is known about how well they generalize to more complex problems. This is difficult to study, as (i) much of the available evaluation data has already been seen by the most capable models during training, and (ii) existing benchmarks do not capture how problem proofs may be arbitrarily complex in various ways. In this paper, we present a data-generation framework for evaluating LLMs on problems with arbitrarily complex arithmetic proofs, called MathGAP. MathGAP generates problem statements and chain-of-thought reasoning traces according to specifications about their arithmetic proof structure, enabling systematic studies on easy-to-hard generalization with respect to complexity of proof trees. Using MathGAP, we find that LLMs show a significant decrease in performance as proofs get deeper and wider. This effect is more pronounced in complex, nonlinear proof structures, which are challenging even for the most capable models. The models are also sensitive to simple changes in sentence ordering. However, they remain capable of solving some complex problems, suggesting that reasoning generalization is noisy.
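The proof-tree framing above, where difficulty is controlled by how deep and wide the tree of inference steps grows, can be illustrated with a toy generator. This is a minimal sketch, not MathGAP's actual API: the function names and the addition-only inference rule are simplifying assumptions made here purely for illustration.

```python
import random

def build_tree(depth, width):
    """Recursively build a toy arithmetic proof tree.

    Each leaf is a stated quantity (a premise); each internal node
    combines its children by addition, so the quantity at the root is
    derivable only by traversing the entire tree. Increasing `depth`
    lengthens the inference chain; increasing `width` multiplies the
    number of premises that must be aggregated at every step.
    """
    if depth == 0:
        return {"value": random.randint(1, 9), "premise": True}
    children = [build_tree(depth - 1, width) for _ in range(width)]
    return {"value": sum(c["value"] for c in children), "children": children}

def count_premises(node):
    """Count leaf premises: width ** depth for a full tree."""
    if node.get("premise"):
        return 1
    return sum(count_premises(c) for c in node["children"])

random.seed(0)
tree = build_tree(depth=3, width=2)
print(count_premises(tree))  # 2**3 = 8 premises
print(tree["value"])         # the answer a solver must derive
```

Rendering each leaf as a natural-language sentence (and shuffling the sentence order, as in the paper's perturbation experiments) would turn such a tree into a word problem whose gold chain-of-thought trace is the bottom-up evaluation of the tree.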