Mathematical Proof as a Litmus Test: Revealing Failure Modes of Advanced Large Reasoning Models

📅 2025-06-20
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Large reasoning models exhibit fundamental deficiencies in mathematical proof: low correctness, unreliable single-step logic, incomplete reasoning, and hallucination. Method: We construct RFMDataset (200 diverse mathematical proof problems) and propose mathematical proof as a "litmus test" for systematically diagnosing logical capability; using human annotation and multi-dimensional failure analysis, we identify ten fine-grained error categories and show that self-reflection fails to mitigate core logical flaws, pointing to the need for formalized, fine-grained logical training. Contribution/Results: Experiments show that some state-of-the-art models produce fully correct proofs for fewer than 20% of problems and fail even on basic ones, while prevailing numeric benchmarks are vulnerable to data leakage. The work establishes a paradigm for rigorous, trustworthy evaluation of reasoning models and provides an empirical foundation for improving their logical fidelity.
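
The evaluation paradigm the summary describes reduces to simple bookkeeping: collect a model's proof for each problem, have annotators label every step, and count only entirely correct proofs. Below is a minimal sketch of that bookkeeping in Python; all names (`StepLabel`, `ProofRecord`, `proof_accuracy`) are hypothetical stand-ins, not the authors' released code, and the ten error-category names are deliberately left unspecified rather than guessed.

```python
from dataclasses import dataclass, field

@dataclass
class StepLabel:
    step_index: int
    correct: bool
    # One of the paper's ten fine-grained error categories;
    # the category names are omitted here rather than invented.
    error_category: str | None = None

@dataclass
class ProofRecord:
    problem_id: str
    model_proof: str
    step_labels: list[StepLabel] = field(default_factory=list)

    @property
    def fully_correct(self) -> bool:
        # A proof counts only if it has steps and every step is labeled correct.
        return bool(self.step_labels) and all(s.correct for s in self.step_labels)

def proof_accuracy(records: list[ProofRecord]) -> float:
    """Fraction of problems with an entirely correct proof (the paper's headline metric)."""
    return sum(r.fully_correct for r in records) / len(records)
```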

📝 Abstract
Large reasoning models (e.g., R1, o3) have demonstrated remarkable mathematical problem-solving abilities. However, the high reported accuracy of these advanced models on popular datasets, reliance on purely numerical evaluation and potential benchmark leakage, often masks their true reasoning shortcomings. To address this, we propose leveraging the inherent rigor and methodological complexity of mathematical proofs as a diagnostic tool to expose these hidden failures. Specifically, we introduce the RFMDataset (Reveal Failure Modes), a collection of 200 diverse mathematical proof problems, and thoroughly evaluate advanced models' performance on it. Our in-depth analysis of their failures uncovers 10 fine-grained error types, which shows fundamental limitations in current large reasoning models: 1) large reasoning models grapple profoundly with mathematical proofs, with some generating entirely correct proofs for less than 20% of problems and failing even on basic ones; 2) models exhibit a diverse spectrum of reasoning failures, prominently demonstrating the lack of guarantees for the correctness and rigor of single-step reasoning; and 3) models show hallucination and incompleteness during the reasoning process. Our findings reveal that models' self-reflection is insufficient to resolve the current logical dilemmas, necessitating formalized and fine-grained logical training.
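
To make "failing even on basic ones" concrete, here is the flavor of an elementary proof obligation, stated in Lean 4 assuming Mathlib; this example is illustrative only and is not drawn from RFMDataset.

```lean
import Mathlib.Tactic

-- Illustrative only (not from RFMDataset): an elementary claim of the kind
-- the abstract says advanced models still fail to prove rigorously.
-- Claim: the sum of two even natural numbers is even.
theorem even_add_even (a b : ℕ)
    (ha : ∃ k, a = 2 * k) (hb : ∃ k, b = 2 * k) :
    ∃ k, a + b = 2 * k := by
  obtain ⟨m, hm⟩ := ha
  obtain ⟨n, hn⟩ := hb
  exact ⟨m + n, by omega⟩
```
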
Problem

Research questions and friction points this paper is trying to address.

Exposing hidden reasoning failures in large models using mathematical proofs
Evaluating models' proof-solving abilities with the RFMDataset to uncover error types
Identifying limitations like incorrect single-step reasoning and hallucination in models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leveraging mathematical proofs for diagnostic evaluation
Introducing RFMDataset to expose reasoning failures
Identifying 10 fine-grained error types in models
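
The finding that self-reflection does not repair core logical flaws suggests a simple probe: let the model critique and revise its own proof, then re-annotate the revision. A hedged sketch follows, assuming any chat-completion callable is passed in as `ask_model`; this is a hypothetical stand-in, not the paper's actual harness.

```python
def reflect_and_revise(ask_model, problem: str, draft_proof: str, rounds: int = 2) -> str:
    """Self-reflection loop: critique, then rewrite, repeated `rounds` times.

    Per the paper's finding, re-annotating the returned proof should show
    that the core logical errors largely persist.
    """
    proof = draft_proof
    for _ in range(rounds):
        # Ask the model to audit its own proof step by step.
        critique = ask_model(
            f"Problem:\n{problem}\n\nProof:\n{proof}\n\n"
            "Check every step for logical validity and list any flaws."
        )
        # Ask the model to rewrite the proof against its own critique.
        proof = ask_model(
            f"Problem:\n{problem}\n\nProof:\n{proof}\n\nCritique:\n{critique}\n\n"
            "Rewrite the proof, fixing every flaw listed above."
        )
    return proof
```
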
👥 Authors

Dadi Guo
Hong Kong University of Science and Technology

Jiayu Liu
University of Science and Technology of China
Artificial Intelligence · Knowledge Learning · Mathematical Reasoning · Natural Language Processing

Zhiyuan Fan
PhD Student, MIT
reinforcement learning · computational game theory

Zhitao He
Hong Kong University of Science and Technology
Language Model · Language Agent · Multimodal

Haoran Li
Hong Kong University of Science and Technology

Yumeng Wang
Hong Kong University of Science and Technology

Yi R. Fung
Hong Kong University of Science and Technology