🤖 AI Summary
This study evaluates how reliably large language models (LLMs) reason mathematically in graph theory and how their performance varies with problem difficulty. The authors introduce GTBench, which they describe as the first curriculum-grounded benchmark for graph theory: 63 problems in three groups of increasing difficulty, from undergraduate definitions and basic properties, through algorithm tracing and structural reasoning, to graduate-level proof construction. The evaluation protocol combines exact-match scoring, LLM-as-judge assessment, and human expert judgment. GPT-5 reaches 95.8% zero-shot accuracy on the foundational group and retains 82% on graduate proofs, whereas the other models degrade substantially with difficulty; Llama 3.3 70B scores 0% on the hardest group under zero-shot human evaluation. The results point to clear limits of current LLMs in complex mathematical reasoning and to systematic disagreement between human and automated scoring.
📝 Abstract
Large language models (LLMs) are increasingly used as self-study assistants in technical disciplines, yet their reliability as mathematical reasoning assistants remains poorly understood. We introduce GTBench, a curriculum-grounded benchmark for evaluating LLMs as mathematical research assistants in graph theory, comprising 63 problems organized into three groups of increasing difficulty: undergraduate definitions and basic properties (Group 1), algorithm tracing and structural reasoning (Group 2), and graduate-level proof construction (Group 3). Problems are sourced from verified academic materials, including Diestel's Graph Theory. We evaluate five frontier models -- GPT-5, Claude Sonnet 4.6, Gemini 2.5 Flash-Lite, Llama 3.3 70B, and Mistral Large 3 -- under zero-shot and chain-of-thought prompting, using exact-match and LLM-as-judge evaluation for Groups 1 and 2, and a hybrid human expert and LLM-as-judge protocol for Group 3. Our results reveal a pronounced performance hierarchy: GPT-5 approaches ceiling on Group 1 (95.8% zero-shot) and maintains meaningful accuracy on graduate proofs (82%), while all other models degrade substantially with difficulty, with Llama 3.3 70B achieving 0% under human evaluation on Group 3 zero-shot. Failure mode analysis shows that "correct algorithm, wrong execution" errors dominate Groups 1 and 2, while Group 3 additionally surfaces incomplete-reasoning failures and reveals systematic disagreement between human evaluators and the automated judge, particularly on verbose or near-complete proofs (kappa = 0.48 to 0.83 across human pairs). GTBench provides the first curriculum-grounded evaluation framework for graph-theoretic reasoning in LLMs, with direct implications for the governance of AI tools in mathematical education and scientific research.
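To make the reported agreement figures concrete, the following is a minimal sketch of how a kappa value between two evaluators can be computed. It assumes the statistic is Cohen's kappa over binary correct/incorrect verdicts, which the abstract does not state; the function name `cohens_kappa` and the verdict lists are hypothetical illustrations, not data from the benchmark.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items.

    Assumption: the abstract only says "kappa"; Cohen's variant is used
    here purely for illustration.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: fraction of items given identical labels.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement, from each rater's marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (p_o - p_e) / (1 - p_e)


# Hypothetical verdicts (1 = proof accepted, 0 = rejected); not GTBench data.
human_1 = [1, 1, 0, 1, 0, 0, 1, 0]
human_2 = [1, 0, 0, 1, 0, 1, 1, 0]
print(round(cohens_kappa(human_1, human_2), 2))  # 0.5
```

In this toy example the two raters agree on 6 of 8 items (0.75 observed agreement) against 0.5 expected by chance, giving kappa = 0.5, which falls inside the 0.48 to 0.83 range reported across human pairs.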