LLMs Are Not Intelligent Thinkers: Introducing Mathematical Topic Tree Benchmark for Comprehensive Evaluation of LLMs

📅 2024-06-07
🏛️ arXiv.org
📈 Citations: 2
Influential: 0
🤖 AI Summary
Existing LLM mathematical reasoning benchmarks suffer from narrow topic coverage and lack rigorous validation of reasoning authenticity. Method: We propose MaTT—the first structured, fine-grained Mathematical Topic Tree benchmark—comprising 1,958 cross-domain problems annotated with hierarchical topic chains. The evaluation framework combines multi-round human annotation, hierarchical topic modeling, chain-of-thought (CoT) analysis, and multi-model comparison (including CoT-prompted models). Contribution/Results: GPT-4 achieves only 54% accuracy in the multiple-choice setting, and accuracy drops by up to 24.2 percentage points when answer options are removed. Moreover, only 53.3% of correct answers are accompanied by fully accurate, complete reasoning traces—revealing fundamental failures in option-free reasoning and pervasive answer–explanation inconsistency. MaTT establishes a new, interpretable, attributable, cross-topic evaluation paradigm for LLM mathematical reasoning.
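To make the benchmark's structure concrete, each MaTT item pairs a question with a hierarchical chain of topics from the general area down to a fine-grained subtopic. A hypothetical record might look like the following; the field names and the sample question are illustrative only, not drawn from the actual MaTT release.

```python
# Hypothetical shape of a single MaTT item.
# Field names ("topic_chain", "choices", etc.) are illustrative assumptions,
# not the benchmark's actual schema.
item = {
    "topic_chain": ["Mathematics", "Algebra", "Group theory", "Cyclic groups"],
    "question": "How many generators does the cyclic group Z/12Z have?",
    "choices": {"A": "2", "B": "4", "C": "6", "D": "12"},
    "answer": "B",  # Euler's totient: phi(12) = 4 generators
}

# The last element of the chain is the most specific annotated subtopic.
print(item["topic_chain"][-1])  # → Cyclic groups
```

Annotating every question with such a chain is what lets the paper report accuracy per subtopic and expose the large gaps between closely related areas.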

📝 Abstract
Large language models (LLMs) demonstrate impressive capabilities in mathematical reasoning. However, despite these achievements, current evaluations are mostly limited to specific mathematical topics, and it remains unclear whether LLMs are genuinely engaging in reasoning. To address these gaps, we present the Mathematical Topics Tree (MaTT) benchmark, a challenging and structured benchmark of 1,958 questions spanning a wide array of mathematical subjects, each paired with a detailed hierarchical chain of topics. Assessing different LLMs on MaTT, we find that the most advanced model, GPT-4, achieves a mere 54% accuracy in the multiple-choice setting. Interestingly, even with Chain-of-Thought prompting, we observe mostly no notable improvement. Moreover, LLMs' accuracy drops by up to 24.2 percentage points when the questions are presented without answer choices. Further analysis of performance across topics reveals significant discrepancies even between closely related subtopics within the same general mathematical area. To pinpoint the reasons behind these results, we conducted a manual evaluation of the completeness and correctness of the explanations generated by GPT-4 when choices were available. Surprisingly, in only 53.3% of the instances where the model provided a correct answer were the accompanying explanations deemed complete and accurate, i.e., cases where the model engaged in genuine reasoning.
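The headline comparison in the abstract, the same questions scored once with answer choices shown and once without, reduces to a percentage-point difference between two accuracy figures. A minimal sketch of that computation follows; the function names and the toy predictions are hypothetical, not part of the paper's released evaluation code.

```python
# Minimal sketch of the two evaluation settings: score identical questions
# with and without answer choices, then report the percentage-point drop.
# Function names and the toy data are illustrative assumptions.

def accuracy(predictions, answers):
    """Fraction of predictions that exactly match the gold answers."""
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)

def pp_drop(mcq_preds, open_preds, answers):
    """Percentage-point accuracy drop when answer choices are removed."""
    return 100 * (accuracy(mcq_preds, answers) - accuracy(open_preds, answers))

# Toy illustration with 4 questions:
gold = ["B", "C", "A", "D"]
with_choices = ["B", "C", "A", "A"]     # 3/4 correct -> 75%
without_choices = ["B", "D", "C", "A"]  # 1/4 correct -> 25%
print(pp_drop(with_choices, without_choices, gold))  # → 50.0
```

In the open-ended setting the gold answers would be numeric or symbolic values rather than option letters, so a real harness would also need answer normalization before the exact-match comparison.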
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLMs' genuine reasoning in diverse math topics
Assessing accuracy drop in LLMs without multiple-choice options
Analyzing correctness of LLMs' explanations for math solutions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces MaTT benchmark for LLM evaluation
Assesses LLMs across diverse mathematical topics
Analyzes reasoning quality via explanation correctness