🤖 AI Summary
This study investigates large language models' (LLMs) capacity to comprehend real-world contexts when solving mathematical word problems, and the implications for mathematics education. Method: Employing a tripartite approach (technical survey, systematic literature review, and empirical evaluation), we analyze a curated corpus of 213 studies and benchmark four LLMs (GPT-3.5-turbo, GPT-4o-mini, GPT-4.1, and o3) on 287 word problems, including official PISA items. Contribution/Results: While LLMs achieve near-perfect accuracy on conventional problems (e.g., scoring 100% on the 20 PISA items), they consistently fail on tasks requiring authentic real-world reasoning or involving implausible scenarios, revealing a fundamental limitation: reliance on superficial pattern matching rather than genuine contextual understanding. This work is the first to systematically deconstruct LLM word-problem-solving mechanisms from a mathematics-education perspective, exposing a critical conceptual misalignment between how "mathematical reasoning" is construed in AI research and how it is understood in educational practice. It provides both theoretical caution and empirical grounding for designing pedagogically sound intelligent tutoring systems.
📝 Abstract
The progress of Large Language Models (LLMs) like ChatGPT raises the question of how they can be integrated into education. One hope is that they can support mathematics learning, including word-problem solving. Since LLMs handle textual input with ease, they appear well-suited for solving mathematical word problems. Yet their actual competence, in particular whether they can make sense of the real-world context, and the implications for classrooms remain unclear. We conducted a scoping review from a mathematics-education perspective, comprising three parts: a technical overview, a systematic review of the word problems used in research, and a state-of-the-art empirical evaluation of LLMs on mathematical word problems. First, in the technical overview, we contrast how word problems and their solution processes are conceptualized for LLMs and for students. In computer-science research this is typically labeled "mathematical reasoning," a term that does not align with its usage in mathematics education. Second, our literature review of 213 studies shows that the most popular word-problem corpora are dominated by s-problems, which do not require consideration of the realities of their real-world context. Finally, our evaluation of GPT-3.5-turbo, GPT-4o-mini, GPT-4.1, and o3 on 287 word problems shows that the most recent LLMs solve these s-problems with near-perfect accuracy, including a perfect score on 20 problems from PISA. However, LLMs still show weaknesses in tackling problems whose real-world context is problematic or nonsensical. In sum, we argue on the basis of all three parts that LLMs have mastered a superficial solution process but do not make sense of word problems, which potentially limits their value as instructional tools in mathematics classrooms.