🤖 AI Summary
Multimodal large language models (MLLMs) exhibit weak mathematical reasoning capabilities, hindering their path toward artificial general intelligence (AGI).
Method: We systematically survey over 200 studies published since 2021, employing bibliometric and technical evolution analyses to synthesize key advances in Math-LLM architectures, cross-modal alignment, symbolic reasoning enhancement, synthetic data construction, and evaluation protocols.
Contribution/Results: We introduce the first comprehensive landscape of MLLM-based mathematical reasoning, spanning benchmarking, methodological frameworks, and fundamental challenges. A novel three-dimensional taxonomy for multimodal mathematical reasoning is proposed; five core AGI-limiting challenges are identified; and the first unified analytical paradigm integrating textual, formulaic, and diagrammatic inputs is established. This survey clarifies critical performance bottlenecks and evolutionary trajectories, providing theoretical foundations and practical guidelines to enhance robustness and interpretability of MLLMs on complex mathematical tasks.
📝 Abstract
Mathematical reasoning, a core aspect of human cognition, is vital across many domains, from educational problem-solving to scientific advancements. As artificial general intelligence (AGI) progresses, integrating large language models (LLMs) with mathematical reasoning tasks is becoming increasingly significant. This survey provides the first comprehensive analysis of mathematical reasoning in the era of multimodal large language models (MLLMs). We review over 200 studies published since 2021 and examine the state-of-the-art developments in Math-LLMs, with a focus on multimodal settings. We categorize the field into three dimensions: benchmarks, methodologies, and challenges. In particular, we explore the multimodal mathematical reasoning pipeline, as well as the role of (M)LLMs and the associated methodologies. Finally, we identify five major challenges hindering the realization of AGI in this domain, offering insights into future directions for enhancing multimodal reasoning capabilities. This survey serves as a critical resource for the research community in advancing the capabilities of LLMs to tackle complex multimodal reasoning tasks.