MV-MATH: Evaluating Multimodal Math Reasoning in Multi-Visual Contexts

📅 2025-02-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing multimodal mathematical benchmarks predominantly focus on single-image scenarios, failing to assess models’ mathematical reasoning capabilities in realistic multi-visual contexts. Method: We introduce MV-MATH, the first K–12 mathematical reasoning benchmark for image–text interleaved, multi-image settings, comprising 2,009 problems across 11 subjects, three difficulty levels, and diverse question types. Leveraging human-curated, education-driven data construction, we systematically define and evaluate multimodal large language models’ (MLLMs’) mathematical reasoning under multi-visual conditions, incorporating multi-granularity annotations, cross-modal alignment design, and fine-grained evaluation protocols. Contribution/Results: Experiments reveal that state-of-the-art MLLMs achieve less than 45% average accuracy on MV-MATH—substantially below human performance—uncovering critical bottlenecks and characteristic error patterns. MV-MATH thus establishes a rigorous, diagnostic benchmark to advance research in multimodal mathematical reasoning.

📝 Abstract
Multimodal Large Language Models (MLLMs) have shown promising capabilities in mathematical reasoning within visual contexts across various datasets. However, most existing multimodal math benchmarks are limited to single-visual contexts, which diverges from the multi-visual scenarios commonly encountered in real-world mathematical applications. To address this gap, we introduce MV-MATH: a meticulously curated dataset of 2,009 high-quality mathematical problems. Each problem integrates multiple images interleaved with text, derived from authentic K-12 scenarios, and enriched with detailed annotations. MV-MATH includes multiple-choice, free-form, and multi-step questions, covering 11 subject areas across 3 difficulty levels, and serves as a comprehensive and rigorous benchmark for assessing MLLMs' mathematical reasoning in multi-visual contexts. Through extensive experimentation, we observe that MLLMs encounter substantial challenges in multi-visual math tasks, with a considerable performance gap relative to human capabilities on MV-MATH. Furthermore, we analyze the performance and error patterns of various models, providing insights into MLLMs' mathematical reasoning capabilities within multi-visual settings.
Problem

Research questions and friction points this paper is trying to address.

Evaluates multimodal math reasoning in multi-visual contexts.
Introduces MV-MATH dataset for real-world K-12 scenarios.
Analyzes MLLMs' challenges and performance gaps in math tasks.
Innovation

Methods, ideas, or system contributions that make the work stand out.

MV-MATH dataset for multi-visual math reasoning
Integrates multiple images with text annotations
Assesses MLLMs in multi-visual mathematical contexts
Peijie Wang
Institute of Automation, Chinese Academy of Sciences
Multimodal LLMs; math reasoning
Zhongzhi Li
Institute of Automation, Chinese Academy of Sciences
LLM; NLP; Math Reasoning
Fei Yin
MAIS, Institute of Automation of Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences
Dekang Ran
MAIS, Institute of Automation of Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences
Cheng-Lin Liu
MAIS, Institute of Automation of Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences