VAMPS: Visual-Assisted Mathematical Problem Solving Benchmark

📅 2026-06-02

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the limited ability of current multimodal large language models to actively invoke visualization tools in support of mathematical reasoning, and in particular their difficulty in making use of plots they generate themselves. The authors introduce a benchmark of 1,168 bilingual, multimodal multiple-choice questions designed to evaluate whether a model can solve algebra and calculus problems by drawing function graphs that reveal key features such as intersections, extrema, and asymptotes. The dataset combines Iranian University Entrance Exam questions with human-reviewed, LLM-generated synthetic variants. It enables systematic assessment of vision-guided reasoning after active tool invocation, going beyond benchmarks that test understanding of fixed images. Experimental results show that, even on problems where plotting naturally suggests a solution path, a diverse set of models perform better when solving the question directly than when reasoning over generated visualizations.

📝 Abstract

Multimodal large language models are increasingly capable of complex reasoning, yet their performance often degrades when they must externalize a problem through a tool and then reason over the tool's output, particularly when that output is a visual aid. This gap matters because real engineering and scientific workflows often rely on visualization tools for analysis, validation, and decision-making. To study this discrepancy, we introduce VAMPS (Visual-Assisted Mathematical Problem Solving), a benchmark for graph-assisted mathematics. VAMPS contains 1,168 multimodal, bilingual multiple-choice question-answer pairs drawn from Iranian University Entrance Exam algebra and calculus problems and expanded with human-reviewed, LLM-generated synthetic variants. All problems are selected so that plotting provides a natural solution strategy by revealing intersections, extrema, asymptotes, and similar features. Designed for both benchmarking and diagnosis, VAMPS goes beyond prior multimodal benchmarks, which primarily evaluate reasoning over fixed visual inputs, by testing whether a model can benefit from constructing a useful graph and grounding its answer in the resulting visualization. Across a diverse set of models, we find that direct analytical solving surprisingly outperforms tool-enabled visual solving, even on problems where plotting is a natural strategy.
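
To make the idea of a graph-assisted problem concrete, the following is a minimal sketch of the kind of features a plot would reveal, computed numerically instead of drawn. It is not taken from VAMPS: the question, the functions f and g, the answer options, and the helper names find_roots and derivative are all hypothetical and chosen only for illustration. The example asks how many times y = x^3 - 3x and y = x intersect on [-3, 3], and also locates the extrema of the cubic.

```python
from typing import Callable, List


def find_roots(h: Callable[[float], float], lo: float, hi: float,
               n: int = 2000) -> List[float]:
    """Scan [lo, hi] for sign changes of h, then refine each one by bisection."""
    step = (hi - lo) / n
    xs = [lo + i * step for i in range(n + 1)]
    roots: List[float] = []
    for a, b in zip(xs, xs[1:]):
        ha, hb = h(a), h(b)
        if ha == 0.0:            # grid point lands exactly on a root
            roots.append(a)
        elif ha * hb < 0.0:      # sign change: a root lies inside (a, b)
            for _ in range(60):
                mid = 0.5 * (a + b)
                hm = h(mid)
                if ha * hm <= 0.0:
                    b = mid
                else:
                    a, ha = mid, hm
            roots.append(0.5 * (a + b))
    if h(xs[-1]) == 0.0:         # the right endpoint is never the left end of a cell
        roots.append(xs[-1])
    return roots


def derivative(f: Callable[[float], float],
               eps: float = 1e-5) -> Callable[[float], float]:
    """Central-difference approximation of f'."""
    return lambda x: (f(x + eps) - f(x - eps)) / (2.0 * eps)


# Hypothetical problem (not from the benchmark).
def f(x: float) -> float:
    return x ** 3 - 3.0 * x


def g(x: float) -> float:
    return x


# Intersections are the roots of f - g; extrema are the roots of f'.
intersections = find_roots(lambda x: f(x) - g(x), -3.0, 3.0)
extrema = find_roots(derivative(f), -3.0, 3.0)

# Hypothetical multiple-choice options: number of intersection points.
options = {"A": 1, "B": 2, "C": 3, "D": 4}
answer = next(k for k, v in options.items() if v == len(intersections))

print("intersections:", [round(x, 6) for x in intersections])
print("extrema:", [round(x, 6) for x in extrema])
print("answer:", answer)
```

A plotting tool would show the same three crossings (near x = -2, 0, and 2) and the two turning points of the cubic (near x = -1 and 1) at a glance. The benchmark, as described above, asks whether a model can produce such a graph itself and then ground its answer in it.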

Problem

Research questions and friction points this paper is trying to address.

multimodal large language models

visual-assisted reasoning

mathematical problem solving

tool-augmented reasoning

visualization

Innovation

Methods, ideas, or system contributions that make the work stand out.

visual-assisted reasoning

multimodal benchmark

tool-augmented LLMs