🤖 AI Summary
Existing large language models (LLMs) and vision-language models (VLMs) perform poorly on visually assisted mathematical reasoning, such as constructing auxiliary lines or plotting function graphs: pure text-based reasoning suffers from a spatial abstraction gap, while the images generated by multimodal models lack the fidelity and controllability these tasks require.
Method: We propose CodePlot-CoT, a paradigm that embeds executable plotting code as "visual thoughts" in the reasoning chain, enabling tightly coupled language–vision co-reasoning. To support this, we introduce Math-VR, the first large-scale bilingual dataset and benchmark for mathematical visual reasoning; design a code-driven Chain-of-Thought framework that integrates vision-language understanding, code generation, and image rendering; and build a dedicated image-to-code converter for precisely parsing complex mathematical diagrams.
Contribution/Results: On the Math-VR benchmark, our approach achieves up to a 21% improvement over its base model, significantly advancing visually assisted mathematical reasoning. All code, models, and data are publicly released.
📝 Abstract
Recent advances in Large Language Models (LLMs) and Vision Language Models (VLMs) have brought significant progress in mathematical reasoning, yet these models still face a critical bottleneck on problems that require visual assistance, such as drawing auxiliary lines or plotting function graphs. Most LLMs and VLMs are constrained to text-only reasoning chains, while unified multimodal models that can generate interleaved text and images lack the precision and controllability such tasks demand. To address this, we propose CodePlot-CoT, a code-driven Chain-of-Thought paradigm for "thinking with images" in mathematics. Our approach prompts a VLM to generate textual reasoning interleaved with executable plotting code, which is then rendered into images that serve as "visual thoughts" while solving the problem. To achieve this, we first construct Math-VR, the first large-scale bilingual dataset and benchmark for Mathematics problems with Visual Reasoning, comprising 178K samples. Second, to create high-quality training data, we develop a state-of-the-art image-to-code converter specialized in parsing complex mathematical figures into code. Finally, we train the CodePlot-CoT model on these data to solve mathematical problems. Experimental results show that our model achieves up to a 21% improvement over its base model on the new benchmark, validating the efficacy of the proposed code-driven reasoning paradigm. Our work opens a new direction for multimodal mathematical reasoning and provides the community with the first large-scale dataset, a comprehensive benchmark, and a strong approach for such problems. To facilitate future research, we make our datasets, code, and pretrained models publicly available at https://github.com/HKU-MMLab/Math-VR-CodePlot-CoT.
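The core mechanic, rendering model-generated plotting code into an image that serves as a "visual thought", can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the `render_visual_thought` helper, the convention of injecting `np`/`ax` into the generated code, and the sample snippet are all assumptions for exposition, and a real system would sandbox the untrusted code rather than call `exec` directly.

```python
import io

import matplotlib
matplotlib.use("Agg")  # headless backend: render to a buffer, no display needed
import matplotlib.pyplot as plt
import numpy as np


def render_visual_thought(plot_code: str) -> bytes:
    """Execute model-generated plotting code and return the rendered PNG bytes.

    Hypothetical protocol: the generated code is expected to draw on the
    provided `ax` using the provided `np`. Running untrusted code with bare
    `exec` is for illustration only.
    """
    fig, ax = plt.subplots()
    exec(plot_code, {"np": np, "ax": ax})
    buf = io.BytesIO()
    fig.savefig(buf, format="png")
    plt.close(fig)
    return buf.getvalue()


# Example "visual thought": plot f(x) = x^2 - 2 to see where its roots lie.
generated_code = """
x = np.linspace(-2.0, 2.0, 200)
ax.plot(x, x**2 - 2, label='f(x) = x^2 - 2')
ax.axhline(0.0, color='gray', linewidth=0.8)
ax.legend()
"""
png = render_visual_thought(generated_code)
```

In the full paradigm, the resulting image would be fed back to the VLM as the next step of the interleaved text–image reasoning chain.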