🤖 AI Summary
This work addresses the limited mathematical formula comprehension of vision-language models (VLMs) in mathematical reasoning tasks. We propose a lightweight multimodal training framework that renders LaTeX formulas into high-fidelity images and constructs structured, Chain-of-Thought-guided image-text pairs, jointly optimizing image-text alignment and prompt-guided reasoning. Our key contributions are (i) identifying rendering fidelity and prompt structure as critical performance levers, and (ii) introducing a simple yet effective text-to-visual enhancement paradigm. Evaluated on mathematical reasoning benchmarks, including MathVista and AMPS, our method achieves state-of-the-art or competitive performance relative to leading closed-source models. Moreover, it generalizes effectively across diverse multimodal understanding tasks, achieving an average improvement of +18.3% on MMMU, ChartQA, and DocVQA and significantly outperforming same-scale open-source VLMs while preserving strong general-purpose vision-language capabilities.
📝 Abstract
We present a lightweight yet effective pipeline for training vision-language models to solve math problems by rendering LaTeX-encoded equations into images and pairing them with structured chain-of-thought prompts. This simple text-to-vision augmentation enables compact multimodal architectures to achieve state-of-the-art reasoning accuracy. Through systematic ablations, we find that rendering fidelity and prompt design are the primary drivers of performance. Despite its simplicity, our approach consistently matches or surpasses both open-source and proprietary math-focused vision-language solvers on widely used benchmarks, while preserving broad general-domain competence, with gains of up to 20% on tasks such as MMMU, ChartQA, and DocVQA.
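
To make the text-to-vision augmentation concrete, the sketch below renders a LaTeX formula to an image and wraps it with a chain-of-thought style prompt to form a single image-text training pair. This is a minimal illustration under stated assumptions, not the paper's implementation: the use of matplotlib's mathtext renderer, the prompt template, the DPI setting, and the names `render_formula` and `build_pair` are all illustrative choices; a higher-fidelity pipeline would use a full LaTeX toolchain.

```python
# Minimal sketch of the text-to-visual augmentation idea (assumed, not the authors' code):
# render a LaTeX formula to a PNG and pair it with a structured chain-of-thought prompt.
import matplotlib
matplotlib.use("Agg")  # headless rendering, no display needed
import matplotlib.pyplot as plt


def render_formula(latex: str, path: str, dpi: int = 300) -> str:
    """Render a LaTeX formula to a high-resolution PNG and return its path."""
    fig = plt.figure()
    # mathtext covers a useful subset of LaTeX; swap in a real LaTeX backend for full fidelity
    fig.text(0.5, 0.5, f"${latex}$", ha="center", va="center", fontsize=24)
    fig.savefig(path, dpi=dpi, bbox_inches="tight", pad_inches=0.1)
    plt.close(fig)
    return path


def build_pair(latex: str, question: str, image_path: str) -> dict:
    """Assemble one image-text pair with an illustrative chain-of-thought prompt."""
    prompt = (
        "You are given an image of a mathematical formula.\n"
        f"Question: {question}\n"
        "Think step by step, then state the final answer."
    )
    return {"image": render_formula(latex, image_path), "prompt": prompt}


if __name__ == "__main__":
    pair = build_pair(r"\int_0^1 x^2\,dx", "Evaluate the integral.", "formula.png")
    print(pair["prompt"])
```

In a full training setup, pairs like this would be generated in bulk from a formula corpus and fed to the multimodal model's standard image-text fine-tuning objective; the rendering resolution and prompt structure are the two knobs the ablations identify as most impactful.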