🤖 AI Summary
Multimodal large language models (MLLMs) suffer from logical inconsistencies and incomplete answers in complex, multi-step mathematical reasoning. Method: We propose MM-PRM, a scalable multimodal process reward model trained within a fully automated framework that uses Monte Carlo Tree Search (MCTS) to generate over 700K fine-grained step-level annotations, providing strong supervision over intermediate reasoning steps. We further introduce MM-K12, a curated dataset of 10,000 verifiable multimodal K12 mathematics problems, and identify critical training factors: soft labels, smaller learning rates, and path diversity. Contribution/Results: MM-PRM achieves significant accuracy gains on the MM-K12 test set and on cross-domain benchmarks including OlympiadBench and MathVista. All code and data are publicly released.
📝 Abstract
While Multimodal Large Language Models (MLLMs) have achieved impressive progress in vision-language understanding, they still struggle with complex multi-step reasoning, often producing logically inconsistent or partially correct solutions. A key limitation lies in the lack of fine-grained supervision over intermediate reasoning steps. To address this, we propose MM-PRM, a process reward model trained within a fully automated, scalable framework. We first build MM-Policy, a strong multimodal model trained on diverse mathematical reasoning data. Then, we construct MM-K12, a curated dataset of 10,000 multimodal math problems with verifiable answers, which serves as seed data. Leveraging a Monte Carlo Tree Search (MCTS)-based pipeline, we generate over 700k step-level annotations without human labeling. The resulting PRM is used to score candidate reasoning paths in the Best-of-N inference setup and achieves significant improvements across both in-domain (MM-K12 test set) and out-of-domain (OlympiadBench, MathVista, etc.) benchmarks. Further analysis confirms the effectiveness of soft labels, smaller learning rates, and path diversity in optimizing PRM performance. MM-PRM demonstrates that process supervision is a powerful tool for enhancing the logical robustness of multimodal reasoning systems. We release all our code and data at https://github.com/ModalMinds/MM-PRM.
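The Best-of-N setup described above can be sketched as follows. This is a minimal illustration, not the released MM-PRM implementation: the `prm_score_step` heuristic is a toy stand-in for the trained reward model, and averaging step scores is one common aggregation choice (the paper does not specify these details here).

```python
# Hypothetical sketch of Best-of-N reranking with a process reward model (PRM).
# A real PRM would be a trained model scoring each reasoning step; here a toy
# heuristic stands in so the selection logic is runnable end to end.

def prm_score_step(step: str) -> float:
    """Stand-in PRM: rate one reasoning step in [0, 1].
    Toy heuristic: equation-bearing, non-trivial steps score higher."""
    score = 0.5
    if "=" in step:                 # step shows explicit computation
        score += 0.3
    if len(step.split()) >= 4:      # step is more than a bare assertion
        score += 0.1
    return min(score, 1.0)

def path_score(steps: list[str]) -> float:
    """Aggregate step scores into one path score (mean is one common choice)."""
    return sum(prm_score_step(s) for s in steps) / len(steps)

def best_of_n(candidate_paths: list[list[str]]) -> list[str]:
    """Best-of-N: sample N candidate reasoning paths from the policy model,
    then return the path the PRM scores highest."""
    return max(candidate_paths, key=path_score)

# Two candidate solutions to the same problem; the first shows its work.
candidates = [
    ["Area = 3 * 4 = 12", "Half of 12 = 6", "Answer: 6"],
    ["The area is big", "Answer: 6"],
]
best = best_of_n(candidates)
```

Even when both candidates reach the same final answer, the PRM prefers the path whose intermediate steps are well supported, which is the point of process (rather than outcome) supervision.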