Guided Verifier: Collaborative Multimodal Reasoning via Dynamic Process Supervision

📅 2026-02-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge that multimodal large language models often suffer from error propagation during complex reasoning due to the absence of intermediate supervision, resulting in noisy optimization signals and suboptimal performance. To mitigate this, we propose the Guided Verifier framework, which integrates a dynamic verifier with a policy model to detect inconsistencies in real time and provide directional guidance throughout the reasoning process, thereby enabling process-level supervision. We introduce a novel dynamic process supervision mechanism, construct the CoRe dataset comprising process-level negative samples and correctly guided reasoning trajectories, and further incorporate reinforcement learning, multimodal hallucination-aware data synthesis, and an interactive verifier-policy architecture. Our approach achieves significant performance gains on the MathVista, MathVerse, and MMMU benchmarks, with an 8B-parameter model attaining state-of-the-art results.

📝 Abstract
Reinforcement Learning (RL) has emerged as a pivotal mechanism for enhancing the complex reasoning capabilities of Multimodal Large Language Models (MLLMs). However, prevailing paradigms typically rely on solitary rollout strategies in which the model works alone. This lack of intermediate oversight renders the reasoning process susceptible to error propagation, where early logical deviations cascade into irreversible failures, resulting in noisy optimization signals. In this paper, we propose the Guided Verifier framework to address these structural limitations. Moving beyond passive terminal rewards, we introduce a dynamic verifier that actively co-solves tasks alongside the policy. During the rollout phase, this verifier interacts with the policy model in real time, detecting inconsistencies and providing directional signals that steer the model toward valid trajectories. To facilitate this, we develop a specialized data synthesis pipeline targeting multimodal hallucinations, constructing the CoRe dataset of process-level negatives and Correct-guide Reasoning trajectories to train the guided verifier. Extensive experiments on MathVista, MathVerse, and MMMU indicate that by allocating compute to collaborative inference and dynamic verification, an 8B-parameter model can achieve strong performance.
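The abstract describes a rollout loop in which a verifier inspects each reasoning step in real time and injects a directional hint when it detects an inconsistency. A minimal toy sketch of that interaction pattern follows; all function names and the stub policy/verifier are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of the verifier-guided rollout pattern described in
# the abstract. The policy and verifier here are deterministic stubs; in
# the paper both would be MLLMs.

def toy_policy(context):
    """Stub policy: emits the next reasoning step as text for 2 + 3 * 4."""
    last = context[-1] if context else ""
    if last.startswith("hint:"):
        # A verifier hint redirects the policy onto a valid trajectory.
        return "step: 3 * 4 = 12, then 2 + 12 = 14"
    if not any(s.startswith("step:") for s in context):
        return "step: 2 + 3 = 5, then 5 * 4 = 20"  # precedence error
    return "answer: 14"

def toy_verifier(step):
    """Stub verifier: flags a known inconsistency and returns a hint."""
    if "2 + 3 = 5" in step:
        return False, "hint: apply multiplication before addition"
    return True, ""

def guided_rollout(policy, verifier, max_steps=8):
    """Collaborative rollout: the verifier checks every step and, instead
    of a terminal reward, supplies process-level directional signals."""
    context = []
    for _ in range(max_steps):
        step = policy(context)
        ok, hint = verifier(step)
        if ok:
            context.append(step)
            if step.startswith("answer:"):
                break
        else:
            context.append(hint)  # steer the policy, drop the bad step
    return context

trajectory = guided_rollout(toy_policy, toy_verifier)
```

Under this sketch the faulty first step is intercepted, the hint enters the context, and the policy's next proposal is consistent, which mirrors the paper's claim that in-rollout verification prevents early deviations from cascading.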
Problem

Research questions and friction points this paper is trying to address.

Multimodal Reasoning
Error Propagation
Process Supervision
Reinforcement Learning
Hallucination
Innovation

Methods, ideas, or system contributions that make the work stand out.

Guided Verifier
Dynamic Process Supervision
Multimodal Reasoning
Collaborative Inference
Reinforcement Learning
Lingzhuang Sun
University of Chinese Academy of Sciences, Beijing, China
Ruitong Liu
Peking University, Beijing, China
Yuxia Zhu
Peking University, Beijing, China
Xiaohan Xu
The University of Hong Kong
Knowledge Graph, Large Language Model, Text-to-SQL
Jingxuan Wei
University of Chinese Academy of Sciences
Natural Language Processing, Multimodal Learning
Xiangxiang Zhang
University of Chinese Academy of Sciences, Beijing, China
Bihui Yu
University of Chinese Academy of Sciences, Beijing, China
Wentao Zhang
Institute of Physics, Chinese Academy of Sciences
photoemission, superconductivity, cuprate, HTSC, time-resolved