🤖 AI Summary
Process reward models (PRMs) for multimodal reasoning typically rely on heuristic rewards that weight constraint dimensions, such as visual grounding and logical consistency, equally. Strong dimensions can then mask deficiencies in weak ones, which undermines the reliability of the overall reasoning path. To address this limitation, this work proposes a worst-dimension optimization strategy that integrates dimension-aware evaluation and dynamic weight adjustment into the process reward framework. By selectively reinforcing the weakest dimension along the reasoning path, the approach improves robustness and accuracy under complex multimodal constraints. Experimental results show that the method mitigates cascading failures caused by the breakdown of any single dimension, preserving the integrity and reliability of the entire reasoning process.
📝 Abstract
Multimodal reasoning requires a reasoning path that maintains its integrity across a wide range of constraints, from visual grounding to logical consistency. However, current Process Reward Models (PRMs) rely on heuristically defined rewards that weight these dimensions equally. As a result, dominant dimensions can conceal failures in individual dimensions, and the validity of the overall reasoning process is not guaranteed.
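The masking effect described above can be illustrated with a minimal sketch. It is not the authors' implementation: the dimension names, the example scores, and the use of a plain minimum as a stand-in for a worst-dimension signal are all assumptions made for illustration.

```python
from statistics import mean

# Hypothetical per-dimension scores for a single reasoning step, each in [0, 1].
# The names and values are illustrative assumptions, not taken from the paper.
step_scores = {
    "visual_grounding": 0.1,     # the step misreads the image
    "logical_consistency": 0.9,  # but the deduction itself is sound
}


def equal_weight_reward(scores):
    """Average all dimensions with equal weight (the baseline being criticized)."""
    return mean(scores.values())


def worst_dimension(scores):
    """Return the name and score of the weakest dimension.

    Taking the minimum is an assumed, simplified stand-in for a
    worst-dimension signal; the paper's actual formulation may differ.
    """
    name = min(scores, key=scores.get)
    return name, scores[name]


avg = equal_weight_reward(step_scores)
weakest_name, weakest_score = worst_dimension(step_scores)

print(f"equal-weight reward: {avg:.2f}")
print(f"weakest dimension: {weakest_name} ({weakest_score:.2f})")
```

In this toy case the equal-weight reward comes out mid-range even though visual grounding has nearly failed, whereas the worst-dimension view surfaces the failure directly.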