🤖 AI Summary
Multimodal visual reasoning models often suffer from the training instability inherent in end-to-end reinforcement learning (RL) and the cognitive rigidity induced by supervised fine-tuning (SFT), limiting their adaptability to complex real-world scenarios. To address this, we propose GRiP, a Guided Reasoning and Perception framework that enables hierarchical, interpretable, vision-grounded reasoning by explicitly guiding the model's perceptual focus and logical inference paths. Methodologically, GRiP, built upon Qwen2.5-VL-7B, introduces a Salience-Weighted IoU Reward and a Multi-Heuristic Reward. It adopts a two-stage training paradigm: SFT first establishes foundational capabilities, after which cognition-enhanced RL jointly optimizes visual grounding and logical reasoning. Evaluated on challenging benchmarks, including TreeBench and V* Bench, GRiP achieves state-of-the-art performance among open-source models, significantly improving both accuracy and cognitive flexibility in complex visual reasoning tasks.
📝 Abstract
Models capable of "thinking with images" by dynamically grounding their reasoning in visual evidence represent a major leap in multimodal AI. However, replicating and advancing this ability is non-trivial, with current methods often trapped between the instability of end-to-end reinforcement learning (RL) and the rigidity of supervised fine-tuning (SFT). This leads to models that either struggle to learn or lack the cognitive flexibility required for complex, real-world scenes. To navigate this dilemma, we introduce GRiP (Guided Reasoning and Perception), a novel two-stage training framework that cultivates robust and flexible visually grounded reasoning by explicitly guiding the model's perceptual focus and logical pathways. The core of GRiP lies in its cognition-enhanced RL stage, which features two key innovations: (1) a Salience-Weighted IoU Reward that incentivizes the model to prioritize the localization of mission-critical objects over trivial distractors, and (2) a Multi-Heuristic Reward that encourages cognitive flexibility by rewarding diverse yet logically valid reasoning pathways. Initialized from the Qwen2.5-VL-7B model, GRiP demonstrates significant performance gains across multiple challenging benchmarks. It achieves state-of-the-art results among open-source models on TreeBench and V* Bench, confirming its effectiveness in complex visual reasoning. Our work demonstrates that moving beyond simplistic rewards and instead guiding models with cognitively inspired signals for what to see and how to think is crucial for unlocking the next level of multimodal intelligence. The code will be made publicly available.
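The abstract does not spell out how the two rewards are computed; the minimal sketch below illustrates one plausible reading, assuming per-object salience weights over the ground-truth boxes and a set of heuristic validity checks on the reasoning trace. All function names, weighting conventions, and thresholds here are hypothetical illustrations, not the paper's implementation.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def salience_weighted_iou_reward(pred_boxes, gt_boxes, salience):
    """Weight each ground-truth object's best-matching IoU by its salience,
    so missing a mission-critical object costs more than missing a distractor.
    Assumes `salience` is normalized to sum to 1 over the ground-truth objects
    (a hypothetical convention, not stated in the abstract)."""
    reward = 0.0
    for gt_box, weight in zip(gt_boxes, salience):
        best_iou = max((iou(pred, gt_box) for pred in pred_boxes), default=0.0)
        reward += weight * best_iou
    return reward


def multi_heuristic_reward(reasoning_trace, answer, gt_answer, heuristics):
    """Reward any logically valid path to the correct answer: the final answer
    must match, and the trace is scored by the best of several validity
    heuristics (each a callable returning a score in [0, 1]), rather than by
    similarity to a single reference chain of thought."""
    if answer != gt_answer:
        return 0.0
    return max(heuristic(reasoning_trace) for heuristic in heuristics)
```

Under this reading, the RL stage would combine the grounding and reasoning terms with the task-accuracy signal, e.g. a weighted sum of the two rewards above plus an answer-correctness term; the exact combination and coefficients are assumptions, not details given in the abstract.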