🤖 AI Summary
This work addresses the instability of multimodal large language models in multi-step reasoning, which is often caused by divergent reasoning paths. To mitigate this, the authors propose the CASHEW framework, which enhances stability during inference by iteratively aggregating multiple candidate reasoning trajectories and filtering hallucinated steps through visual verification. They further introduce CASHEW-RL, which internalizes this aggregation capability into a single model via Group Sequence Policy Optimization (GSPO) and a composite reward function based on minimal yet sufficient visual evidence, enabling self-aggregation during reasoning. This study is the first to apply test-time scaling principles to multimodal reasoning stability, establishing a complementary mechanism between explicit and internalized aggregation. The approach achieves significant performance gains across 13 benchmarks, with absolute improvements of 23.6 and 8.1 percentage points on ScienceQA and EgoSchema, respectively.
📝 Abstract
Vision-language models achieve strong performance across a wide range of multimodal understanding and reasoning tasks, yet their multi-step reasoning remains unstable. Repeated sampling over the same input often produces divergent reasoning trajectories and inconsistent final predictions. To address this, we introduce two complementary approaches inspired by test-time scaling: (1) CASHEW, an inference-time framework that stabilizes reasoning by iteratively aggregating multiple candidate trajectories into higher-quality reasoning traces, with explicit visual verification filtering hallucinated steps and grounding reasoning in visual evidence, and (2) CASHEW-RL, a learned variant that internalizes this aggregation behavior within a single model. CASHEW-RL is trained using Group Sequence Policy Optimization (GSPO) with a composite reward that encourages correct answers grounded in minimal yet sufficient visual evidence, while adaptively allocating reasoning effort based on task difficulty. This training objective enables robust self-aggregation at inference. Extensive experiments on 13 image understanding, video understanding, and video reasoning benchmarks show significant performance improvements, including gains of up to +23.6 percentage points on ScienceQA and +8.1 percentage points on EgoSchema.
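To make the inference-time procedure concrete, the sketch below illustrates the general shape of CASHEW-style aggregation: sample several candidate reasoning trajectories, discard steps that fail a visual-grounding check, and iteratively merge the survivors into a single trace. All names here (`sample_trajectory`, `visual_support`, the threshold value, the step representation) are hypothetical stand-ins, not the paper's actual components; this is a minimal runnable sketch of the idea, not the authors' implementation.

```python
import random

# Hypothetical stand-in: draws one candidate reasoning trace from a
# vision-language model. Each step is (text, grounding score in [0, 1]).
# Deterministic seeding simulates repeated sampling over the same input.
def sample_trajectory(seed):
    rng = random.Random(seed)
    return [(f"step-{i}", rng.random()) for i in range(4)]

# Hypothetical visual verifier: in the paper this role is played by
# explicit visual verification; here it just reads the stored score.
def visual_support(step):
    _, score = step
    return score

def aggregate(trajectories, threshold=0.5):
    """Filter weakly grounded (hallucination-prone) steps, then merge,
    keeping the best-grounded version of each step position."""
    merged = {}
    for traj in trajectories:
        for pos, step in enumerate(traj):
            if visual_support(step) < threshold:
                continue  # drop steps that fail the grounding check
            if pos not in merged or visual_support(step) > visual_support(merged[pos]):
                merged[pos] = step
    return [merged[pos] for pos in sorted(merged)]

def cashew_like_inference(num_candidates=8, rounds=2):
    # Round 0: aggregate independently sampled candidate trajectories.
    candidates = [sample_trajectory(seed) for seed in range(num_candidates)]
    trace = aggregate(candidates)
    # Later rounds: re-aggregate the merged trace together with fresh
    # samples, iteratively refining it into a higher-quality trace.
    for r in range(1, rounds):
        fresh = [sample_trajectory(seed + r * num_candidates)
                 for seed in range(num_candidates)]
        trace = aggregate(fresh + [trace])
    return trace
```

By construction, every step surviving aggregation clears the grounding threshold, which is the property the framework relies on to suppress hallucinated steps across repeated samples.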