🤖 AI Summary
This work addresses the lack of "slow-thinking" capability in vision-language models (VLMs) for multimodal mathematical and scientific reasoning. To this end, we propose a distillation-free reinforcement learning framework built upon the GRPO algorithm. Methodologically, we introduce Selective Sample Replay (SSR) to counter the vanishing-advantages problem, and Forced Rethinking, a mechanism that appends a textual trigger after the initial answer to enforce an explicit self-reflection step, enabling end-to-end optimization of multi-step reasoning. The resulting model achieves state-of-the-art results of 80.3% on MathVista, 61.8% on MathVerse, and 43.9% on MathVision, and sets new open-source records on multi-disciplinary benchmarks including MMMU-Pro, EMMA, and MEGA-Bench, significantly narrowing the performance gap with GPT-o1.
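To make the vanishing-advantages problem and the SSR remedy concrete, here is a minimal Python sketch. The buffer capacity, the |advantage|-proportional replay weighting, and the dictionary sample format are illustrative assumptions, not the paper's exact implementation.

```python
import random

def grpo_advantages(rewards):
    """Group-relative advantage: (reward - group mean) / group std.
    When every rollout in the group earns the same reward, the std is
    zero and all advantages vanish, so the group gives no gradient."""
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    if std == 0.0:
        return [0.0] * len(rewards)
    return [(r - mean) / std for r in rewards]

class SelectiveSampleReplay:
    """Keep rollouts with nonzero advantage and replay them to top up
    batches whose groups collapsed to all-zero advantages."""

    def __init__(self, capacity=4096):  # capacity is an assumed knob
        self.buffer = []
        self.capacity = capacity

    def store(self, samples):
        """Retain only informative samples: those with nonzero advantage."""
        self.buffer.extend(s for s in samples if abs(s["advantage"]) > 1e-8)
        self.buffer = self.buffer[-self.capacity:]

    def refill(self, batch, target_size):
        """Draw replayed samples with probability proportional to
        |advantage| (one plausible weighting) until the batch is full."""
        need = target_size - len(batch)
        if need > 0 and self.buffer:
            weights = [abs(s["advantage"]) for s in self.buffer]
            batch.extend(random.choices(self.buffer, weights=weights, k=need))
        return batch
```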
📝 Abstract
Recently, slow-thinking systems like GPT-o1 and DeepSeek-R1 have demonstrated great potential in solving challenging problems through explicit reflection. They significantly outperform the best fast-thinking models, such as GPT-4o, on various math and science benchmarks. However, their multimodal reasoning capabilities remain on par with fast-thinking models. For instance, GPT-o1's performance on benchmarks like MathVista, MathVerse, and MathVision is similar to that of fast-thinking models. In this paper, we aim to enhance the slow-thinking capabilities of vision-language models using reinforcement learning (without relying on distillation) to advance the state of the art. First, we adapt the GRPO algorithm with a novel technique called Selective Sample Replay (SSR) to address the vanishing advantages problem. While this approach yields strong performance, the resulting RL-trained models exhibit limited self-reflection or self-verification. To further encourage slow thinking, we introduce Forced Rethinking, which appends a textual rethinking trigger to the end of initial rollouts in RL training, explicitly enforcing a self-reflection reasoning step. By combining these two techniques, our model, VL-Rethinker, advances state-of-the-art scores on MathVista, MathVerse, and MathVision to 80.3%, 61.8%, and 43.9%, respectively. VL-Rethinker also achieves open-source SoTA on multi-disciplinary benchmarks such as MMMU-Pro, EMMA, and MEGA-Bench, narrowing the gap with GPT-o1.
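Forced Rethinking, as described above, appends a textual trigger to an initial rollout and lets the model continue generating, so the self-reflection step is produced and rewarded inside the same trajectory. A minimal sketch follows; the trigger phrases and the `model.generate` call are hypothetical stand-ins for the training stack's actual sampling API.

```python
import random

# Hypothetical trigger phrases; the paper curates its own set.
RETHINK_TRIGGERS = [
    "Wait, let me double-check my reasoning.",
    "Let me verify this answer before finalizing it.",
]

def forced_rethinking_rollout(model, prompt, max_new_tokens=512):
    """Sample an initial answer, append a rethinking trigger, then let
    the model continue so the trajectory ends with explicit reflection.
    The full sequence (answer + trigger + reflection) is what the RL
    objective rewards and optimizes end-to-end."""
    first_pass = model.generate(prompt, max_new_tokens=max_new_tokens)
    trigger = random.choice(RETHINK_TRIGGERS)
    continued = model.generate(prompt + first_pass + "\n" + trigger + "\n",
                               max_new_tokens=max_new_tokens)
    return first_pass + "\n" + trigger + "\n" + continued
```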