🤖 AI Summary
This work addresses the limited reasoning capabilities of vision-language models (VLMs) by systematically extending DeepSeek R1’s rule-based reinforcement learning (R1-RL) paradigm to multi-task visual understanding—a first in the field. Methodologically, we propose VLM-R1: a framework that constructs a vision-language joint reward function from deterministic annotations, designs a reward-robust regularization mechanism to mitigate reward hacking, and incorporates online policy optimization with a multi-scale, multi-task RL fine-tuning architecture. Key findings include the discovery of an “OD aha moment” phenomenon in object detection, identification of reward hacking causes, and empirical validation of scaling laws and strong sensitivity to data quality in VLM-RL. Experiments show VLM-R1 matches supervised fine-tuning performance on visual understanding tasks while significantly improving cross-task generalization. The code and models are publicly released to advance the VLM-RL community.
📝 Abstract
Recently, DeepSeek R1 has shown that reinforcement learning (RL) can substantially improve the reasoning capabilities of Large Language Models (LLMs) through a simple yet effective design. The core of R1 lies in its rule-based reward formulation, which leverages tasks with deterministic ground-truth answers to enable precise and stable reward computation. In the visual domain, we similarly observe that a wide range of visual understanding tasks are inherently equipped with well-defined ground-truth annotations. This property makes them naturally compatible with rule-based reward mechanisms. Motivated by this observation, we investigate the extension of R1-style reinforcement learning to Vision-Language Models (VLMs), aiming to enhance their visual reasoning capabilities. To this end, we develop VLM-R1, a dedicated framework designed to harness RL for improving VLMs' performance on general vision-language tasks. Using this framework, we further explore the feasibility of applying RL to the visual domain. Experimental results indicate that the RL-based model not only delivers competitive performance on visual understanding tasks but also surpasses Supervised Fine-Tuning (SFT) in generalization ability. Furthermore, we conduct comprehensive ablation studies that uncover a series of noteworthy insights, including the presence of reward hacking in object detection, the emergence of the "OD aha moment", the impact of training data quality, and the scaling behavior of RL across different model sizes. Through these analyses, we aim to deepen the understanding of how reinforcement learning enhances the capabilities of vision-language models, and we hope our findings and open-source contributions will support continued progress in the vision-language RL community. Our code and model are available at https://github.com/om-ai-lab/VLM-R1
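To make the idea of a rule-based reward concrete, the sketch below scores a model completion for a single-box grounding task directly against its ground-truth annotation, with no learned reward model. It is an illustration under stated assumptions, not the reward used in VLM-R1: the `<answer>[x1, y1, x2, y2]</answer>` output format, the function names `iou` and `rule_based_reward`, and the choice of plain IoU as the score are all hypothetical.

```python
import re


def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def rule_based_reward(completion, gt_box):
    """Deterministic reward: IoU of the predicted box with the annotation.

    Assumes (hypothetically) that the model writes its final box as
    <answer>[x1, y1, x2, y2]</answer>. Malformed output earns 0.0.
    """
    match = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    if match is None:
        return 0.0  # no answer tag: format violated
    numbers = re.findall(r"-?\d+(?:\.\d+)?", match.group(1))
    if len(numbers) != 4:
        return 0.0  # not exactly one box
    x1, y1, x2, y2 = (float(n) for n in numbers)
    if x2 <= x1 or y2 <= y1:
        return 0.0  # degenerate box
    return iou((x1, y1, x2, y2), gt_box)


if __name__ == "__main__":
    completion = "<think>The dog is on the left.</think><answer>[0, 0, 10, 10]</answer>"
    print(rule_based_reward(completion, (5, 0, 15, 10)))
```

Because the score depends only on the completion and the annotation, the same input always yields the same reward, which is the property the abstract credits for precise and stable reward computation.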