MM-Eureka: Exploring Visual Aha Moment with Rule-based Large-scale Reinforcement Learning

📅 2025-03-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses key challenges in multimodal reasoning, namely the difficulty of applying reinforcement learning (RL), low data efficiency, and poor interpretability, by proposing the first rule-based RL paradigm tailored to joint vision-language tasks. Methodologically, it integrates multimodal alignment modeling, reward shaping, and a self-supervised reflection mechanism, enabling RL fine-tuning of both pretrained and instruction-tuned models without supervised fine-tuning. Experiments show that the model spontaneously develops reflective behavior during multimodal reasoning: accuracy improves steadily, response length grows, and the "Aha Moment" phenomenon, previously observed only in text-based RL, is reproduced in the visual domain for the first time. Compared with supervised fine-tuning, the approach achieves significantly higher data efficiency. The project fully open-sources its code, models, and datasets to advance standardized research in multimodal RL.
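The summary mentions reward shaping under a rule-based RL paradigm. The paper's exact reward rules are not given here; a minimal sketch of a DeepSeek-R1-style rule-based reward, combining a format check with exact-match answer accuracy, might look like the following (the tag conventions, weights, and function name are illustrative assumptions, not taken from MM-Eureka):

```python
import re

def rule_based_reward(response: str, ground_truth: str) -> float:
    """Hypothetical rule-based reward: format check plus exact-match accuracy.

    Assumes responses wrap reasoning in <think>...</think> and the final
    answer in \\boxed{...}, as in DeepSeek-R1-style training; the actual
    MM-Eureka reward rules may differ.
    """
    # Format reward: response must contain a <think>...</think> block.
    format_ok = re.search(r"<think>.*?</think>", response, re.DOTALL) is not None

    # Accuracy reward: compare the boxed final answer against the ground truth.
    match = re.search(r"\\boxed\{([^}]*)\}", response)
    answer = match.group(1).strip() if match else None
    accuracy_ok = answer is not None and answer == ground_truth.strip()

    # Weighted sum; the weights here are illustrative, not from the paper.
    return 0.1 * float(format_ok) + 1.0 * float(accuracy_ok)
```

Because the reward depends only on deterministic string rules rather than a learned reward model, it is cheap to compute at scale and hard for the policy to reward-hack, which is what makes rule-based RL attractive without a supervised fine-tuning stage.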

📝 Abstract
We present MM-Eureka, a multimodal reasoning model that successfully extends large-scale rule-based reinforcement learning (RL) to multimodal reasoning. While rule-based RL has shown remarkable success in improving LLMs' reasoning abilities in text domains, its application to multimodal settings has remained challenging. Our work reproduces key characteristics of text-based RL systems like DeepSeek-R1 in the multimodal space, including steady increases in accuracy reward and response length, and the emergence of reflection behaviors. We demonstrate that both instruction-tuned and pre-trained models can develop strong multimodal reasoning capabilities through rule-based RL without supervised fine-tuning, showing superior data efficiency compared to alternative approaches. We open-source our complete pipeline to foster further research in this area. We release all our codes, models, data, etc. at https://github.com/ModalMinds/MM-EUREKA
Problem

Research questions and friction points this paper is trying to address.

How to extend rule-based RL, so far successful in text domains, to multimodal reasoning.
How to develop strong multimodal reasoning capabilities without supervised fine-tuning.
How to achieve better data efficiency than alternative multimodal training approaches.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Extends rule-based RL to multimodal reasoning
Develops multimodal capabilities without supervised fine-tuning
Open-sources complete pipeline for further research