R-4B: Incentivizing General-Purpose Auto-Thinking Capability in MLLMs via Bi-Mode Annealing and Reinforce Learning

📅 2025-08-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing multimodal large language models (MLLMs) invoke chain-of-thought (CoT) reasoning indiscriminately, even for simple queries, resulting in redundant inference and poor efficiency. This work proposes R-4B, an auto-thinking MLLM that jointly acquires both "direct answering" and "stepwise reasoning" capabilities via bi-mode annealing, with dynamic mode selection learned through reinforcement learning. The approach combines multi-stage training, an improved Group Relative Policy Optimization (GRPO) algorithm that forces the policy to generate responses in both modes, and a carefully curated cross-domain dataset. Evaluated on 25 benchmarks, R-4B achieves state-of-the-art performance, outperforming Qwen2.5-VL-7B on most tasks and matching the accuracy of 16B-class models on reasoning-intensive tasks while substantially reducing computational overhead. Its core contribution is fine-grained, learnable, and adaptive switching between reasoning modes in MLLMs.

📝 Abstract
Multimodal Large Language Models (MLLMs) equipped with step-by-step thinking capabilities have demonstrated remarkable performance on complex reasoning problems. However, this thinking process is redundant for simple problems solvable without complex reasoning. To address this inefficiency, we propose R-4B, an auto-thinking MLLM, which can adaptively decide when to think based on problem complexity. The central idea of R-4B is to empower the model with both thinking and non-thinking capabilities using bi-mode annealing, and apply Bi-mode Policy Optimization (BPO) to improve the model's accuracy in determining whether to activate the thinking process. Specifically, we first train the model on a carefully curated dataset spanning various topics, which contains samples from both thinking and non-thinking modes. The model then undergoes a second training phase under an improved GRPO framework, where the policy model is forced to generate responses from both modes for each input query. Experimental results show that R-4B achieves state-of-the-art performance across 25 challenging benchmarks. It outperforms Qwen2.5-VL-7B on most tasks and achieves performance comparable to larger models such as Kimi-VL-A3B-Thinking-2506 (16B) on reasoning-intensive benchmarks with lower computational cost.
Problem

Research questions and friction points this paper is trying to address.

Adaptively deciding when to activate thinking based on problem complexity
Reducing inefficiency of step-by-step thinking for simple problems
Improving accuracy in determining thinking mode activation through policy optimization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Bi-mode annealing for adaptive thinking
Reinforcement learning with Bi-mode Policy Optimization (BPO)
Bi-mode training that adapts reasoning depth to problem complexity
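The BPO idea summarized above (for each query, force rollouts from both the thinking and non-thinking modes, then score them together so the better-suited mode earns positive advantage) can be sketched as a toy illustration. This is not the paper's implementation; `generate` and `reward_fn` are hypothetical stand-ins supplied by the caller, and the advantage here is the plain group-relative form (reward minus group mean, without the std normalization that full GRPO applies).

```python
def group_relative_advantages(rewards):
    """GRPO-style advantage: each reward minus the group mean.

    Full GRPO also normalizes by the group's std; omitted here
    to keep the toy sketch simple.
    """
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]

def bpo_step(query, generate, reward_fn, n_per_mode=4):
    """One BPO-style rollout for a single query.

    Samples n_per_mode responses from BOTH modes, scores all of
    them as one group, and returns (mode, response, advantage)
    triples. Responses from the mode that better fits the query
    receive positive advantage, steering the policy's future
    mode choice for similar queries.
    """
    rollouts = []
    for mode in ("thinking", "non-thinking"):
        for _ in range(n_per_mode):
            resp = generate(query, mode)
            rollouts.append((mode, resp, reward_fn(query, resp)))
    advantages = group_relative_advantages([r for _, _, r in rollouts])
    return [
        (mode, resp, adv)
        for (mode, resp, _), adv in zip(rollouts, advantages)
    ]
```

In actual training these advantages would weight a clipped policy-gradient loss over the generated tokens; the sketch only shows how mixing both modes in one advantage group rewards correct mode selection.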