APO: Enhancing Reasoning Ability of MLLMs via Asymmetric Policy Optimization

📅 2025-06-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
Multimodal large language models (MLLMs) fine-tuned with reinforcement learning (RL) often "overthink" during complex reasoning and lose performance on general-purpose tasks. Method: The paper proposes Asymmetric Policy Optimization (APO), which splits sampled responses into positive and negative groups and applies (1) Difficulty-Adaptive Divergence Shaping (DADS), which dynamically adjusts the KL-divergence weight of positive samples according to their difficulty, stabilizing training, preventing policy-entropy collapse, and preserving pretrained knowledge; and (2) Suboptimal Trajectory Complexity Regularization (STCR), which penalizes overly long negative samples to curb overthinking and encourage concise reasoning. Results: Applied to Qwen2.5-VL-3B, the resulting View-R1-3B achieves an average 7% gain in reasoning performance, surpassing 7-11B competitors, while improving rather than degrading on general-purpose benchmarks.
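
The summary describes DADS only qualitatively. As a minimal sketch, assuming difficulty is measured by the fraction of correct rollouts per prompt group (the linear schedule, constants, and function name below are illustrative assumptions, not the paper's formulation):

```python
def dads_kl_weight(group_accuracy: float,
                   base_beta: float = 0.04,
                   min_beta: float = 0.005) -> float:
    """Hypothetical difficulty-adaptive KL weight for positive samples.

    One plausible shaping: easy prompts (high rollout accuracy) keep a
    strong KL pull toward the reference model, preserving pretrained
    knowledge; hard prompts relax the KL so the policy can move further
    without its entropy collapsing.
    """
    # group_accuracy: fraction of sampled responses for this prompt
    # judged correct (1.0 = easy, 0.0 = hard).
    return max(min_beta, base_beta * group_accuracy)
```

In a GRPO-style loop, a weight like this would multiply the per-token KL term for the positive group only.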

📝 Abstract
Multimodal Large Language Models (MLLMs) are powerful at integrating diverse data, but they often struggle with complex reasoning. While reinforcement learning (RL) can boost reasoning in LLMs, applying it to MLLMs is tricky. Common issues include a drop in performance on general tasks and the generation of overly detailed or "overthinking" reasoning. Our work investigates how the KL penalty and overthinking affect RL training in MLLMs. We propose Asymmetric Policy Optimization (APO) to address these issues, which divides the sampled responses into positive and negative groups. For positive samples, Difficulty-Adaptive Divergence Shaping (DADS) is introduced to dynamically adjust the KL divergence weight based on their difficulty. This method prevents policy entropy from dropping sharply, improves training stability, utilizes samples better, and preserves the model's existing knowledge. For negative samples, Suboptimal Trajectory Complexity Regularization (STCR) is proposed to penalize overly long responses. This helps mitigate overthinking and encourages more concise reasoning while preserving the model's explorative capacity. We apply our method to Qwen2.5-VL-3B, creating View-R1-3B. View-R1-3B significantly enhances reasoning capabilities, showing an average 7% gain over the base model and outperforming larger MLLMs (7-11B) on various reasoning benchmarks. Importantly, unlike other reasoning-tuned MLLMs that often degrade on general tasks, View-R1-3B maintains consistent improvement, demonstrating superior generalization. These results highlight the effectiveness and broad applicability of our DADS and STCR techniques for advancing complex multimodal reasoning in MLLMs. The code will be made available at https://github.com/Indolent-Kawhi/View-R1.
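
The abstract states what STCR does but not its functional form. A minimal sketch of one way a length penalty could be applied to the negative group only, assuming GRPO-style group-normalized advantages (the function name, penalty shape, and constants are hypothetical):

```python
import torch

def stcr_shape(advantages: torch.Tensor,
               lengths: torch.Tensor,
               alpha: float = 0.1) -> torch.Tensor:
    """Hypothetical STCR: length-penalize only negative-advantage samples."""
    mean_len = lengths.float().mean()
    # Relative excess length; zero for responses at or below the mean.
    excess = torch.clamp((lengths.float() - mean_len) / mean_len, min=0.0)
    shaped = advantages.clone()
    neg = advantages < 0
    # Long wrong answers get a more negative advantage, so the policy is
    # pushed away from verbose "overthinking" failures; long *correct*
    # reasoning is left untouched.
    shaped[neg] -= alpha * excess[neg]
    return shaped
```

For example, with advantages [1.2, -0.8, -0.8] and lengths [900, 400, 1600] tokens, only the 1600-token failure is pushed further down, to roughly -0.87.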
Problem

Research questions and friction points this paper is trying to address.

Enhancing reasoning ability in Multimodal Large Language Models (MLLMs)
Addressing performance drops and overthinking in RL-trained MLLMs
Balancing reasoning improvement with general task performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Asymmetric Policy Optimization (APO) splits sampled responses into positive and negative groups, treating each asymmetrically (see the sketch after this list)
Difficulty-Adaptive Divergence Shaping (DADS) dynamically adjusts the KL weight of positive samples by difficulty
Suboptimal Trajectory Complexity Regularization (STCR) penalizes overly long negative responses
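
Combining the two ideas, a sketch of how one asymmetric shaping step over a prompt's response group might look; this is a sketch under the same assumptions as above, with the grouping criterion, schedules, and constants all illustrative:

```python
import torch

def apo_shape_group(rewards: torch.Tensor,
                    lengths: torch.Tensor,
                    base_beta: float = 0.04,
                    alpha: float = 0.1):
    """Hypothetical asymmetric shaping for one prompt's response group."""
    # GRPO-style group-normalized advantages.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-6)

    # DADS side: estimate difficulty from rollout accuracy and set the
    # KL weight used when optimizing the positive samples.
    accuracy = (rewards > 0).float().mean().item()
    beta = max(0.005, base_beta * accuracy)

    # STCR side: longer-than-average failures are pushed down harder.
    mean_len = lengths.float().mean()
    excess = torch.clamp((lengths.float() - mean_len) / mean_len, min=0.0)
    adv = torch.where(adv < 0, adv - alpha * excess, adv)

    return adv, beta  # beta would weight the KL term for positive samples
```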
👥 Authors
Minjie Hong (Zhejiang University) · Multi-modal Learning, LLM, Reinforcement Learning, Generative Retrieval, Recommendation
Zirun Guo (Zhejiang University)
Yan Xia (Zhejiang University)
Zehan Wang (Zhejiang University)
Ziang Zhang (Zhejiang University)
Tao Jin (Zhejiang University)
Zhou Zhao (Zhejiang University) · Machine Learning, Data Mining, Multimedia Computing