PRPO: Perception-Reinforced Policy Optimization via Token-Level Dynamic Advantage Reshaping

📅 2026-06-07

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses a limitation of existing reinforcement learning approaches for large vision-language models: they rely on trajectory-level rewards, which give every generated token the same learning signal. As a result, the few perceptual tokens that are actually grounded in visual evidence receive weak supervision, which hinders modeling of the causal link between visual evidence and language generation. To overcome this, the authors propose Perception-Reinforced Policy Optimization (PRPO), a token-level reinforcement learning framework. The method introduces a Robust Visual Dependency (RVD) metric to identify key perceptual tokens, namely those whose predictions are both visually grounded and stable under perturbation, and combines it with Perceptual Advantage Reshaping (PAR) for dynamic, fine-grained credit assignment. PAR amplifies the learning signal on perceptually informative tokens while preserving stable gradients for the remaining tokens. Evaluated on seven multimodal reasoning benchmarks, the framework achieves state-of-the-art performance, with average improvements of 23.3% and 21.1% over baselines for 3B and 7B models, respectively, and shows strong cross-task generalization and improved training efficiency.

📝 Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has become an effective paradigm for improving the reasoning capability of Large Vision-Language Models (LVLMs). However, existing RLVR methods primarily rely on trajectory-level outcome rewards, which assign identical learning signals across all generated tokens. This coarse-grained credit assignment is fundamentally mismatched to multimodal reasoning, where only a sparse subset of tokens is causally grounded in visual evidence. Consequently, these pivotal perceptual tokens receive weak supervision and are often overwhelmed by language priors or reasoning-template tokens. To address this limitation, we propose Perception-Reinforced Policy Optimization (PRPO), a token-level reinforcement learning framework that explicitly identifies and reinforces pivotal perceptual tokens within long-horizon multimodal reasoning trajectories. PRPO introduces Robust Visual Dependency (RVD), a principled metric that identifies tokens whose predictions are both visually grounded and perturbation-stable, filtering out brittle or noisy visual tokens. Based on RVD, we further propose Perceptual Advantage Reshaping (PAR), a token-level credit assignment technique that amplifies perceptually informative tokens while preserving stable gradients for non-perceptual tokens. Extensive experiments on seven multimodal reasoning benchmarks demonstrate that PRPO consistently outperforms strong LVLM baselines across both 3B and 7B model scales, achieving average gains of 23.3% and 21.1%, respectively. PRPO achieves state-of-the-art performance with improved training efficiency and stronger cross-task generalization. Our findings highlight the importance of fine-grained credit assignment for scalable multimodal reinforcement learning.
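The abstract describes RVD and PAR only in words, so the following is a minimal illustrative sketch of the general idea and not the authors' formulation. Everything in it is an assumption made for illustration: the function names `rvd_scores` and `reshape_advantages`, the use of three per-token log-probability lists (original image, image removed, image perturbed), the stability tolerance `stability_tol`, and the amplification factor `alpha`. It shows how a single trajectory-level advantage could be turned into per-token advantages that are larger on tokens that depend on the image and are stable under perturbation, and unchanged elsewhere.

```python
from typing import List


def rvd_scores(
    logp_clean: List[float],
    logp_blank: List[float],
    logp_perturbed: List[float],
    stability_tol: float = 0.5,
) -> List[float]:
    """Hypothetical RVD-like score per token (not the paper's definition).

    A token scores above zero only if (a) the image raises its
    log-probability relative to a run without the image, and (b) its
    log-probability barely changes when the image is perturbed.
    """
    scores = []
    for lc, lb, lp in zip(logp_clean, logp_blank, logp_perturbed):
        grounding = max(0.0, lc - lb)          # how much the image helps
        stable = abs(lc - lp) <= stability_tol  # perturbation stability
        scores.append(grounding if stable else 0.0)
    return scores


def reshape_advantages(
    trajectory_advantage: float,
    scores: List[float],
    alpha: float = 1.0,
) -> List[float]:
    """Hypothetical PAR-like reshaping (not the paper's definition).

    Tokens with score 0 keep the original trajectory-level advantage;
    the highest-scoring token is scaled by (1 + alpha).
    """
    peak = max(scores, default=0.0)
    if peak <= 0.0:
        # No perceptual tokens found: fall back to uniform credit.
        return [trajectory_advantage] * len(scores)
    return [trajectory_advantage * (1.0 + alpha * s / peak) for s in scores]


# Toy example with made-up log-probabilities for a 4-token response.
clean = [-0.1, -0.2, -0.3, -0.1]
blank = [-0.1, -2.2, -3.3, -0.2]
perturbed = [-0.1, -0.3, -2.5, -0.1]

token_scores = rvd_scores(clean, blank, perturbed)
token_advantages = reshape_advantages(1.0, token_scores, alpha=1.0)
```

In this toy input, the second token depends strongly on the image and is stable, so it receives the largest advantage; the third token also depends on the image but changes sharply under perturbation, so it is treated as brittle and keeps the unmodified trajectory-level advantage. Normalizing by the peak score bounds the scaling factor between 1 and 1 + `alpha`, which is one simple way to keep gradients for non-perceptual tokens unchanged.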

Problem

Research questions and friction points this paper is trying to address.

credit assignment

multimodal reasoning

perceptual tokens

reinforcement learning

visual grounding

Innovation

Methods, ideas, or system contributions that make the work stand out.

Perception-Reinforced Policy Optimization

Token-Level Credit Assignment

Robust Visual Dependency