Rethinking Reinforcement Fine-Tuning in LVLM: Convergence, Reward Decomposition, and Generalization

📅 2026-04-21
📈 Citations: 0
✨ Influential: 0

🤖 AI Summary
This study investigates the impact of composite verifiable rewards on the convergence of policy optimization in reinforcement fine-tuning of vision-language foundation models, and explains why small-scale tool-augmented training generalizes effectively to out-of-distribution scenarios. To this end, the authors propose a theoretical framework termed Tool-Augmented Markov Decision Process (TA-MDP), which establishes, for the first time, convergence rate guarantees and a reward decomposition theorem for Group Relative Policy Optimization (GRPO) under composite rewards. Furthermore, they derive a PAC-Bayes generalization bound for tool-augmented policies. This work provides a rigorous theoretical foundation for reinforcement fine-tuning driven by multi-component verifiable rewards and elucidates the convergence and generalization mechanisms underlying methods such as Visual-ARFT.
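
For orientation, the classical PAC-Bayes bound (McAllester's bound in Maurer's refined form) is shown below. This is the standard textbook statement for losses in $[0,1]$, not the paper's Theorem 3, whose exact form is not given on this page. The symbols are assumed notation: $P$ is a prior over policies fixed before seeing data, $Q$ is any posterior, $n$ is the number of i.i.d. training tasks, $L$ is the expected loss, and $\hat{L}_S$ is the empirical loss on the sample $S$.

```latex
% Classical PAC-Bayes bound (illustrative; not the paper's Theorem 3).
% With probability at least 1 - \delta over the draw of S (|S| = n >= 8),
% simultaneously for all posteriors Q:
\[
  \mathbb{E}_{h \sim Q}\bigl[L(h)\bigr]
  \;\le\;
  \mathbb{E}_{h \sim Q}\bigl[\hat{L}_S(h)\bigr]
  + \sqrt{\frac{\mathrm{KL}(Q \,\|\, P) + \ln\frac{2\sqrt{n}}{\delta}}{2n}}
\]
```

Read this way, a fine-tuned policy that stays close to its pretrained prior (small $\mathrm{KL}(Q \,\|\, P)$) can have a tight bound even when $n$ is small, which is the kind of argument a PAC-Bayes analysis of small-scale tool-augmented training would plausibly rely on.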

📝 Abstract
Reinforcement fine-tuning with verifiable rewards (RLVR) has emerged as a powerful paradigm for equipping large vision-language models (LVLMs) with agentic capabilities such as tool use and multi-step reasoning. Despite striking empirical successes, most notably Visual Agentic Reinforcement Fine-Tuning (Visual-ARFT), the theoretical underpinnings of this paradigm remain poorly understood. In particular, two critical questions lack rigorous answers: (i) how does the composite structure of verifiable rewards (format compliance, answer accuracy, tool executability) affect the convergence of Group Relative Policy Optimization (GRPO), and (ii) why does training on a small set of tool-augmented tasks transfer to out-of-distribution domains? We address these gaps by introducing the Tool-Augmented Markov Decision Process (TA-MDP), a formal framework that models multimodal agentic decision-making with bounded-depth tool calls. Within this framework, we establish three main results. First, we prove that GRPO under composite verifiable rewards converges to a first-order stationary point at rate $O(1/\sqrt{T})$ with explicit dependence on the number of reward components and group size (Theorem 1). Second, we derive a Reward Decomposition Theorem that bounds the sub-optimality gap between decomposed per-component optimization and joint optimization, providing a precise characterization of when reward decomposition is beneficial (Theorem 2). Third, we establish a PAC-Bayes generalization bound for tool-augmented policies that explains the strong out-of-distribution transfer observed in Visual-ARFT (Theorem 3).
Problem

Research questions and friction points this paper is trying to address.

reinforcement fine-tuning
large vision-language models
verifiable rewards
convergence
generalization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Tool-Augmented MDP
Reward Decomposition
GRPO Convergence
PAC-Bayes Generalization
Verifiable Rewards
Carter Adams
Federal University of Bahia
Rafael Oliveira
The Federal University of Technology – Paraná (UTFPR)
Software Engineering, Software Testing, Test oracles, Software Processes
Gabriel Almeida
Federal University of Bahia
Sofia Torres
Federal University of Bahia