🤖 AI Summary
To address training instability, high computational overhead, and static reward signals arising from Actor-Critic coupling in Reinforcement Learning from Human Feedback (RLHF), this paper proposes Decoupled Value-Policy Optimization (DVPO). Methodologically, DVPO fully decouples policy and value learning by introducing a pretrained Global Value Model (GVM) to replace the conventional reward model. Crucially, it pioneers the use of token-level return-to-go as a fixed, offline supervision signal—eliminating reliance on online critics and real environment rewards. Within the PPO framework, DVPO employs a frozen GVM to compute the RL objective, enabling stable and efficient optimization. Experiments show DVPO reduces GPU memory consumption by 40% and training time by 35% compared to standard PPO, while outperforming efficient alignment methods such as DPO and matching state-of-the-art PPO performance across multiple benchmarks.
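The token-level return-to-go signal described above can be sketched in a few lines. This is an illustrative reconstruction, not code from the paper: `returns_to_go` computes, for each token position t, the (optionally discounted) sum of all rewards from t onward — the fixed, offline target on which the Global Value Model would be trained. The function name and the per-token reward input are assumptions for illustration.

```python
def returns_to_go(rewards, gamma=1.0):
    """Token-level return-to-go: R_t = sum_{k >= t} gamma^(k-t) * r_k.

    rewards: per-token scalar rewards for one trajectory (illustrative;
             in RLHF a sequence-level score is often spread over tokens).
    gamma:   discount factor (1.0 = undiscounted, common for LLM alignment).
    """
    out = []
    acc = 0.0
    # Accumulate from the last token backwards, so each position sees
    # the discounted sum of everything that follows it.
    for r in reversed(rewards):
        acc = r + gamma * acc
        out.append(acc)
    return out[::-1]
```

For example, a trajectory rewarded only at its final token, `[0.0, 0.0, 1.0]`, yields targets `[1.0, 1.0, 1.0]` undiscounted — every prefix is credited with the eventual outcome, which is what lets the GVM supervise individual tokens without an online critic.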
📝 Abstract
Proximal Policy Optimization (PPO)-based Reinforcement Learning from Human Feedback (RLHF) is essential for aligning large language models (LLMs) with human preferences. It requires joint training of an actor and critic with a pretrained, fixed reward model for guidance. This approach increases computational complexity and instability due to actor-critic interdependence. Additionally, PPO lacks access to true environment rewards in LLM tasks, limiting its adaptability. Under such conditions, pretraining a value model or a reward model becomes equivalent, as both provide fixed supervisory signals without new ground-truth feedback. To address these issues, we propose **Decoupled Value Policy Optimization (DVPO)**, a lean framework that replaces traditional reward modeling with a pretrained *global value model (GVM)*. The GVM is conditioned on policy trajectories and predicts token-level return-to-go estimates. By decoupling the value model from policy training (via frozen GVM-driven RL objectives), DVPO eliminates actor-critic interdependence, reducing GPU memory usage by 40% and training time by 35% compared to conventional RLHF. Experiments across benchmarks show DVPO outperforms efficient RLHF methods (e.g., DPO) while matching state-of-the-art PPO in performance.
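The "frozen GVM-driven RL objective" in the abstract amounts to running PPO's clipped surrogate loss with advantages derived from the frozen value model rather than from a jointly trained critic. Below is a minimal, hedged sketch of that clipped objective; the function name, the scalar per-token inputs, and the assumption that advantages arrive precomputed from GVM values are all illustrative choices, not details confirmed by the paper.

```python
import math

def clipped_ppo_loss(logp_new, logp_old, advantages, eps=0.2):
    """PPO clipped surrogate loss (scalar, minimized).

    logp_new:   per-token log-probs under the current policy.
    logp_old:   per-token log-probs under the behavior policy.
    advantages: per-token advantage estimates; in a DVPO-style setup these
                would come from the *frozen* GVM's return-to-go predictions,
                so no critic is updated alongside the policy.
    eps:        clipping range (0.2 is the common PPO default).
    """
    total = 0.0
    for ln, lo, a in zip(logp_new, logp_old, advantages):
        ratio = math.exp(ln - lo)                       # importance ratio
        unclipped = ratio * a
        clipped = max(min(ratio, 1.0 + eps), 1.0 - eps) * a
        total += min(unclipped, clipped)                # pessimistic bound
    return -total / len(advantages)                     # negate: loss to minimize
```

Because the GVM is frozen, only the policy's parameters receive gradients through `logp_new`; the advantage term is a fixed input. That single change is what removes the actor-critic interdependence the abstract identifies as the source of instability and overhead.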