Lean and Mean: Decoupled Value Policy Optimization with Global Value Guidance

📅 2025-02-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address training instability, high computational overhead, and static reward signals arising from Actor-Critic coupling in Reinforcement Learning from Human Feedback (RLHF), this paper proposes Decoupled Value-Policy Optimization (DVPO). Methodologically, DVPO fully decouples policy and value learning by introducing a pretrained Global Value Model (GVM) to replace the conventional reward model. Crucially, it pioneers the use of token-level return-to-go as a fixed, offline supervision signal—eliminating reliance on online critics and real environment rewards. Within the PPO framework, DVPO employs a frozen GVM to compute the RL objective, enabling stable and efficient optimization. Experiments show DVPO reduces GPU memory consumption by 40% and training time by 35% compared to standard PPO, while outperforming efficient alignment methods such as DPO and matching state-of-the-art PPO performance across multiple benchmarks.
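The two ingredients described above — token-level return-to-go as a fixed offline target, and a PPO-style clipped update driven by a frozen value model instead of a jointly trained critic — can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names are hypothetical, and using the GVM's return-to-go predictions directly as advantages is a simplifying assumption (the paper's exact advantage construction may differ).

```python
import math

def return_to_go(rewards, gamma=1.0):
    """Token-level return-to-go: G_t = sum_{k>=t} gamma^(k-t) * r_k.
    In DVPO such targets supervise the Global Value Model (GVM) offline;
    they stay fixed during policy optimization."""
    g, out = 0.0, []
    for r in reversed(rewards):
        g = r + gamma * g
        out.append(g)
    return out[::-1]

def dvpo_policy_loss(logp_new, logp_old, gvm_values, clip_eps=0.2):
    """PPO-style clipped surrogate where the advantage signal comes from a
    frozen, pretrained GVM rather than an online critic. `gvm_values` holds
    the GVM's per-token return-to-go predictions, used here directly as
    advantages for illustration."""
    loss = 0.0
    for ln, lo, adv in zip(logp_new, logp_old, gvm_values):
        ratio = math.exp(ln - lo)                      # pi_new / pi_old
        clipped = max(min(ratio, 1 + clip_eps), 1 - clip_eps)
        loss += -min(ratio * adv, clipped * adv)       # clipped objective
    return loss / len(logp_new)
```

Because the GVM is frozen, only the policy's parameters receive gradients, which is the source of the reported memory and wall-clock savings relative to joint actor-critic training.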

📝 Abstract
Proximal Policy Optimization (PPO)-based Reinforcement Learning from Human Feedback (RLHF) is essential for aligning large language models (LLMs) with human preferences. It requires joint training of an actor and critic with a pretrained, fixed reward model for guidance. This approach increases computational complexity and instability due to actor-critic interdependence. Additionally, PPO lacks access to true environment rewards in LLM tasks, limiting its adaptability. Under such conditions, pretraining a value model or a reward model becomes equivalent, as both provide fixed supervisory signals without new ground-truth feedback. To address these issues, we propose Decoupled Value Policy Optimization (DVPO), a lean framework that replaces traditional reward modeling with a pretrained global value model (GVM). The GVM is conditioned on policy trajectories and predicts token-level return-to-go estimates. By decoupling the value model from policy training (via frozen GVM-driven RL objectives), DVPO eliminates actor-critic interdependence, reducing GPU memory usage by 40% and training time by 35% compared to conventional RLHF. Experiments across benchmarks show DVPO outperforms efficient RLHF methods (e.g., DPO) while matching state-of-the-art PPO in performance.
Problem

Research questions and friction points this paper is trying to address.

Actor-critic interdependence in PPO-based RLHF causes training instability
Joint actor-critic training incurs high GPU memory and time overhead
PPO lacks true environment rewards in LLM tasks, leaving only static reward signals
Innovation

Methods, ideas, or system contributions that make the work stand out.

Decouples value model from policy training
Uses pretrained global value model guidance
Reduces GPU memory and training time
Chenghua Huang
Fudan University
Large Language Model · Reinforcement Learning
Lu Wang
Microsoft
Fangkai Yang
Microsoft
Pu Zhao
Microsoft
Zhixu Li
School of Computer Science, Fudan University
Qingwei Lin
Microsoft
Dongmei Zhang
Microsoft Research
Software Engineering · Machine Learning · Information Visualization
S. Rajmohan
Microsoft
Qi Zhang
Microsoft