🤖 AI Summary
To address training instability, high computational overhead, and static reward signals arising from Actor-Critic coupling in Reinforcement Learning from Human Feedback (RLHF), this paper proposes Decoupled Value-Policy Optimization (DVPO). Methodologically, DVPO fully decouples policy and value learning by introducing a pretrained Global Value Model (GVM) to replace the conventional reward model. Crucially, it pioneers the use of token-level return-to-go as a fixed, offline supervision signal—eliminating reliance on online critics and real environment rewards. Within the PPO framework, DVPO employs a frozen GVM to compute the RL objective, enabling stable and efficient optimization. Experiments show DVPO reduces GPU memory consumption by 40% and training time by 35% compared to standard PPO, while outperforming efficient alignment methods such as DPO and matching state-of-the-art PPO performance across multiple benchmarks.
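The token-level return-to-go signal described above can be sketched in a few lines. This is an illustrative reconstruction, not code from the paper: `returns_to_go` computes, for each token position t, the (optionally discounted) sum of all rewards from t onward — the fixed, offline target on which the Global Value Model would be trained. The function name and the per-token reward input are assumptions for illustration.

```python
def returns_to_go(rewards, gamma=1.0):
    """Token-level return-to-go: R_t = sum_{k >= t} gamma^(k-t) * r_k.

    rewards: per-token scalar rewards for one trajectory (illustrative;
             in RLHF a sequence-level score is often spread over tokens).
    gamma:   discount factor (1.0 = undiscounted, common for LLM alignment).
    """
    out = []
    acc = 0.0
    # Accumulate from the last token backwards, so each position sees
    # the discounted sum of everything that follows it.
    for r in reversed(rewards):
        acc = r + gamma * acc
        out.append(acc)
    return out[::-1]
```

For example, a trajectory rewarded only at its final token, `[0.0, 0.0, 1.0]`, yields targets `[1.0, 1.0, 1.0]` undiscounted — every prefix is credited with the eventual outcome, which is what lets the GVM supervise individual tokens without an online critic.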
📝 Abstract
Proximal Policy Optimization (PPO)-based Reinforcement Learning from Human Feedback (RLHF) is essential for aligning large language models (LLMs) with human preferences. It requires joint training of an actor and critic with a pretrained, fixed reward model for guidance. This approach increases computational complexity and instability due to actor-critic interdependence. Additionally, PPO lacks access to true environment rewards in LLM tasks, limiting its adaptability. Under such conditions, pretraining a value model or a reward model becomes equivalent, as both provide fixed supervisory signals without new ground-truth feedback. To address these issues, we propose **Decoupled Value Policy Optimization (DVPO)**, a lean framework that replaces traditional reward modeling with a pretrained *global value model (GVM)*. The GVM is conditioned on policy trajectories and predicts token-level return-to-go estimates. By decoupling the value model from policy training (via frozen GVM-driven RL objectives), DVPO eliminates actor-critic interdependence, reducing GPU memory usage by 40% and training time by 35% compared to conventional RLHF. Experiments across benchmarks show DVPO outperforms efficient RLHF methods (e.g., DPO) while matching state-of-the-art PPO in performance.
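The "frozen GVM-driven RL objective" in the abstract amounts to running PPO's clipped surrogate loss with advantages derived from the frozen value model rather than from a jointly trained critic. Below is a minimal, hedged sketch of that clipped objective; the function name, the scalar per-token inputs, and the assumption that advantages arrive precomputed from GVM values are all illustrative choices, not details confirmed by the paper.

```python
import math

def clipped_ppo_loss(logp_new, logp_old, advantages, eps=0.2):
    """PPO clipped surrogate loss (scalar, minimized).

    logp_new:   per-token log-probs under the current policy.
    logp_old:   per-token log-probs under the behavior policy.
    advantages: per-token advantage estimates; in a DVPO-style setup these
                would come from the *frozen* GVM's return-to-go predictions,
                so no critic is updated alongside the policy.
    eps:        clipping range (0.2 is the common PPO default).
    """
    total = 0.0
    for ln, lo, a in zip(logp_new, logp_old, advantages):
        ratio = math.exp(ln - lo)                       # importance ratio
        unclipped = ratio * a
        clipped = max(min(ratio, 1.0 + eps), 1.0 - eps) * a
        total += min(unclipped, clipped)                # pessimistic bound
    return -total / len(advantages)                     # negate: loss to minimize
```

Because the GVM is frozen, only the policy's parameters receive gradients through `logp_new`; the advantage term is a fixed input. That single change is what removes the actor-critic interdependence the abstract identifies as the source of instability and overhead.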