AI Summary
In LM-RLHF, the theoretical foundations of PPO remain weak, and KL divergence constraints are often handled heuristically. To address this, we propose KL-regularized Q-learning (KLQ), the first token-level action-value method integrating Q-learning into language-generation RLHF. KLQ theoretically reformulates policy optimization under KL constraints, establishing an explicit connection to PPO while enabling online training. It unifies token-level Q-value estimation, KL-regularized objective optimization, and LLM-as-a-judge evaluation. Experiments on summarization and single-turn dialogue tasks show that KLQ matches PPO's performance at optimizing the RLHF objective and consistently outperforms it under LLM-based evaluation. By grounding RLHF in principled Q-learning theory, KLQ offers a more rigorous, theoretically grounded alternative to PPO for language model alignment.
Abstract
Proximal Policy Optimisation (PPO) is an established and effective policy gradient algorithm used for Language Model Reinforcement Learning from Human Feedback (LM-RLHF). PPO performs well empirically but has a heuristic motivation and handles the KL-divergence constraint used in LM-RLHF in an ad-hoc manner. In this paper, we develop a new action-value RL method for the LM-RLHF setting, KL-regularised Q-Learning (KLQ). We then show that our method is equivalent to a version of PPO in a certain specific sense, despite its very different motivation. Finally, we benchmark KLQ on two key language generation tasks -- summarisation and single-turn dialogue. We demonstrate that KLQ performs on par with PPO at optimising the LM-RLHF objective, and achieves a consistently higher win-rate against PPO on LLM-as-a-judge evaluations.
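For context, the KL-constrained objective that both PPO and KLQ target in LM-RLHF is typically written as follows (this is the standard formulation from the RLHF literature, not an equation taken from this abstract; $r$ is the reward model, $\pi_{\mathrm{ref}}$ the frozen reference policy, and $\beta$ the KL coefficient):

```latex
\max_{\pi_\theta}\;
\mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)}
\Big[\, r(x, y) \,\Big]
\;-\;
\beta\,
\mathbb{E}_{x \sim \mathcal{D}}
\Big[\,
\mathrm{KL}\big(\pi_\theta(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x)\big)
\,\Big]
```

PPO typically approximates the KL term with a per-token penalty folded into the reward; the abstract's claim is that KLQ instead handles this regulariser directly within an action-value (Q-learning) formulation at the token level.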