AI Summary
In LM-RLHF, the theoretical foundations of PPO remain weak, and KL divergence constraints are often handled heuristically. To address this, we propose KL-regularized Q-learning (KLQ), the first token-level action-value method integrating Q-learning into language-generation RLHF. KLQ theoretically reformulates policy optimization under KL constraints, establishing an explicit connection to PPO while enabling online training. It unifies token-level Q-value estimation, KL-regularized objective optimization, and LLM-as-a-judge evaluation. Experiments on summarization and single-turn dialogue tasks show that KLQ matches PPO's performance at optimizing the RLHF objective and consistently outperforms it under LLM-based evaluation. By grounding RLHF in principled Q-learning theory, KLQ offers a more rigorous, theoretically grounded alternative to PPO for language model alignment.
Abstract
Proximal Policy Optimisation (PPO) is an established and effective policy gradient algorithm used for Language Model Reinforcement Learning from Human Feedback (LM-RLHF). PPO performs well empirically but has a heuristic motivation and handles the KL-divergence constraint used in LM-RLHF in an ad-hoc manner. In this paper, we develop a new action-value RL method for the LM-RLHF setting, KL-regularised Q-Learning (KLQ). We then show that our method is equivalent to a version of PPO in a certain specific sense, despite its very different motivation. Finally, we benchmark KLQ on two key language generation tasks -- summarisation and single-turn dialogue. We demonstrate that KLQ performs on par with PPO at optimising the LM-RLHF objective, and achieves a consistently higher win-rate against PPO on LLM-as-a-judge evaluations.
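For context, the KL-constrained objective that both PPO and KLQ target in LM-RLHF is typically written as follows (this is the standard formulation from the RLHF literature, not an equation taken from this abstract; $r$ is the reward model, $\pi_{\mathrm{ref}}$ the frozen reference policy, and $\beta$ the KL coefficient):

```latex
\max_{\pi_\theta}\;
\mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)}
\Big[\, r(x, y) \,\Big]
\;-\;
\beta\,
\mathbb{E}_{x \sim \mathcal{D}}
\Big[\,
\mathrm{KL}\big(\pi_\theta(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x)\big)
\,\Big]
```

PPO typically approximates the KL term with a per-token penalty folded into the reward; the abstract's claim is that KLQ instead handles this regulariser directly within an action-value (Q-learning) formulation at the token level.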