KL-Regularised Q-Learning: A Token-level Action-Value perspective on Online RLHF

πŸ“… 2025-08-23
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
In LM-RLHF, the theoretical foundations of PPO remain weak, and the KL-divergence constraint is often handled heuristically. To address this, the authors propose KL-regularised Q-learning (KLQ), a token-level action-value method that brings Q-learning into language-generation RLHF. KLQ re-formulates policy optimisation under the KL constraint, establishing an explicit equivalence to a version of PPO while supporting online training. It combines token-level Q-value estimation, optimisation of the KL-regularised objective, and LLM-as-a-judge evaluation. Experiments on summarisation and single-turn dialogue show that KLQ matches PPO at optimising the LM-RLHF objective and achieves a consistently higher win-rate under LLM-based evaluation. By grounding RLHF in Q-learning theory, KLQ offers a more principled, theoretically motivated alternative to PPO for language-model alignment.

πŸ“ Abstract
Proximal Policy Optimisation (PPO) is an established and effective policy gradient algorithm used for Language Model Reinforcement Learning from Human Feedback (LM-RLHF). PPO performs well empirically but has a heuristic motivation and handles the KL-divergence constraint used in LM-RLHF in an ad-hoc manner. In this paper, we develop a new action-value RL method for the LM-RLHF setting, KL-regularised Q-Learning (KLQ). We then show that our method is equivalent to a version of PPO in a certain specific sense, despite its very different motivation. Finally, we benchmark KLQ on two key language generation tasks -- summarisation and single-turn dialogue. We demonstrate that KLQ performs on-par with PPO at optimising the LM-RLHF objective, and achieves a consistently higher win-rate against PPO on LLM-as-a-judge evaluations.
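The KL-regularised objective underlying this line of work has a well-known closed form at the token level: under a KL penalty of strength β towards a reference policy π_ref, the optimal policy satisfies π(a|s) ∝ π_ref(a|s)·exp(Q(s,a)/β), with soft value V(s) = β·log Σ_a π_ref(a|s)·exp(Q(s,a)/β). The sketch below illustrates this standard construction numerically; it is not the paper's implementation, and the function names and the undiscounted one-step target are illustrative assumptions.

```python
import numpy as np

def kl_regularised_step(q_values, ref_logprobs, beta):
    """Derive the KL-regularised policy and soft value at one token position.

    With a KL penalty of strength beta towards pi_ref, the optimal policy is
        pi(a|s) ∝ pi_ref(a|s) * exp(Q(s,a) / beta),
    and the soft state value is
        V(s) = beta * log sum_a pi_ref(a|s) * exp(Q(s,a) / beta).
    """
    logits = ref_logprobs + q_values / beta
    # log-sum-exp with max-subtraction for numerical stability
    m = logits.max()
    log_z = m + np.log(np.exp(logits - m).sum())
    policy = np.exp(logits - log_z)       # normalised over the vocabulary
    soft_value = beta * log_z
    return policy, soft_value

def td_target(reward, next_soft_value, done):
    """Illustrative one-step target: Q(s,a) <- r + V_soft(s'), undiscounted."""
    return reward + (0.0 if done else next_soft_value)

q = np.array([1.0, 2.0, 0.5])             # toy Q-values over a 3-token vocab
ref_logprobs = np.log(np.array([0.2, 0.3, 0.5]))
policy, v = kl_regularised_step(q, ref_logprobs, beta=1.0)
```

A useful sanity check is the identity V(s) = E_π[Q(s,a)] − β·KL(π ‖ π_ref), which the softmax-over-(log π_ref + Q/β) construction satisfies by design.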
Problem

Research questions and friction points this paper is trying to address.

PPO's strong empirical performance in LM-RLHF rests on a heuristic motivation rather than firm theory
The KL-divergence constraint central to LM-RLHF is handled by PPO in an ad-hoc manner
Action-value (Q-learning) methods lack an established formulation for language-model RLHF
Innovation

Methods, ideas, or system contributions that make the work stand out.

KL-regularised Q-learning method
Token-level action-value perspective
Theoretical equivalence to a version of PPO
πŸ”Ž Similar Papers
No similar papers found.