Off-Policy Value-Based Reinforcement Learning for Large Language Models

📅 2026-03-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the low sample efficiency of on-policy reinforcement learning in large language models, which hinders long-horizon training. The authors propose ReVal, the first value-based off-policy reinforcement learning approach effectively applied to large language models. ReVal establishes a value-learning framework via Bellman updates, integrating step-level internal-consistency signals with trajectory-level outcome-verification signals, and leverages a replay buffer to enable efficient sample reuse. By moving beyond the limitations of conventional policy gradients, ReVal converges faster and achieves superior performance, outperforming GRPO by 2.7% on AIME24 and 4.5% on GPQA.

📝 Abstract
Improving data utilization efficiency is critical for scaling reinforcement learning (RL) to long-horizon tasks where generating trajectories is expensive. However, the dominant RL methods for LLMs are largely on-policy: they update on each batch of data only once, discard it, and then collect fresh samples, resulting in poor sample efficiency. In this work, we explore an alternative value-based RL framework for LLMs that naturally enables off-policy learning. We propose ReVal, a Bellman-update-based method that combines stepwise signals capturing internal consistency with trajectory-level signals derived from outcome verification. ReVal naturally supports replay-buffer-based training, allowing efficient reuse of past trajectories. Experiments on standard mathematical reasoning benchmarks show that ReVal not only converges faster but also outperforms GRPO in final performance. On DeepSeek-R1-Distill-1.5B, ReVal improves training efficiency and achieves improvements of 2.7% on AIME24 and 4.5% on the out-of-domain benchmark GPQA over GRPO. These results suggest that value-based RL is a practical alternative to policy-based methods for LLM training.
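To make the abstract's ingredients concrete, here is a minimal sketch of replay-buffer-based Bellman (TD) updates whose target mixes a per-step signal with a trajectory-level outcome reward. The paper's actual value parameterization, consistency signal, and loss are not given here, so everything below (the `MIX` weight, the tabular value table, the toy trajectory) is an illustrative assumption, not ReVal itself.

```python
import random
from collections import deque

GAMMA = 1.0  # discount; long-horizon reasoning setups often use gamma = 1
ALPHA = 0.1  # learning rate for this tabular illustration
MIX = 0.5    # hypothetical weight mixing step-level and outcome signals

class ReplayBuffer:
    """Stores (state, next_state, step_signal, outcome, done) transitions
    so past trajectories can be reused off-policy."""
    def __init__(self, capacity=10_000):
        self.buf = deque(maxlen=capacity)

    def add(self, transition):
        self.buf.append(transition)

    def sample(self, batch_size):
        return random.sample(list(self.buf), min(batch_size, len(self.buf)))

def bellman_update(V, batch):
    """One tabular TD step: the target mixes a per-step internal-consistency
    signal with the trajectory-level outcome reward at termination."""
    for s, s_next, step_signal, outcome, done in batch:
        reward = MIX * step_signal + (1 - MIX) * (outcome if done else 0.0)
        target = reward + (0.0 if done else GAMMA * V.get(s_next, 0.0))
        V[s] = V.get(s, 0.0) + ALPHA * (target - V.get(s, 0.0))
    return V

# Toy usage: one 3-step trajectory whose outcome was verified correct (1.0).
buffer = ReplayBuffer()
for t in [("s0", "s1", 0.2, 1.0, False),
          ("s1", "s2", 0.4, 1.0, False),
          ("s2", None, 0.6, 1.0, True)]:
    buffer.add(t)

V = {}
for _ in range(200):  # reuse the same stored trajectory many times
    V = bellman_update(V, buffer.sample(3))
print(sorted(V.items()))
```

The off-policy property shows up in the loop at the end: the same stored trajectory is replayed hundreds of times, whereas an on-policy method would use each batch once and discard it.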
Problem

Research questions and friction points this paper is trying to address.

off-policy learning
sample efficiency
value-based reinforcement learning
large language models
long-horizon tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

off-policy reinforcement learning
value-based RL
replay buffer
Bellman update
large language models
Peng-Yuan Wang
National Key Laboratory for Novel Software Technology & School of Artificial Intelligence, Nanjing University, China

Ziniu Li
The Chinese University of Hong Kong, Shenzhen
Machine Learning, Reinforcement Learning, Large Language Models

Tian Xu
Nanjing University
Reinforcement Learning

Bohan Yang
National Key Laboratory for Novel Software Technology & School of Artificial Intelligence, Nanjing University, China

Tian-Shuo Liu
National Key Laboratory for Novel Software Technology & School of Artificial Intelligence, Nanjing University, China

ChenYang Wang
National Key Laboratory for Novel Software Technology & School of Artificial Intelligence, Nanjing University, China

Xiong-Hui Chen
National Key Laboratory for Novel Software Technology & School of Artificial Intelligence, Nanjing University, China

Yi-Chen Li
Nanjing University
Reinforcement Learning, Imitation Learning, RLHF

Tianyun Yang
Shenzhen Research Institute of Big Data
Mechanistic Interpretability, Large Vision-Language Model, AI Safety

Congliang Chen
Ph.D. Student, The Chinese University of Hong Kong (Shenzhen)
Optimization, Machine Learning

Yang Yu
Professor, Nanjing University
Artificial Intelligence, Reinforcement Learning, Evolutionary Algorithms