🤖 AI Summary
This work addresses the low sample efficiency of on-policy reinforcement learning for large language models, which hinders long-horizon training. The authors propose ReVal, the first value-based off-policy reinforcement learning approach effectively applied to large language models. ReVal establishes a value-learning framework via Bellman updates, integrating step-level internal-consistency signals with trajectory-level outcome-verification signals, and leverages a replay buffer to enable efficient sample reuse. By moving beyond the limitations of conventional policy-gradient training, ReVal achieves faster convergence and superior final performance, outperforming GRPO by 2.7% on AIME24 and by 4.5% on the GPQA benchmark.
📝 Abstract
Improving data utilization efficiency is critical for scaling reinforcement learning (RL) to long-horizon tasks where generating trajectories is expensive. However, the dominant RL methods for LLMs are largely on-policy: they update on each batch of data only once, discard it, and then collect fresh samples, resulting in poor sample efficiency. In this work, we explore an alternative value-based RL framework for LLMs that naturally enables off-policy learning. We propose ReVal, a Bellman-update-based method that combines stepwise signals capturing internal consistency with trajectory-level signals derived from outcome verification. ReVal naturally supports replay-buffer-based training, allowing efficient reuse of past trajectories. Experiments on standard mathematical reasoning benchmarks show that ReVal not only converges faster but also outperforms GRPO in final performance. On DeepSeek-R1-Distill-1.5B, ReVal improves training efficiency and achieves improvements of 2.7% on AIME24 and 4.5% on the out-of-domain benchmark GPQA over GRPO. These results suggest that value-based RL is a practical alternative to policy-based methods for LLM training.
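To make the off-policy ingredient concrete, below is a minimal sketch of value learning via one-step Bellman updates over a replay buffer, which is the general mechanism the abstract describes. It is not ReVal's actual implementation: the states here are abstract feature vectors rather than LLM reasoning states, the scalar reward stands in for the paper's combination of step-level consistency and trajectory-level outcome-verification signals, and all names (`ReplayBuffer`, `value_net`, `state_dim`, `gamma`) are illustrative assumptions.

```python
# Illustrative sketch only: generic Bellman-update value learning with a
# replay buffer, not the ReVal algorithm from the paper.
import random
from collections import deque

import torch
import torch.nn as nn

class ReplayBuffer:
    """Stores (state, next_state, reward, done) transitions for off-policy reuse."""
    def __init__(self, capacity=10_000):
        self.buffer = deque(maxlen=capacity)

    def add(self, state, next_state, reward, done):
        self.buffer.append((state, next_state, reward, done))

    def sample(self, batch_size):
        batch = random.sample(self.buffer, batch_size)
        states, next_states, rewards, dones = map(torch.stack, zip(*batch))
        return states, next_states, rewards, dones

    def __len__(self):
        return len(self.buffer)

# Value network: maps a state representation to a scalar value estimate.
state_dim = 16  # placeholder feature size (assumption)
value_net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, 1))
optimizer = torch.optim.Adam(value_net.parameters(), lr=1e-3)
gamma = 1.0  # no discounting within a trajectory (assumption)

buffer = ReplayBuffer()

# Fill the buffer with synthetic transitions (stand-in for real rollouts,
# whose rewards would come from consistency / outcome-verification signals).
for _ in range(256):
    s, s_next = torch.randn(state_dim), torch.randn(state_dim)
    r, d = torch.rand(()), torch.tensor(float(random.random() < 0.1))
    buffer.add(s, s_next, r, d)

# Off-policy value learning: repeatedly re-sample past transitions and
# regress V(s) toward the one-step Bellman target r + gamma * V(s').
for step in range(100):
    states, next_states, rewards, dones = buffer.sample(batch_size=32)
    with torch.no_grad():
        targets = rewards + gamma * (1 - dones) * value_net(next_states).squeeze(-1)
    values = value_net(states).squeeze(-1)
    loss = nn.functional.mse_loss(values, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Because the Bellman target only requires stored transitions, each trajectory can be replayed many times across training steps, which is the sample-reuse property that distinguishes this setup from single-use on-policy batches.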