Enhancing Efficiency and Exploration in Reinforcement Learning for LLMs

📅 2025-05-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
Fixed rollout allocation in reinforcement learning (RL) for large language models (LLMs) leads to inefficient training, while RL fine-tuning often induces exploration collapse, degrading response diversity. Method: We propose a dynamic difficulty-aware rollout allocation mechanism and an entropy-stabilized adaptive temperature control strategy. The former allocates simulation budget dynamically based on real-time problem difficulty to improve learning efficiency on hard instances; the latter maintains exploration capability through online entropy regularization, breaking the accuracy–diversity trade-off. Integrated within the PPO framework, our approach jointly leverages online difficulty estimation and entropy-constrained scheduling. Contribution/Results: On multiple reasoning benchmarks, our method reduces rollout consumption by 37% while significantly outperforming baselines, achieving for the first time a simultaneous improvement in both accuracy and exploration capacity during RL training.
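The difficulty-aware allocation idea above can be sketched as a simple budget split: questions with a low empirical pass rate receive more rollouts, while already-solved questions keep only a minimum allocation. This is an illustrative assumption, not the paper's exact rule; the function name, the linear difficulty weight `1 - pass_rate`, and the `min_rollouts` floor are all hypothetical.

```python
# Hypothetical sketch of difficulty-aware rollout allocation. Easy questions
# (pass rate near 1) yield little gradient signal, so the shared budget is
# shifted toward harder questions estimated online from previous rollouts.

def allocate_rollouts(success_rates, total_budget, min_rollouts=2):
    """Split a fixed rollout budget across questions by estimated difficulty.

    success_rates: per-question empirical pass rates in [0, 1].
    total_budget:  total number of rollouts available this step.
    """
    # Difficulty weight: hard questions (low pass rate) get more rollouts.
    weights = [1.0 - r for r in success_rates]
    total_w = sum(weights) or 1.0

    # Budget left after guaranteeing every question its minimum.
    extra = total_budget - min_rollouts * len(success_rates)

    return [min_rollouts + round(extra * w / total_w) for w in weights]
```

For example, with pass rates `[0.0, 0.5, 1.0]` and a budget of 12, the hardest question gets 6 rollouts, the medium one 4, and the solved one only the floor of 2.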

📝 Abstract
Reasoning large language models (LLMs) excel in complex tasks, which has drawn significant attention to reinforcement learning (RL) for LLMs. However, existing approaches allocate an equal number of rollouts to all questions during the RL process, which is inefficient. This inefficiency stems from the fact that training on simple questions yields limited gains, whereas more rollouts are needed for challenging questions to sample correct answers. Furthermore, while RL improves response precision, it limits the model's exploration ability, potentially resulting in a performance cap below that of the base model prior to RL. To address these issues, we propose a mechanism for dynamically allocating rollout budgets based on the difficulty of the problems, enabling more efficient RL training. Additionally, we introduce an adaptive dynamic temperature adjustment strategy to maintain the entropy at a stable level, thereby encouraging sufficient exploration. This enables LLMs to improve response precision while preserving their exploratory ability to uncover potential correct pathways. The code and data are available at: https://github.com/LiaoMengqi/E3-RL4LLMs
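The entropy-stabilizing temperature adjustment described in the abstract can be sketched as a simple proportional controller: when measured policy entropy drops below a target (exploration collapsing), the sampling temperature is raised; when it overshoots, the temperature is lowered. The multiplicative update rule, the gain `lr`, and the clipping bounds below are all assumptions for illustration, not the paper's published schedule.

```python
# Hypothetical sketch of entropy-stabilized adaptive temperature control:
# one proportional-control step that nudges the sampling temperature so the
# measured policy entropy tracks a fixed target.

def update_temperature(temp, measured_entropy, target_entropy,
                       lr=0.1, t_min=0.5, t_max=2.0):
    """Return the next sampling temperature given the current entropy reading."""
    # Positive error => entropy below target => increase temperature
    # to flatten the sampling distribution and restore exploration.
    error = target_entropy - measured_entropy
    new_temp = temp * (1.0 + lr * error)

    # Clip to a safe range so a noisy entropy estimate cannot push the
    # temperature to degenerate values.
    return max(t_min, min(t_max, new_temp))
```

Calling this once per training step with an online entropy estimate keeps the update cheap; e.g. `update_temperature(1.0, 0.5, 1.0)` raises the temperature, while `update_temperature(1.0, 1.5, 1.0)` lowers it.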
Problem

Research questions and friction points this paper is trying to address.

Inefficient equal rollout allocation in RL for LLMs
RL limits exploration, capping performance below base model
Dynamic rollout budget and temperature adjustment needed
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic rollout budget allocation by question difficulty
Adaptive dynamic temperature adjustment strategy
Balancing precision and exploration in RL
Mengqi Liao
Department of Computer Science, Beijing Jiaotong University
Xiangyu Xi
Peking University; Meituan Group
natural language processing, event extraction, information extraction, task-oriented dialogue
Ruinian Chen
Meituan
Jia Leng
Meituan
Yangen Hu
Meituan
Ke Zeng
Meituan
Shuai Liu
Department of Computer Science, Beijing Jiaotong University
Huaiyu Wan
Department of Computer Science, Beijing Jiaotong University