Towards Efficient Online Exploration for Reinforcement Learning with Human Feedback

📅 2025-09-26
🤖 AI Summary
This work addresses the challenge of efficiently collecting human preference data in online Reinforcement Learning from Human Feedback (RLHF) so as to jointly refine the reward model and the policy. The authors propose a multi-armed bandit–based adaptive exploration mechanism that prioritizes reducing uncertainty in the reward differences most critical for policy improvement, and introduce a hyperparameter β that balances reward maximization against policy distribution shift. Theoretically, the method achieves a regret bound of order $T^{(β+1)/(β+2)}$, the first online RLHF guarantee with regret scaling polynomially in all model parameters, in contrast to existing optimism-based approaches that can incur linear regret. Empirically, it outperforms mainstream optimistic exploration methods in both data efficiency and alignment performance, establishing an online RLHF framework that is both theoretically rigorous and practically viable.

📝 Abstract
Reinforcement learning with human feedback (RLHF), which learns a reward model from human preference data and then optimizes a policy to favor preferred responses, has emerged as a central paradigm for aligning large language models (LLMs) with human preferences. In this paper, we investigate exploration principles for online RLHF, where one seeks to adaptively collect new preference data to refine both the reward model and the policy in a data-efficient manner. By examining existing optimism-based exploration algorithms, we identify a drawback in their sampling protocol: they tend to gather comparisons that fail to reduce the most informative uncertainties in reward differences, and we prove lower bounds showing that such methods can incur linear regret over exponentially long horizons. Motivated by this insight, we propose a new exploration scheme that directs preference queries toward reducing uncertainty in reward differences most relevant to policy improvement. Under a multi-armed bandit model of RLHF, we establish regret bounds of order $T^{(β+1)/(β+2)}$, where $β>0$ is a hyperparameter that balances reward maximization against mitigating distribution shift. To our knowledge, this is the first online RLHF algorithm with regret scaling polynomially in all model parameters.
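To make the regret exponent concrete: $(β+1)/(β+2)$ lies strictly between $1/2$ and $1$ for every $β>0$, so regret is always sublinear in $T$; small $β$ pushes the exponent toward $1/2$ (aggressive reward maximization), while large $β$ pushes it toward $1$ (more conservative about distribution shift). A minimal sketch, where `regret_exponent` is an illustrative helper and not a function from the paper:

```python
# Regret exponent (beta + 1) / (beta + 2) from the paper's bandit model of
# online RLHF. Illustration only: beta is the trade-off hyperparameter that
# balances reward maximization against mitigating distribution shift.
def regret_exponent(beta: float) -> float:
    assert beta > 0, "the bound is stated for beta > 0"
    return (beta + 1) / (beta + 2)

# Small beta -> exponent near 1/2; large beta -> exponent approaches 1,
# but regret T^{(beta+1)/(beta+2)} stays sublinear for any fixed beta.
for beta in (0.1, 1.0, 10.0):
    print(f"beta={beta}: regret of order T^{regret_exponent(beta):.3f}")
```

For example, $β=1$ gives regret of order $T^{2/3}$, a common rate in bandit problems with costly or indirect feedback.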
Problem

Research questions and friction points this paper is trying to address.

Explores how to collect preference data efficiently in online RLHF
Identifies limitations in existing optimism-based exploration algorithms that lead to linear regret
Proposes a new exploration scheme that reduces the reward-difference uncertainties most relevant to policy improvement
Innovation

Methods, ideas, or system contributions that make the work stand out.

Directs preference queries to reduce reward uncertainty
Establishes the first online RLHF regret bounds that scale polynomially in all model parameters
Balances reward maximization against distribution shift
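The core idea of directing queries at informative reward differences can be sketched in a toy Bradley–Terry bandit. This is a hypothetical illustration, not the paper's algorithm: it compares the current greedy arm against the arm whose reward-difference estimate is most uncertain, rather than exploring uniformly or purely optimistically.

```python
import numpy as np

# Hypothetical sketch of uncertainty-directed preference queries in a
# K-armed Bradley-Terry bandit. All names and update rules here are
# illustrative assumptions, not the paper's method.
rng = np.random.default_rng(0)
K = 5
true_reward = rng.normal(size=K)   # unknown per-arm rewards

est = np.zeros(K)                  # running estimate of signed win rate per arm
count = np.ones(K)                 # pseudo-counts (avoid division by zero)

def pick_query():
    """Pair the greedy arm with the opponent whose difference is most uncertain."""
    best = int(np.argmax(est))
    # Crude variance proxy for the difference est[best] - est[j]
    diff_var = 1.0 / count[best] + 1.0 / count
    diff_var[best] = -np.inf       # never compare an arm with itself
    return best, int(np.argmax(diff_var))

def observe(i, j):
    """Draw Bradley-Terry feedback P(i beats j) = sigmoid(r_i - r_j), update estimates."""
    p = 1.0 / (1.0 + np.exp(-(true_reward[i] - true_reward[j])))
    win = rng.random() < p
    for arm, outcome in ((i, 1.0 if win else -1.0), (j, -1.0 if win else 1.0)):
        count[arm] += 1.0
        est[arm] += (outcome - est[arm]) / count[arm]   # running mean of +/-1 outcomes

for _ in range(2000):
    i, j = pick_query()
    observe(i, j)

print("greedy arm after 2000 preference queries:", int(np.argmax(est)))
```

The query rule concentrates comparisons on pairs whose reward difference is still uncertain and involves the current best arm, which is the flavor of "reducing the uncertainties most relevant to policy improvement"; the paper's actual scheme additionally weighs this against distribution shift via β.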
Gen Li
Department of Statistics and Data Science, The Chinese University of Hong Kong, Hong Kong
Yuling Yan
Assistant Professor, University of Wisconsin-Madison
Statistics · Optimization · Reinforcement Learning · Diffusion Model