Emergence of Exploration in Policy Gradient Reinforcement Learning via Retrying

📅 2026-05-28
📈 Citations: 0
Influential: 0

🤖 AI Summary
Traditional policy gradient methods suffer from limited performance in sparse-reward environments due to the absence of effective exploration mechanisms. This work proposes the ReMax objective, which naturally incorporates exploration into policy optimization by maximizing the expected maximum return over $M$ retries, without requiring explicit exploration bonuses. The key innovation lies in formally casting the retry mechanism as a differentiable objective and generalizing the discrete number of retries to a continuous parameter, enabling fine-grained control over exploration intensity. Building on policy gradient theory, the authors derive the ReMax gradient estimator and integrate it into the PPO framework, yielding the ReMax PPO (RePPO) algorithm. On the MinAtar and Craftax benchmarks, RePPO is reported to promote exploration without any explicit exploration bonus.
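Read literally, the retry objective can be written as below. This is a sketch rather than the paper's exact formulation: the symbols $G$ (return of a trajectory $\tau$) and $F_\pi$ (CDF of $G$ under policy $\pi$) are notation assumed here, and the return-uncertainty handling mentioned in the abstract is left out.

```latex
% Expected best return over M independent retries under policy \pi
J_M(\pi) \;=\; \mathbb{E}_{\tau_1,\dots,\tau_M \sim \pi}
  \Big[ \max_{1 \le i \le M} G(\tau_i) \Big],
\qquad
J_1(\pi) \;=\; \mathbb{E}_{\tau \sim \pi}\big[ G(\tau) \big].

% The maximum of M i.i.d. draws has CDF F_\pi^M, so equivalently
J_M(\pi) \;=\; \int g \; \mathrm{d}\big( F_\pi(g)^{M} \big).
```

With $M = 1$ the objective reduces to the usual expected return. The second form remains well defined when the integer $M$ is replaced by a real $m > 0$, which is one plausible route to a continuous retry parameter; whether RePPO uses exactly this construction is not stated in the text above.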
📝 Abstract
In reinforcement learning (RL), agents benefit from exploration only because they repeatedly encounter similar states: trying different actions can improve performance or reduce uncertainty; without such retries, a greedy policy is optimal. We formalize this intuition with ReMax, an objective that evaluates a policy by the expected maximum return over $M$ samples, where $M$ is a positive integer, while accounting for return uncertainty. Optimizing this objective induces stochastic exploration as an emergent property, without explicit bonus terms. For efficient policy optimization, we derive a new policy-gradient formulation for ReMax and introduce ReMax PPO (RePPO), a PPO variant that optimizes ReMax while generalizing the discrete retry count $M$ to a continuous parameter $m > 0$, enabling fine-grained control of exploration. Empirically, RePPO promotes exploration, without any explicit exploration bonuses, on the MinAtar and Craftax benchmarks.
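The claim that retries make stochastic policies preferable can be checked on a toy problem. The sketch below is not taken from the paper: it assumes a hypothetical two-armed bandit (a safe arm that always returns 1, and a risky arm that returns 3 with probability 0.3 and 0 otherwise), and it extends the retry count to real values by raising the return CDF to the power `m`, which is an assumption about how the continuous parameter might be defined. The function names `remax_value` and `best_policy` are illustrative.

```python
def remax_value(p_risky, m):
    """Expected maximum return over m i.i.d. draws in a toy two-armed bandit.

    Assumed setup (illustrative, not from the paper):
      - safe arm:  return 1 with certainty
      - risky arm: return 3 with probability 0.3, else 0 (mean 0.9)
    The policy picks the risky arm with probability p_risky.
    For non-integer m the CDF-power form is used as an assumed generalization.
    """
    cdf_below_3 = 1.0 - 0.3 * p_risky  # P(single return < 3)
    cdf_below_1 = 0.7 * p_risky        # P(single return < 1), i.e. return == 0
    # The max of m draws is below a threshold iff every draw is below it.
    prob_max_is_3 = 1.0 - cdf_below_3 ** m
    prob_max_is_1 = cdf_below_3 ** m - cdf_below_1 ** m
    return 3.0 * prob_max_is_3 + 1.0 * prob_max_is_1


def best_policy(m, grid=1001):
    """Grid search for the risky-arm probability that maximizes remax_value."""
    candidates = [i / (grid - 1) for i in range(grid)]
    return max(candidates, key=lambda p: remax_value(p, m))


if __name__ == "__main__":
    for m in (1, 1.5, 2, 4):
        p_star = best_policy(m)
        print(f"m={m}: best p_risky={p_star:.3f}, value={remax_value(p_star, m):.4f}")
```

In this toy setting, a single attempt (`m = 1`) is best served by the greedy policy that always plays the safe arm, while two retries favor a stochastic mixture that plays the risky arm with probability close to 0.9. Intermediate values such as `m = 1.5` also yield a mixed optimum, which illustrates how a continuous retry parameter could tune exploration intensity.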
Problem

Research questions and friction points this paper is trying to address.

exploration
policy gradient
reinforcement learning
retrying
stochastic policy
Innovation

Methods, ideas, or system contributions that make the work stand out.

ReMax
policy gradient
emergent exploration
RePPO
retry-based optimization