Emergence of Exploration in Policy Gradient Reinforcement Learning via Retrying

📅 2026-05-28

📈 Citations: 0

✨ Influential: 0


🤖 AI Summary

Traditional policy gradient methods perform poorly in sparse-reward environments because they lack an effective exploration mechanism. This work proposes the ReMax objective, which evaluates a policy by the expected maximum return over M retries, so that exploration emerges from policy optimization itself rather than from explicit exploration bonuses. The key innovation is to cast the retry mechanism as a differentiable objective and to generalize the discrete number of retries M to a continuous parameter m > 0, which gives fine-grained control over exploration intensity. Building on policy gradient theory, the authors derive a ReMax gradient estimator and integrate it into the PPO framework, yielding the ReMax PPO (RePPO) algorithm. Experiments on the MinAtar and Craftax benchmarks show that RePPO promotes exploration without any additional exploration incentives.
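
One way to write the objective described above is sketched below. This is an illustrative reading, not the paper's stated formulation: the symbols $G_i$ (i.i.d. returns under policy $\pi$), $p_\pi$ (the return distribution), and $F_\pi$ (its CDF) are assumptions introduced here, and the sketch ignores the return-uncertainty term mentioned in the abstract. The continuous form is one natural extension of the integer case and may differ from the one the authors use.

```latex
% Integer retry count M: expected maximum over M i.i.d. returns
J_M(\pi)
  = \mathbb{E}_{G_1,\dots,G_M \sim p_\pi}\!\left[\max_{1 \le i \le M} G_i\right]
  = \int g \,\mathrm{d}\!\left[F_\pi(g)^{M}\right]

% Assumed continuous extension: replace the integer M by a real m > 0
J_m(\pi) = \int g \,\mathrm{d}\!\left[F_\pi(g)^{m}\right], \qquad m > 0

% Sanity check: m = 1 recovers the standard expected return
J_1(\pi) = \int g \,\mathrm{d}F_\pi(g) = \mathbb{E}_{G \sim p_\pi}[G]
```

The first equality holds because the maximum of $M$ i.i.d. draws has CDF $F_\pi(g)^M$. Under this reading, larger $m$ puts more weight on the upper tail of the return distribution, which is consistent with the summary's claim that the parameter controls exploration intensity.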

📝 Abstract

In reinforcement learning (RL), agents benefit from exploration only because they repeatedly encounter similar states: trying different actions can improve performance or reduce uncertainty; without such retries, a greedy policy is optimal. We formalize this intuition with ReMax, an objective that evaluates a policy by the expected maximum return over $M$ samples, where $M$ is a positive integer, while accounting for return uncertainty. Optimizing this objective induces stochastic exploration as an emergent property, without explicit bonus terms. For efficient policy optimization, we derive a new policy-gradient formulation for ReMax and introduce ReMax PPO (RePPO), a PPO variant that optimizes ReMax while generalizing the discrete retry count $M$ to a continuous parameter $m > 0$, enabling fine-grained control of exploration. Empirically, RePPO promotes exploration, without any explicit exploration bonuses, on the MinAtar and Craftax benchmarks.
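
To make the "expected maximum return over $M$ samples" idea concrete, the following is a minimal sketch of a plug-in estimate computed from a batch of sampled returns. It is not the paper's estimator or gradient formulation: the function name `remax_estimate` is hypothetical, the estimate treats the empirical distribution of the batch as the true return distribution, and it leaves out the return-uncertainty handling mentioned in the abstract. Allowing a non-integer `m` here is an assumed way to mirror the continuous parameter $m > 0$.

```python
def remax_estimate(returns, m):
    """Plug-in estimate of the expected maximum over m draws (hypothetical helper).

    Treats the sampled returns as the full return distribution. With the
    returns sorted ascending, the i-th smallest of n values is the maximum
    of m independent draws (with replacement) with probability
    (i/n)^m - ((i-1)/n)^m. The same weights are well defined for any real
    m > 0, which is the assumed continuous extension used in this sketch.
    """
    if m <= 0:
        raise ValueError("m must be positive")
    ordered = sorted(float(g) for g in returns)
    n = len(ordered)
    if n == 0:
        raise ValueError("returns must be non-empty")

    total = 0.0
    for i, g in enumerate(ordered, start=1):
        weight = (i / n) ** m - ((i - 1) / n) ** m
        total += weight * g
    return total


# A sparse-reward style batch: three failures and one success.
batch = [0.0, 0.0, 0.0, 1.0]
mean_return = remax_estimate(batch, 1)      # m = 1 is the ordinary mean
two_retries = remax_estimate(batch, 2)      # chance of at least one success in 2 tries
half_retry = remax_estimate(batch, 0.5)     # m < 1 down-weights the rare success
```

With `m = 1` the weights are uniform and the estimate is the sample mean. As `m` grows, the rare successful return dominates the estimate, which illustrates why a policy scored this way can gain from trying varied actions rather than acting greedily.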

Problem

Research questions and friction points this paper is trying to address.

exploration

policy gradient

reinforcement learning

retrying

stochastic policy

Innovation

Methods, ideas, or system contributions that make the work stand out.

ReMax

policy gradient

emergent exploration