Retry Policy Gradients in Continuous Action Spaces

📅 2026-06-04
📈 Citations: 0
✨ Influential: 0

🤖 AI Summary
This work addresses effective exploration in continuous action spaces without explicit entropy regularization. The authors propose a pathwise derivative estimator that extends the ReMax objective to continuous control, yielding ReMAC (ReMax Actor-Critic), an off-policy actor-critic algorithm with no explicit entropy term. ReMAC optimizes a retry objective (e.g., max@K), which reshapes the policy-gradient landscape and implicitly raises policy entropy. Theoretical analysis shows that, even with deterministic rewards, the objective alters policy gradients in both direction, biasing updates toward higher entropy, and magnitude, damping gradients and slowing convergence. It further shows that Adam's adaptive normalization can mitigate this damping, depending on its numerical stabilization parameter. Empirically, ReMAC achieves performance comparable to Soft Actor-Critic (SAC) across multiple continuous control benchmarks.
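To make the idea of a pathwise derivative through a max@K objective concrete, here is a minimal toy sketch. It is not the paper's estimator or the ReMAC algorithm. Everything in it is an assumption made for illustration: a one-dimensional Gaussian policy, the linear reward `reward(a) = a`, and the helper names `max_at_k_pathwise_grad` and `mean_grad`. Actions are reparameterized as `a_i = mu + sigma * eps_i`, and the gradient flows only through the retry that attains the maximum.

```python
import math
import random


def reward(action):
    # Hypothetical deterministic toy reward (an assumption, not from the paper).
    return action


def reward_grad(action):
    # Derivative of the toy reward with respect to the action.
    return 1.0


def max_at_k_pathwise_grad(mu, log_sigma, k, rng):
    """Single-sample pathwise gradient of the best reward over k retries.

    Actions are reparameterized as a_i = mu + sigma * eps_i, so the gradient
    flows only through the retry that attains the maximum.
    """
    sigma = math.exp(log_sigma)
    eps = [rng.gauss(0.0, 1.0) for _ in range(k)]
    actions = [mu + sigma * e for e in eps]
    best = max(range(k), key=lambda i: reward(actions[i]))
    dr_da = reward_grad(actions[best])
    grad_mu = dr_da                             # d a_best / d mu = 1
    grad_log_sigma = dr_da * sigma * eps[best]  # d a_best / d log_sigma
    return grad_mu, grad_log_sigma


def mean_grad(mu, log_sigma, k, n_samples=20000, seed=0):
    """Monte Carlo average of the single-sample gradients."""
    rng = random.Random(seed)
    total_mu = 0.0
    total_log_sigma = 0.0
    for _ in range(n_samples):
        g_mu, g_ls = max_at_k_pathwise_grad(mu, log_sigma, k, rng)
        total_mu += g_mu
        total_log_sigma += g_ls
    return total_mu / n_samples, total_log_sigma / n_samples


if __name__ == "__main__":
    for k in (1, 4):
        g_mu, g_ls = mean_grad(mu=0.0, log_sigma=0.0, k=k)
        print(f"K={k}: d/dmu={g_mu:.3f}, d/dlog_sigma={g_ls:.3f}")
```

In this toy setting the K=1 gradient with respect to `log_sigma` averages to roughly zero, because the noise of a single sample has zero mean. For K=4 it is positive, because the selected retry tends to have the largest noise draw. That is one simple way a retry objective can push toward a wider, higher-entropy policy even though the reward is deterministic.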
📝 Abstract
Retry-based objectives such as pass@K and max@K optimize the best return obtained from multiple sampled trajectories, and recent work has shown that they can promote exploration without explicit exploration bonuses. In discrete action spaces, ReMax was shown to do so by adapting to return uncertainty. In this work, we introduce pathwise derivative estimators for retry objectives and use them to extend ReMax to continuous action spaces. We study the resulting learning dynamics and show that, even with deterministic rewards, ReMax can encourage stochastic exploration by reshaping the policy-gradient landscape. In particular, it alters gradients both in direction, biasing updates toward higher policy entropy, and in magnitude, damping gradients and slowing convergence. We further show that Adam's adaptive normalization can mitigate this damping, depending on its numerical stabilization parameter. Empirically, we instantiate this objective as ReMax Actor-Critic (ReMAC), an off-policy actor-critic algorithm that optimizes the ReMax objective using a pathwise derivative estimator. Our experiments show that ReMAC can promote higher policy entropy without entropy regularization and achieves performance comparable to SAC.
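The interaction between gradient damping and Adam's numerical stabilization parameter can be illustrated with a small sketch. This is not the paper's analysis. It applies the standard Adam update with bias correction to a constant scalar gradient, and the helper names `adam_step_sizes` and `damping_ratio` are hypothetical. For a constant gradient g, the step is approximately `lr * g / (|g| + eps)`, so shrinking g by a damping factor changes the step very little while `|g|` stays well above `eps`.

```python
import math


def adam_step_sizes(grad, n_steps=100, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Return Adam's per-step parameter change for a constant scalar gradient."""
    m = 0.0  # first-moment estimate
    v = 0.0  # second-moment estimate
    steps = []
    for t in range(1, n_steps + 1):
        m = beta1 * m + (1.0 - beta1) * grad
        v = beta2 * v + (1.0 - beta2) * grad * grad
        m_hat = m / (1.0 - beta1 ** t)  # bias-corrected first moment
        v_hat = v / (1.0 - beta2 ** t)  # bias-corrected second moment
        steps.append(lr * m_hat / (math.sqrt(v_hat) + eps))
    return steps


def damping_ratio(damping, eps):
    """Ratio of the final Adam step for a damped gradient to the undamped one."""
    full = adam_step_sizes(1.0, eps=eps)[-1]
    damped = adam_step_sizes(damping, eps=eps)[-1]
    return damped / full


if __name__ == "__main__":
    for eps in (1e-8, 1e-3):
        ratio = damping_ratio(damping=1e-6, eps=eps)
        print(f"eps={eps:g}: damped step / undamped step = {ratio:.4f}")
```

With a gradient damped by a factor of 1e-6, plain gradient descent would shrink its step by the same factor. In this sketch Adam keeps the step almost unchanged when `eps` is 1e-8, but lets most of the damping through when `eps` is 1e-3, which exceeds the damped gradient magnitude.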
Problem

Research questions and friction points this paper is trying to address.

retry policy gradients
continuous action spaces
exploration
policy entropy
ReMax
Innovation

Methods, ideas, or system contributions that make the work stand out.

ReMax
pathwise derivative estimator
continuous action spaces
policy entropy
retry-based objectives