🤖 AI Summary
Existing deep reinforcement learning exploration methods exhibit poor generalization across dense, sparse, and exploration-suppressing reward settings (e.g., those with action costs) and rely heavily on hyperparameter and noise-distribution tuning.
Method: We propose Stable Error-seeking Exploration (SEE), which revisits maximization of the temporal-difference (TD) error as a standalone exploration objective. SEE mitigates the instability of TD-error-driven exploration through three design choices, addressing far-off-policy learning, the conflict of interest in maximizing cumulative TD-error in an episodic setting, and the non-stationary nature of TD-errors. It integrates with the Soft Actor-Critic (SAC) framework without modifying SAC’s original optimization pipeline.
Contribution/Results: Empirical evaluation shows that a SAC agent augmented with SEE explores robustly across dense, sparse, and exploration-adverse reward settings without hyperparameter adjustments, improving cross-task robustness while preserving algorithmic simplicity and training stability.
📝 Abstract
Numerous heuristics and advanced approaches have been proposed for exploration in different settings for deep reinforcement learning. Noise-based exploration generally fares well with dense-shaped rewards, and bonus-based exploration with sparse rewards. However, these methods usually require additional tuning to deal with undesirable reward settings by adjusting hyperparameters and noise distributions. Rewards that actively discourage exploration, i.e., with an action cost and no other dense signal to follow, can pose a major challenge. We propose a novel exploration method, Stable Error-seeking Exploration (SEE), that is robust across dense, sparse, and exploration-adverse reward settings. To this end, we revisit the idea of maximizing the TD-error as a separate objective. Our method introduces three design choices to mitigate instability caused by far-off-policy learning, the conflict of interest of maximizing the cumulative TD-error in an episodic setting, and the non-stationary nature of TD-errors. SEE can be combined with off-policy algorithms without modifying the optimization pipeline of the original objective. In our experimental analysis, we show that a Soft Actor-Critic agent with the addition of SEE performs robustly across three diverse reward settings in a variety of tasks without hyperparameter adjustments.
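To make the core quantity concrete, here is a minimal, hedged sketch of the one-step TD error that SEE treats as an exploration signal. This is a toy tabular illustration only; the paper's actual formulation (function approximation, SAC integration, and the three stabilizing design choices) is not reproduced here, and the function name `td_error` is ours.

```python
import numpy as np

def td_error(q, r, s, a, s_next, gamma=0.99, done=False):
    """One-step TD error: delta = r + gamma * max_a' Q(s', a') - Q(s, a).

    An error-seeking explorer in the spirit of SEE would favor transitions
    with large |delta|; this toy version only computes the quantity itself.
    """
    target = r + (0.0 if done else gamma * np.max(q[s_next]))
    return target - q[s, a]

# Toy Q-table: 3 states, 2 actions, initialized to zero.
q = np.zeros((3, 2))

# A surprising reward produces a nonzero TD error, i.e. an exploration
# signal worth pursuing under a TD-error-maximizing objective.
delta = td_error(q, r=1.0, s=0, a=1, s_next=2)  # -> 1.0 here
```

Note that this naive signal is non-stationary (it shrinks as Q is updated) and episodic maximization of it conflicts with reaching terminal states, which is exactly the instability the paper's design choices address.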