Improving Plasticity in Non-stationary Reinforcement Learning with Evidential Proximal Policy Optimization

📅 2025-03-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address critic overfitting and the resulting loss of plasticity in non-stationary reinforcement learning, this paper proposes Evidential Proximal Policy Optimization (EPPO). EPPO is the first to integrate evidential deep learning into an on-policy framework: an evidential neural network explicitly models both the epistemic and aleatoric uncertainty in the critic's value estimates, enabling probabilistic advantage estimation and uncertainty-aware optimistic exploration, while the evidential loss acts as a Bayesian regularizer that preserves network plasticity during adaptation. Evaluated on continuous control tasks whose dynamics shift at regular intervals, EPPO achieves significantly higher task-specific and overall returns than mainstream on-policy baselines, including PPO and TRPO. Key contributions: (i) the first evidential learning framework tailored for on-policy RL; (ii) uncertainty-decomposition-driven probabilistic advantage estimation; and (iii) a mechanism that jointly preserves plasticity and improves the policy.
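
To make the mechanism concrete, here is a minimal sketch (not the authors' code) of an evidential critic head in the style of deep evidential regression (Amini et al., 2020), the family of models EPPO draws on. The head predicts the parameters of a Normal-Inverse-Gamma distribution over the value, from which aleatoric and epistemic variance follow in closed form; the layer sizes, activations, and the `EvidentialCritic` name are illustrative assumptions.

```python
# Hedged sketch of an evidential value (critic) network, assuming the
# standard Normal-Inverse-Gamma parameterization of deep evidential
# regression; an illustration, not EPPO's exact architecture.
import torch
import torch.nn as nn
import torch.nn.functional as F

class EvidentialCritic(nn.Module):
    def __init__(self, obs_dim: int, hidden: int = 64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
        )
        # One output per NIG parameter: gamma (mean), nu, alpha, beta.
        self.head = nn.Linear(hidden, 4)

    def forward(self, obs: torch.Tensor):
        gamma, log_nu, log_alpha, log_beta = self.head(self.body(obs)).unbind(-1)
        nu = F.softplus(log_nu)              # nu > 0
        alpha = F.softplus(log_alpha) + 1.0  # alpha > 1 keeps variances finite
        beta = F.softplus(log_beta)          # beta > 0
        return gamma, nu, alpha, beta

def decompose_uncertainty(nu, alpha, beta):
    # Closed-form moments of the Normal-Inverse-Gamma posterior:
    aleatoric = beta / (alpha - 1)         # E[sigma^2]: noise in the returns
    epistemic = beta / (nu * (alpha - 1))  # Var[mu]: doubt about the value itself
    return aleatoric, epistemic
```

This decomposition is what the summary refers to: the critic's total predictive variance splits into an aleatoric part (irreducible return noise) and an epistemic part (model uncertainty), and only the latter should shrink as more data is gathered.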

📝 Abstract
On-policy reinforcement learning algorithms use the most recently learned policy to interact with the environment and update it using the latest gathered trajectories, making them well-suited for adapting to non-stationary environments where dynamics change over time. However, previous studies show that they struggle to maintain plasticity – the ability of neural networks to adjust their synaptic connections – with overfitting identified as the primary cause. To address this, we present the first application of evidential learning in an on-policy reinforcement learning setting: Evidential Proximal Policy Optimization (EPPO). EPPO incorporates all sources of error in the critic network's approximation – i.e., the baseline function in advantage calculation – by modeling the epistemic and aleatoric uncertainty contributions to the approximation's total variance. We achieve this by using an evidential neural network, which serves as a regularizer to prevent overfitting. The resulting probabilistic interpretation of the advantage function enables optimistic exploration, thus maintaining plasticity. Through experiments on non-stationary continuous control tasks, where the environment dynamics change at regular intervals, we demonstrate that EPPO outperforms state-of-the-art on-policy reinforcement learning variants in both task-specific and overall return.
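
One plausible way to realize the "probabilistic interpretation of the advantage function" mentioned above, sketched under assumptions since the paper's exact estimator is not given here, is to treat the critic's value prediction as Gaussian and take an upper-confidence advantage: inflating the advantage by a multiple of the epistemic standard deviation makes uncertain states look more promising, a standard route to optimistic exploration. The `kappa` weight is hypothetical.

```python
# Hedged sketch of uncertainty-aware optimistic advantage estimation; the
# formulation EPPO actually uses may differ.
import torch

def optimistic_advantage(returns, gamma, nu, alpha, beta, kappa: float = 1.0):
    # returns: empirical returns G_t; gamma: the critic's value mean V(s_t).
    epistemic_std = torch.sqrt(beta / (nu * (alpha - 1)))
    # Optimism in the face of uncertainty: larger advantage where the critic
    # is epistemically unsure, encouraging the policy to revisit such states.
    return returns - gamma + kappa * epistemic_std
```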
Problem

Research questions and friction points this paper is trying to address.

How to maintain plasticity in non-stationary reinforcement learning environments.
How to prevent the critic overfitting identified as the primary cause of plasticity loss.
How to sustain returns on continuous control tasks whose dynamics change over time.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces Evidential Proximal Policy Optimization (EPPO), the first application of evidential learning to on-policy RL.
An evidential critic models epistemic and aleatoric uncertainty and acts as a regularizer against overfitting (a loss sketch follows this list).
A probabilistic interpretation of the advantage function enables optimistic exploration, which maintains plasticity.
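
To ground the regularization claim in the second bullet, below is a minimal sketch of the Normal-Inverse-Gamma negative log-likelihood plus evidence penalty from deep evidential regression (Amini et al., 2020), the objective family evidential critics are typically trained with. Whether EPPO uses exactly this loss, and the weight `lam`, are assumptions.

```python
# Hedged sketch of the evidential (NIG) regression loss; the penalty term
# lowers evidence on poorly predicted targets, which is the regularizing
# effect credited with preventing critic overfitting.
import torch

def evidential_loss(y, gamma, nu, alpha, beta, lam: float = 0.01):
    omega = 2.0 * beta * (1.0 + nu)
    nll = (
        0.5 * torch.log(torch.pi / nu)
        - alpha * torch.log(omega)
        + (alpha + 0.5) * torch.log((y - gamma) ** 2 * nu + omega)
        + torch.lgamma(alpha)
        - torch.lgamma(alpha + 0.5)
    )
    # Evidence penalty: confident (high nu, high alpha) mistakes cost more.
    reg = torch.abs(y - gamma) * (2.0 * nu + alpha)
    return (nll + lam * reg).mean()
```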
👥 Authors
Abdullah Akgul, Department of Mathematics and Computer Science, University of Southern Denmark, Denmark
Gulcin Baykal, University of Southern Denmark (Representation Learning, Reinforcement Learning)
Manuel Haussmann, University of Southern Denmark (Machine Learning, Bayesian Deep Learning, Probabilistic Modelling, Reinforcement Learning)
M. Kandemir, Department of Mathematics and Computer Science, University of Southern Denmark, Denmark