🤖 AI Summary
This work addresses the dual challenges of catastrophic forgetting and the high cost of reward annotation in continual reinforcement learning by proposing a method that learns perception-driven, progress-aware rewards from a small number of unlabeled expert demonstration videos. The approach integrates a state-potential-based reward model with adversarial push-back regularization to mitigate distributional shift, and unifies reward learning, PPO, coreset experience replay, and synaptic intelligence within a natively differentiable JAX framework to enable efficient and stable lifelong learning. Evaluated on ContinualBench and Meta-World, the method significantly reduces forgetting, accelerates learning, and outperforms existing visual reward and continual learning baselines, even surpassing an idealized perfect-memory agent, while demonstrating strong few-shot skill acquisition on a real-world robotic platform.
📝 Abstract
We present ProgAgent, a continual reinforcement learning (CRL) agent that unifies progress-aware reward learning with a high-throughput, JAX-native system architecture. Lifelong robotic learning grapples with catastrophic forgetting and the high cost of reward specification. ProgAgent tackles both by deriving dense, shaped rewards from unlabeled expert videos through a perceptual model that estimates task progress across initial, current, and goal observations. We theoretically interpret this as a learned state-potential function, delivering robust guidance consistent with expert behavior. To maintain stability during online exploration, where novel, out-of-distribution states arise, we incorporate an adversarial push-back refinement that regularizes the reward model, curbing overconfident predictions on non-expert trajectories and countering distribution shift. By embedding this reward mechanism into a JIT-compiled loop, ProgAgent supports massively parallel rollouts and fully differentiable updates, making a sophisticated unified objective feasible: it merges PPO with coreset replay and synaptic intelligence for an improved stability-plasticity balance. Evaluations on the ContinualBench and Meta-World benchmarks highlight ProgAgent's advantages: it markedly reduces forgetting, boosts learning speed, and outperforms key baselines in visual reward learning (e.g., Rank2Reward, TCN) and continual learning (e.g., Coreset, SI), surpassing even an idealized perfect-memory agent. Real-robot trials further validate its ability to acquire complex manipulation skills from noisy, few-shot human demonstrations.
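The abstract's interpretation of the progress estimator as a learned state-potential function follows the standard potential-based shaping form, \(r_{\text{shaped}} = \gamma\,\phi(s') - \phi(s)\), which is known to preserve optimal policies. Below is a minimal JAX sketch of that idea; the `progress_potential` stub (distance-ratio scoring between initial, current, and goal observations) is a hypothetical stand-in for the paper's perceptual model, and all function names are ours, not ProgAgent's API.

```python
import jax.numpy as jnp

def progress_potential(obs0, obs_t, obs_goal):
    # Hypothetical stand-in for the learned perceptual model: scores task
    # progress in [0, 1] by comparing how much of the initial-to-goal
    # distance remains at the current observation.
    d_total = jnp.linalg.norm(obs_goal - obs0) + 1e-8
    d_left = jnp.linalg.norm(obs_goal - obs_t)
    return 1.0 - jnp.clip(d_left / d_total, 0.0, 1.0)

def shaped_reward(obs_t, obs_next, obs0, obs_goal, gamma=0.99):
    # Potential-based shaping: gamma * phi(s') - phi(s). Dense reward is
    # positive when the transition moves the agent toward the goal.
    phi_t = progress_potential(obs0, obs_t, obs_goal)
    phi_next = progress_potential(obs0, obs_next, obs_goal)
    return gamma * phi_next - phi_t
```

With a 1-D toy state space where `obs0 = 0` and `obs_goal = 1`, a step from `0.2` to `0.5` yields a positive shaped reward, while a step away from the goal yields a negative one.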
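The unified objective merges PPO with coreset replay and synaptic intelligence (SI). The SI component is typically a quadratic penalty anchoring parameters that were important for earlier tasks. A minimal JAX sketch of that regularizer, with all names hypothetical (the paper does not expose this API):

```python
import jax
import jax.numpy as jnp

def si_penalty(params, params_star, omega, c=0.1):
    # Synaptic-intelligence regularizer: per-parameter importance weights
    # `omega` (accumulated over earlier tasks) penalize drift away from
    # `params_star`, the parameter values at the end of the previous task.
    sq = jax.tree_util.tree_map(
        lambda p, p0, w: w * (p - p0) ** 2, params, params_star, omega)
    return c * sum(jnp.sum(leaf) for leaf in jax.tree_util.tree_leaves(sq))
```

In a combined loss this term would be added to the PPO objective (computed over both fresh rollouts and replayed coreset transitions), so gradients trade off plasticity on the current task against stability on previous ones.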