🤖 AI Summary
Off-policy reinforcement learning algorithms such as Soft Actor-Critic (SAC) lack non-vacuous generalization guarantees because of the temporal dependencies inherent in Markov decision processes, particularly when state-action sequences mix slowly. Method: The paper introduces a PAC-Bayesian analysis that explicitly incorporates the mixing time of the underlying Markov chain to quantify the weak dependence among state-action trajectories. Contribution/Results: It derives the first non-vacuous, mixing-time-coupled generalization bound applicable to modern off-policy actor-critic algorithms. Leveraging this bound, the authors propose PB-SAC, a bound-guided variant of SAC that retains SAC's empirical performance while providing computable, meaningful confidence guarantees on generalization error. Empirical evaluation on continuous-control benchmarks confirms that PB-SAC matches SAC's performance while delivering theoretically grounded, quantifiable generalization assurances, demonstrating the practical efficacy of theory-guided algorithm design.
📝 Abstract
We derive a novel PAC-Bayesian generalization bound for reinforcement learning that explicitly accounts for Markov dependencies in the data through the chain's mixing time. This contributes to overcoming a central challenge in obtaining generalization guarantees for reinforcement learning: the sequential nature of the data breaks the independence assumptions underlying classical bounds. Our bound provides non-vacuous certificates for modern off-policy algorithms such as Soft Actor-Critic. We demonstrate the bound's practical utility through PB-SAC, a novel algorithm that optimizes the bound during training to guide exploration. Experiments across continuous control tasks show that our approach provides meaningful confidence certificates while maintaining competitive performance.
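To make the core idea concrete, here is a minimal, hypothetical sketch of how a mixing-time-coupled PAC-Bayes certificate might be evaluated. It uses a standard McAllester-style bound in which dependent samples are discounted to an effective sample size `n / t_mix`; the function name, the specific bound form, and the discounting rule are illustrative assumptions, not the paper's exact derivation.

```python
import math

def pac_bayes_bound(empirical_risk, kl_divergence, n_samples,
                    mixing_time, delta=0.05):
    """Illustrative McAllester-style PAC-Bayes bound with a mixing-time
    correction (assumed form, not the paper's exact bound).

    empirical_risk : observed risk of the posterior policy in [0, 1]
    kl_divergence  : KL(posterior || prior) over policy parameters
    n_samples      : number of (dependent) state-action samples
    mixing_time    : mixing time of the underlying Markov chain;
                     slower mixing -> fewer effective samples
    delta          : failure probability of the certificate
    """
    # Discount dependent data to an effective i.i.d.-like sample size.
    n_eff = n_samples / mixing_time
    # Standard McAllester complexity term, evaluated at n_eff.
    complexity = (kl_divergence + math.log(2.0 * math.sqrt(n_eff) / delta)) / (2.0 * n_eff)
    return empirical_risk + math.sqrt(complexity)
```

As expected, the certificate loosens as mixing slows: for fixed risk, KL, and sample count, a larger `mixing_time` shrinks `n_eff` and inflates the complexity term, which is precisely why the bound must be coupled to the chain's mixing behavior to stay non-vacuous.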