🤖 AI Summary
Experience replay in reinforcement learning lacks statistical foundations, leading to high-variance, sample-inefficient policy evaluation in small-sample regimes.
Method: This paper establishes the first statistical modeling framework for experience replay, rigorously characterizing it as a variance-reduction mechanism grounded in U- and V-statistics. We extend this framework to model-free policy evaluation algorithms—including Least-Squares Temporal Difference (LSTD) and PDE-based methods—and integrate it with kernel ridge regression.
Contribution/Results: Theoretical analysis provides rigorous statistical guarantees on bias–variance trade-offs and convergence. Empirically, the approach significantly improves estimation stability and reduces the computational complexity of kernel ridge regression from $O(n^3)$ to as low as $O(n^2)$, while simultaneously lowering variance. Experiments validate the theoretical claims and demonstrate strong cross-task generalization. This work introduces a new paradigm for small-sample reinforcement learning and nonparametric regression—combining statistical rigor with computational feasibility.
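To make the core idea concrete, here is a minimal illustrative sketch (not the paper's exact estimator) of viewing experience replay as a resampled U-/V-statistic: repeatedly drawing index pairs from a replay buffer and averaging a symmetric kernel over the replayed pairs approximates the complete (and expensive) U-statistic. The kernel `h(x, y) = (x - y)^2 / 2` below is a standard order-two kernel for the variance; all variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Replay buffer: n observations (e.g. returns collected by a policy).
n = 500
buffer = rng.normal(loc=1.0, scale=2.0, size=n)

# Symmetric kernel of order two: h(x, y) = (x - y)^2 / 2 is an
# unbiased kernel for Var(X), so the complete U-statistic
#   U_n = (n choose 2)^{-1} * sum_{i < j} h(x_i, x_j)
# estimates the variance but requires O(n^2) kernel evaluations.
def h(x, y):
    return 0.5 * (x - y) ** 2

# Experience replay viewed as a resampled (incomplete) statistic:
# draw B index pairs with replacement from the buffer and average
# the kernel over the replayed pairs. Allowing i == j makes this a
# resampled V-statistic; averaging over many replays drives the
# Monte Carlo variance down toward that of the complete statistic.
B = 20_000  # number of replayed pairs
i = rng.integers(0, n, size=B)
j = rng.integers(0, n, size=B)
replay_estimate = np.mean(h(buffer[i], buffer[j]))

print(replay_estimate)  # close to the buffer's sample variance
```

The same resample-and-average pattern underlies the policy-evaluation estimators: replayed mini-batches play the role of the resampled index sets.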
📝 Abstract
Experience replay is a foundational technique in reinforcement learning that enhances learning stability by storing past experiences in a replay buffer and reusing them during training. Despite its practical success, its theoretical properties remain underexplored. In this paper, we present a theoretical framework that models experience replay using resampled $U$- and $V$-statistics, providing rigorous variance reduction guarantees. We apply this framework to policy evaluation tasks using the Least-Squares Temporal Difference (LSTD) algorithm and a Partial Differential Equation (PDE)-based model-free algorithm, demonstrating significant improvements in stability and efficiency, particularly in data-scarce scenarios. Beyond policy evaluation, we extend the framework to kernel ridge regression, showing that the experience replay-based method reduces the computational cost from the traditional $O(n^3)$ to as low as $O(n^2)$ while simultaneously reducing variance. Extensive numerical experiments validate our theoretical findings, demonstrating the broad applicability and effectiveness of experience replay in diverse machine learning tasks.
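The claimed $O(n^3) \to O(n^2)$ reduction for kernel ridge regression can be illustrated with a simple subsample-and-average scheme, sketched below under stated assumptions (it is a generic replay-style estimator, not necessarily the authors' exact construction): each replay fits KRR on a subset of size $m \approx \sqrt{n}$ at cost $O(m^3)$, and averaging $B = n/m$ such fits costs $O(n m^2) = O(n^2)$ total, versus $O(n^3)$ for one full-sample solve. All function names and hyperparameters here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

def rbf(X, Y, gamma=1.0):
    """Gaussian (RBF) kernel matrix between row sets X and Y."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def krr_fit_predict(Xtr, ytr, Xte, lam=1e-2):
    """Standard kernel ridge regression: one O(m^3) linear solve
    for m training points, then prediction on the test inputs."""
    m = len(Xtr)
    K = rbf(Xtr, Xtr)
    alpha = np.linalg.solve(K + lam * m * np.eye(m), ytr)
    return rbf(Xte, Xtr) @ alpha

# Synthetic regression data standing in for the "replay buffer".
n = 400
X = rng.uniform(-3, 3, size=(n, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=n)
Xte = np.linspace(-3, 3, 50)[:, None]

# Replay-style KRR: repeatedly sample subsets of size m ~ sqrt(n),
# fit on each subset (O(m^3) apiece), and average the predictions.
# Total cost with B = n/m replays is O(n * m^2) = O(n^2),
# and averaging across replays also reduces variance.
m = int(np.sqrt(n))  # subset size per replay
B = n // m           # number of replayed subsets
preds = np.zeros(len(Xte))
for _ in range(B):
    idx = rng.choice(n, size=m, replace=False)
    preds += krr_fit_predict(X[idx], y[idx], Xte)
preds /= B

full = krr_fit_predict(X, y, Xte)  # the O(n^3) full-sample baseline
```

The averaged predictor tracks the full-sample fit closely on this smooth target while never solving a linear system larger than $m \times m$, which is the source of the complexity gain.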