🤖 AI Summary
This paper studies the sample complexity of the plug-in method for learning an ε-optimal policy in average-reward MDPs under a generative model: estimate the transition model from samples, then compute an average-reward optimal policy in the estimated model. Unlike the better-studied reduction to discounted MDPs, which needs either prior knowledge of problem parameters (e.g., the diameter D or the uniform mixing time τ_unif) or hyperparameter tuning, the plug-in method requires neither, yet it previously lacked any rigorous analysis. The paper fills this gap: (i) it derives span-based upper bounds that hold without knowledge of D or τ_unif; (ii) it proves matching algorithm-specific lower bounds suggesting these bounds are unimprovable; and (iii) it develops new techniques for analyzing long-horizon problems that avoid reward perturbation and remove effective-horizon-related sample size restrictions for the discounted plug-in approach. As corollaries, the plug-in method achieves the optimal diameter- and mixing-based sample complexities Õ(SA·D/ε²) and Õ(SA·τ_unif/ε²), yielding the first optimal guarantee for model-based average-reward reinforcement learning that requires no prior problem knowledge or parameter tuning.
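For reference, the quantities in the summary have standard definitions (the notation here is conventional and may differ from the paper's): the gain of a policy $\pi$ and the optimal gain/bias pair $(\rho^*, h^*)$ satisfy

$$\rho^{\pi}(s) = \liminf_{T \to \infty} \frac{1}{T}\, \mathbb{E}^{\pi}_{s}\!\left[\sum_{t=1}^{T} r(s_t, a_t)\right], \qquad \rho^{*} + h^{*}(s) = \max_{a} \Big\{ r(s,a) + \sum_{s'} P(s' \mid s, a)\, h^{*}(s') \Big\},$$

and the span-based bounds scale with $\mathrm{sp}(h^{*}) = \max_s h^{*}(s) - \min_s h^{*}(s)$, which is at most the diameter $D$ when rewards lie in $[0,1]$.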
📝 Abstract
We study the sample complexity of the plug-in approach for learning $\varepsilon$-optimal policies in average-reward Markov decision processes (MDPs) with a generative model. The plug-in approach constructs a model estimate and then computes an average-reward optimal policy in the estimated model. Despite being arguably the simplest algorithm for this problem, the plug-in approach has never been theoretically analyzed. Unlike the more well-studied discounted MDP reduction method, the plug-in approach requires no prior problem information or parameter tuning. Our results fill this gap and address the limitations of prior approaches, as we show that the plug-in approach is optimal in several well-studied settings without using prior knowledge. Specifically, it achieves the optimal diameter- and mixing-based sample complexities of $\widetilde{O}\left(SA\frac{D}{\varepsilon^2}\right)$ and $\widetilde{O}\left(SA\frac{\tau_{\mathrm{unif}}}{\varepsilon^2}\right)$, respectively, without knowledge of the diameter $D$ or uniform mixing time $\tau_{\mathrm{unif}}$. We also obtain span-based bounds for the plug-in approach and complement them with algorithm-specific lower bounds suggesting that they are unimprovable. Our results require novel techniques for analyzing long-horizon problems, which may be broadly useful and which also improve results for the discounted plug-in approach, removing effective-horizon-related sample size restrictions and obtaining the first optimal complexity bounds for the full range of sample sizes without reward perturbation.
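To make the two-step procedure concrete, here is a minimal Python sketch of the plug-in approach: build an empirical transition model from $n$ generative-model samples per state-action pair, then solve the estimated MDP for an average-reward optimal policy. The oracle `sample_next_state`, the solver choice (relative value iteration with an aperiodicity transform), and all parameter names are illustrative assumptions, not the paper's exact construction.

```python
import numpy as np

def plug_in_average_reward(sample_next_state, rewards, n, iters=100_000, tol=1e-10):
    """Plug-in sketch: sample_next_state(s, a) is an assumed generative-model
    oracle returning one draw from P(. | s, a); rewards is a known (S, A) array."""
    S, A = rewards.shape

    # Step 1: empirical transition model from n samples per (s, a).
    P_hat = np.zeros((S, A, S))
    for s in range(S):
        for a in range(A):
            for _ in range(n):
                P_hat[s, a, sample_next_state(s, a)] += 1.0
    P_hat /= n

    # Step 2: solve the estimated MDP with relative value iteration.
    # The 0.5/0.5 averaging is the standard aperiodicity transformation,
    # which guarantees convergence on unichain models (a sketch-level
    # simplification; the paper's solver may differ).
    h = np.zeros(S)  # bias (relative value) estimates
    for _ in range(iters):
        Q = rewards + P_hat @ h              # (S, A): r(s,a) + E_{s'~P_hat}[h(s')]
        h_new = 0.5 * h + 0.5 * Q.max(axis=1)
        h_new -= h_new[0]                    # pin a reference state to keep iterates bounded
        if np.max(np.abs(h_new - h)) < tol:
            h = h_new
            break
        h = h_new
    return np.argmax(rewards + P_hat @ h, axis=1)  # greedy policy in the estimated model
```

Note the absence of any discount factor, horizon, or knowledge of $D$ or $\tau_{\mathrm{unif}}$ in this sketch; the only input beyond the oracle is the per-pair sample count $n$, matching the parameter-free character the abstract emphasizes.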