🤖 AI Summary
This paper addresses asymptotically optimal regret minimization for communicating average-reward Markov decision processes (MDPs). To overcome a limitation of existing algorithms, namely their failure to achieve the information-theoretic regret lower bound, we propose the first online learning algorithm attaining asymptotically optimal logarithmic regret $K(M)\log(T) + \mathrm{o}(\log(T))$. Methodologically, we introduce the novel "co-exploration" paradigm, establishing a three-way trade-off among exploration, co-exploration, and exploitation. To handle the discontinuity of the optimal constant $K(M)$, we incorporate regularization and empirically driven estimation, enabling approximation of $K(M)$ to arbitrary precision. We provide a rigorous theoretical analysis proving that the regret matches the information-theoretic lower bound. Empirical evaluations demonstrate substantial improvements over baselines such as UCRL2, both in the accuracy of the $K(M)$ estimate and in convergence speed.
📝 Abstract
In this paper, we present a learning algorithm that achieves asymptotically optimal regret for average-reward Markov decision processes under a communicating assumption. That is, given a communicating Markov decision process $M$, our algorithm has regret $K(M)\log(T) + \mathrm{o}(\log(T))$, where $T$ is the number of learning steps and $K(M)$ is the best possible constant. The algorithm works by explicitly tracking the constant $K(M)$ in order to learn optimally, balancing the trade-off between exploration (playing sub-optimally to gain information), co-exploration (playing optimally to gain information) and exploitation (playing optimally to score maximally). We further show that the function $K(M)$ is discontinuous, which poses a challenge for our approach. To address this, we describe a regularization mechanism that estimates $K(M)$ with arbitrary precision from empirical data.
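The three-way trade-off described above can be illustrated with a toy phase selector. This is a minimal sketch, not the paper's actual decision rule: the function name `choose_phase`, the boolean `info_need` flag, and the logarithmic exploration budget are all illustrative assumptions, standing in for the algorithm's real information and regret accounting.

```python
import math

def choose_phase(t: int, info_need: bool, explore_steps_used: int) -> str:
    """Toy selector for the exploration / co-exploration / exploitation
    trade-off. Purely illustrative, not the paper's rule.

    - "explore": play sub-optimally to gain information; rationed so that
      sub-optimal play stays within an O(log t) budget (matching the
      logarithmic regret scaling).
    - "co-explore": play optimally while information is still needed.
    - "exploit": play optimally to score maximally once enough is known.
    """
    if info_need and explore_steps_used < math.log(t + 2):
        return "explore"
    if info_need:
        return "co-explore"
    return "exploit"
```

Under this toy rule, forced exploration becomes vanishingly rare as $t$ grows, so almost all information must eventually come from co-exploration, i.e., from optimal play itself.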