🤖 AI Summary
This paper addresses asymptotically optimal regret minimization for communicating average-reward Markov decision processes (MDPs). To overcome a limitation of existing algorithms, namely their failure to achieve the information-theoretic regret lower bound, we propose the first online learning algorithm attaining asymptotically optimal logarithmic regret $K(M)\log(T) + \mathrm{o}(\log(T))$. Methodologically, we introduce the novel "co-exploration" paradigm, establishing a three-way trade-off among exploration, co-exploration, and exploitation. To handle the discontinuity of the optimal constant $K(M)$, we incorporate regularization and empirically driven estimation, enabling approximation of $K(M)$ to arbitrary precision. We provide a rigorous theoretical analysis proving that the regret matches the information-theoretic lower bound. Empirical evaluations demonstrate substantial improvements over baselines such as UCRL2, both in the accuracy of the $K(M)$ estimate and in convergence speed.
📝 Abstract
In this paper, we present a learning algorithm that achieves asymptotically optimal regret for average-reward Markov decision processes under a communicating assumption. That is, given a communicating Markov decision process $M$, our algorithm has regret $K(M)\log(T) + \mathrm{o}(\log(T))$, where $T$ is the number of learning steps and $K(M)$ is the best possible constant. The algorithm works by explicitly tracking the constant $K(M)$ in order to learn optimally, balancing the trade-off between exploration (playing sub-optimally to gain information), co-exploration (playing optimally to gain information) and exploitation (playing optimally to score maximally). We further show that the function $K(M)$ is discontinuous, which poses a challenge for our approach. To address this, we describe a regularization mechanism that estimates $K(M)$ with arbitrary precision from empirical data.
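The three-way trade-off described above can be illustrated with a toy phase selector. This is a minimal sketch, not the paper's actual decision rule: the function name `choose_phase`, the boolean `info_need` flag, and the logarithmic exploration budget are all illustrative assumptions, standing in for the algorithm's real information and regret accounting.

```python
import math

def choose_phase(t: int, info_need: bool, explore_steps_used: int) -> str:
    """Toy selector for the exploration / co-exploration / exploitation
    trade-off. Purely illustrative, not the paper's rule.

    - "explore": play sub-optimally to gain information; rationed so that
      sub-optimal play stays within an O(log t) budget (matching the
      logarithmic regret scaling).
    - "co-explore": play optimally while information is still needed.
    - "exploit": play optimally to score maximally once enough is known.
    """
    if info_need and explore_steps_used < math.log(t + 2):
        return "explore"
    if info_need:
        return "co-explore"
    return "exploit"
```

Under this toy rule, forced exploration becomes vanishingly rare as $t$ grows, so almost all information must eventually come from co-exploration, i.e., from optimal play itself.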