Learning in Zero-Sum Markov Games: Relaxing Strong Reachability and Mixing Time Assumptions

📅 2023-12-13
📈 Citations: 2
Influential: 0
🤖 AI Summary
This paper studies payoff-based decentralized learning in infinite-horizon zero-sum Markov games, where each player observes only its own rewards and has no access to the opponent's strategy, actions, or any shared information. Existing methods rely on strong global assumptions, such as uniform reachability and bounded mixing times across all policies. To relax these, the authors introduce Tsallis entropy regularization into a decentralized Q-learning framework, yielding a smoothed best-response update mechanism. Crucially, convergence is guaranteed under a significantly weaker condition: the mere existence of a (possibly unknown) reference policy with bounded reachability and mixing time. Under this condition, the algorithm provably reaches an ε-Nash equilibrium in finite time. The result substantially reduces dependence on global MDP structure, enabling robust decentralized learning even in non-convex, non-stationary game settings, and establishes a more general theoretical foundation for distributed multi-agent reinforcement learning.
📝 Abstract
We address payoff-based decentralized learning in infinite-horizon zero-sum Markov games. In this setting, each player makes decisions based solely on received rewards, without observing the opponent's strategy or actions nor sharing information. Prior works established finite-time convergence to an approximate Nash equilibrium under strong reachability and mixing time assumptions. We propose a convergent algorithm that significantly relaxes these assumptions, requiring only the existence of a single policy (not necessarily known) with bounded reachability and mixing time. Our key technical novelty is introducing Tsallis entropy regularization to smooth the best-response policy updates. By suitably tuning this regularization, we ensure sufficient exploration, thus bypassing previous stringent assumptions on the MDP. By establishing novel properties of the value and policy updates induced by the Tsallis entropy regularizer, we prove finite-time convergence to an approximate Nash equilibrium.
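The abstract describes smoothing the best-response policy update with Tsallis entropy regularization. The paper's exact update rule is not reproduced in this entry, but the idea can be sketched under one common special case: for Tsallis entropy with index q = 2, maximizing ⟨π, Q⟩ + τ·(1 − ‖π‖²) over the simplex reduces to a sparsemax (Euclidean) projection of Q/(2τ). The function names and the choice q = 2 below are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def sparsemax(z):
    """Euclidean projection of z onto the probability simplex."""
    z = np.asarray(z, dtype=float)
    u = np.sort(z)[::-1]                      # sorted descending
    css = np.cumsum(u)
    k = np.arange(1, len(z) + 1)
    support = u - (css - 1.0) / k > 0         # indices kept in the support
    k_max = k[support][-1]
    threshold = (css[k_max - 1] - 1.0) / k_max
    return np.maximum(z - threshold, 0.0)

def tsallis_smoothed_best_response(q_values, reg):
    """Illustrative smoothed best response under q=2 Tsallis entropy:
    argmax_pi <pi, Q> + reg * (1 - ||pi||^2)  ==  sparsemax(Q / (2*reg)).
    Small `reg` approaches the greedy best response; large `reg`
    spreads probability mass, enforcing exploration."""
    return sparsemax(np.asarray(q_values, dtype=float) / (2.0 * reg))
```

Tuning `reg` trades off exploitation against the exploration the paper uses to bypass global reachability assumptions: with weak regularization the policy concentrates on the greedy action, while strong regularization keeps every action played with positive probability.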
Problem

Research questions and friction points this paper is trying to address.

Decentralized learning in zero-sum Markov games
Relaxing strong reachability and mixing assumptions
Finite-time convergence to approximate Nash equilibrium
Innovation

Methods, ideas, or system contributions that make the work stand out.

Tsallis entropy regularization
Bounded reachability policy
Finite-time Nash equilibrium