Provable Memory Efficient Self-Play Algorithm for Model-free Reinforcement Learning

📅 2025-11-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the bottlenecks in multi-agent reinforcement learning (MARL)—namely, high memory overhead, large sample complexity, heavy computational burden, and non-Markovian policies—this paper proposes ME-Nash-QL, the first provably efficient model-free self-play algorithm for two-player zero-sum Markov games. Methodologically, it tightly integrates self-play with Q-learning–driven value iteration to enable memory-efficient policy search. Its key contributions are threefold: (i) optimal space complexity of $O(SABH)$; (ii) burn-in cost reduced to $O(SAB\,\mathrm{poly}(H))$; and (iii) strict preservation of Markov Nash equilibrium policies. Theoretically, ME-Nash-QL achieves a sample complexity of $\widetilde{O}(H^4 SAB / \varepsilon^2)$ and computational complexity of $O(T \cdot \mathrm{poly}(AB))$, significantly outperforming prior methods in long-horizon and large-state-space regimes. It thus establishes the current state-of-the-art in space, sample, and computational efficiency for this setting.

📝 Abstract
The thriving field of multi-agent reinforcement learning (MARL) studies how a group of interacting agents make decisions autonomously in a shared dynamic environment. Existing theoretical studies in this area suffer from at least two of the following obstacles: memory inefficiency, heavy dependence of the sample complexity on the long horizon and the large state space, high computational complexity, non-Markov policies, non-Nash policies, and high burn-in cost. In this work, we take a step towards settling this problem by designing a model-free self-play algorithm \emph{Memory-Efficient Nash Q-Learning (ME-Nash-QL)} for two-player zero-sum Markov games, which is a specific setting of MARL. ME-Nash-QL is proven to enjoy the following merits. First, it can output an $\varepsilon$-approximate Nash policy with space complexity $O(SABH)$ and sample complexity $\widetilde{O}(H^4SAB/\varepsilon^2)$, where $S$ is the number of states, $\{A, B\}$ are the numbers of actions for the two players, and $H$ is the horizon length. It outperforms existing algorithms in terms of space complexity for tabular cases, and in terms of sample complexity for long horizons, i.e., when $\min\{A, B\}\ll H^2$. Second, ME-Nash-QL achieves the lowest computational complexity $O(T\,\mathrm{poly}(AB))$ while preserving Markov policies, where $T$ is the number of samples. Third, ME-Nash-QL also achieves the best burn-in cost $O(SAB\,\mathrm{poly}(H))$, whereas previous algorithms have a burn-in cost of at least $O(S^3 AB\,\mathrm{poly}(H))$ to attain the same level of sample complexity as ours.
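To make the setting concrete, the following minimal sketch shows the shape of a tabular Nash-Q-style update for a two-player zero-sum Markov game: a shared Q-table over state and joint action $(s, a, b)$, updated toward reward plus next-state value. This is an illustrative assumption, not the paper's ME-Nash-QL algorithm: for simplicity the stage-game value is computed as the pure-strategy maximin, whereas a Nash value requires mixed strategies (e.g. solved by a linear program), and the learning rate and exploration schedule here are placeholders.

```python
# Hypothetical sketch of a tabular Q-learning update for a finite-horizon
# two-player zero-sum Markov game. NOT the paper's ME-Nash-QL: the stage
# value below is the pure-strategy maximin, not the mixed-strategy Nash
# value, and the 1/t learning rate is a simplification.

def maximin_value(payoff):
    """Max-player's pure-strategy maximin value of a payoff matrix
    (rows: max-player actions, columns: min-player actions)."""
    return max(min(row) for row in payoff)

def q_update(Q, V_next, s, a, b, r, s_next, t):
    """One Q-learning step on sample (s, a, b, r, s_next) at visit count t.

    Q:      dict state -> A x B payoff matrix (max-player's Q-values)
    V_next: dict state -> estimated value at the next step of the horizon
    Returns the updated maximin stage value at state s.
    """
    alpha = 1.0 / t                      # placeholder learning rate
    target = r + V_next[s_next]          # reward plus next-step value
    Q[s][a][b] = (1 - alpha) * Q[s][a][b] + alpha * target
    return maximin_value(Q[s])

# Toy usage: one state, two actions per player, a single sample.
Q = {0: [[0.0, 0.0], [0.0, 0.0]]}
V_next = {0: 0.0}
v = q_update(Q, V_next, s=0, a=0, b=1, r=1.0, s_next=0, t=1)
```

Under the theory in the abstract, maintaining only such per-step tables (rather than a model of transitions) is what yields the $O(SABH)$ space complexity, since each of the $H$ steps stores one $S \times A \times B$ table.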
Problem

Research questions and friction points this paper is trying to address.

Develops a memory-efficient self-play algorithm for two-player zero-sum Markov games.
Addresses high sample complexity dependence on long horizons and large state spaces.
Reduces computational complexity and burn-in cost compared to existing methods.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Memory-efficient Nash Q-Learning algorithm for two-player zero-sum games
Low computational complexity with Markov policy preservation
Best burn-in cost and sample complexity for long horizons
Na Li — College of Information Science and Electronic Engineering, Zhejiang University, Hangzhou, China
Yuchen Jiao — School of Information and Electronics, Beijing Institute of Technology, Beijing, China
Hangguan Shan — Zhejiang University; wireless communications and wireless networking
Shefeng Yan — Institute of Acoustics, Chinese Academy of Sciences, Beijing, China