Minimax-Optimal Policy Regret in Partially Observable Markov Games

📅 2026-06-01

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses learning in partially observable Markov games (POMGs) against adaptive adversaries whose behavior depends on the learner's strategy, a setting in which standard regret notions become inadequate. The authors analyze an epoch-based optimistic maximum-likelihood algorithm that partitions the learning horizon into geometrically growing epochs, selects one policy per epoch, and builds confidence sets cumulatively from past data; this keeps the cost of comparing adversary responses across policies logarithmic in $T$. The algorithm attains $\tilde{O}(\sqrt{T})$ policy regret for fixed problem parameters, with explicit dependence on the horizon, adversary memory, confidence radius, and the aggregate Eluder dimension of the observable-operator class. A lower bound matches the $\sqrt{T}$ and aggregate-Eluder-dimension dependence, up to problem-dependent and logarithmic factors. The framework is further extended to horizon-adaptive guarantees and to adversaries with geometric fading memory.

📝 Abstract

We study sequential decision-making in partially observable environments against strategic, adaptive opponents, modeled as partially observable Markov games (POMGs). The central challenge is to learn latent dynamics from partial observations while facing an adversary whose behavior depends on the learner's strategy, making standard regret notions inadequate. We prove that an epoch-based optimistic maximum-likelihood algorithm achieves $\tilde{O}(\sqrt{T})$ policy regret for fixed problem parameters, with explicit dependence on the horizon, adversary memory, confidence radius, and the aggregate Eluder dimension of the observable-operator class. The algorithm selects one policy per geometrically growing epoch using confidence sets built cumulatively from past data, which keeps the cost of comparing adversary responses across policies logarithmic in $T$. We also prove a lower bound matching the $\sqrt{T}$ and aggregate-Eluder-dimension dependence, up to problem-dependent and logarithmic factors. Finally, we extend the framework to horizon-adaptive guarantees and adversaries with geometric fading memory.
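
To illustrate why one policy per geometrically growing epoch keeps the number of policy changes logarithmic in $T$, here is a minimal sketch of an epoch schedule. The abstract states only that epochs grow geometrically; the doubling factor, the initial epoch length of 1, and the function name `geometric_epochs` are assumptions made for illustration, not details taken from the paper.

```python
def geometric_epochs(T, first_len=1, growth=2):
    """Split rounds 0..T-1 into consecutive epochs of geometrically growing length.

    Hypothetical helper: first_len and growth are illustrative choices,
    not parameters specified by the paper.
    Returns a list of (start, end) pairs, with end exclusive.
    """
    epochs = []
    start, length = 0, first_len
    while start < T:
        end = min(start + length, T)  # truncate the final epoch at T
        epochs.append((start, end))
        start = end
        length *= growth
    return epochs


epochs = geometric_epochs(1000)
# One policy is held fixed within each epoch, so the number of
# policy changes is the number of epochs minus one.
num_switches = len(epochs) - 1
```

With doubling, the first $K$ epochs cover $2^K - 1$ rounds, so roughly $\log_2 T$ epochs suffice to cover a horizon of $T$ rounds. A learner that re-selects its policy only at epoch boundaries therefore changes policy only logarithmically many times.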

Problem

Research questions and friction points this paper is trying to address.

partially observable Markov games

policy regret

adaptive adversaries

sequential decision-making

latent dynamics

Innovation

Methods, ideas, or system contributions that make the work stand out.

partially observable Markov games

policy regret

optimistic maximum-likelihood