🤖 AI Summary
In multi-agent games, the coexistence of multiple suboptimal Nash equilibria often prevents standard reinforcement learning from reaching coordinated outcomes with high social welfare. This work proposes an approach that integrates swap regret minimization, a centralized attention-based critic, and Lagrangian equilibrium selection, unifying scalable vector-valued regret estimation with social welfare optimization in a single deep multi-agent reinforcement learning framework. The method steers agents toward high-welfare correlated equilibria and achieves high collective return with competitive fairness on matrix games, Multi-Agent Particle Environments (MPE), and the Melting Pot Harvest task, enabling efficient and stable coordination.
📝 Abstract
Real-world multi-agent systems, from traffic coordination to resource allocation, are often modeled as general-sum games where individual incentives conflict with collective welfare. In these settings, the central challenge is not merely finding an equilibrium, but selecting socially desirable outcomes among many suboptimal Nash equilibria. Standard deep multi-agent reinforcement learning (MARL) methods struggle with this problem, as value-decomposition approaches are constrained by monotonicity assumptions and policy-gradient methods often converge to stable but socially inefficient equilibria. To address this limitation, we propose $Φ$-Actor-Critic ($Φ$-AC), a framework that leverages swap regret minimization to steer learning toward high-welfare correlated equilibria (CE). To make counterfactual regret estimation tractable in deep MARL, $Φ$-AC employs a centralized attention critic that predicts vector-valued regrets in a single forward pass, avoiding computationally expensive counterfactual simulations. We further introduce a Lagrangian-based equilibrium selection mechanism that optimizes social welfare while enforcing stability through regret constraints. Experiments on matrix games, Multi-Agent Particle Environments (MPE), and the Melting Pot Harvest scenario demonstrate that $Φ$-AC learns efficient and stable coordination strategies across diverse mixed-motive settings while maintaining high collective return and competitive fairness.
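To make the link between swap regret, correlated equilibria, and welfare-based selection concrete, the following is a minimal sketch on a two-player game of Chicken. It is not the $Φ$-AC implementation: the payoff table, the function names (`swap_regret`, `social_welfare`, `lagrangian`), and the multiplier value `lam=2.0` are all illustrative assumptions, and regrets are computed exactly by enumeration rather than predicted by a learned critic.

```python
# Illustrative two-player game of Chicken (assumed payoffs, not from the paper).
# Action 0 = Dare, action 1 = Chicken.
# PAYOFF[a0][a1] = (utility of player 0, utility of player 1)
PAYOFF = [
    [(0, 0), (7, 2)],
    [(2, 7), (6, 6)],
]
ACTIONS = (0, 1)


def swap_regret(mu, player):
    """Largest expected gain `player` can obtain by remapping each
    recommended action to another action, under joint distribution mu,
    where mu[a0][a1] is the probability of the joint action (a0, a1)."""
    total = 0.0
    for rec in ACTIONS:          # action recommended to `player`
        best_gain = 0.0          # following the recommendation gains 0
        for dev in ACTIONS:      # candidate replacement action
            gain = 0.0
            for other in ACTIONS:  # opponent's action
                if player == 0:
                    p = mu[rec][other]
                    gain += p * (PAYOFF[dev][other][0] - PAYOFF[rec][other][0])
                else:
                    p = mu[other][rec]
                    gain += p * (PAYOFF[other][dev][1] - PAYOFF[other][rec][1])
            best_gain = max(best_gain, gain)
        total += best_gain
    return total


def social_welfare(mu):
    """Expected sum of both players' utilities under mu."""
    return sum(
        mu[a0][a1] * sum(PAYOFF[a0][a1]) for a0 in ACTIONS for a1 in ACTIONS
    )


def lagrangian(mu, lam):
    """Welfare penalized by total swap regret (assumed form of the objective)."""
    regret = swap_regret(mu, 0) + swap_regret(mu, 1)
    return social_welfare(mu) - lam * regret


# Three candidate joint distributions over (a0, a1).
pure_nash = [[0.0, 1.0], [0.0, 0.0]]      # always (Dare, Chicken)
correlated = [[0.0, 0.25], [0.25, 0.5]]   # a mediator mixes three outcomes
all_chicken = [[0.0, 0.0], [0.0, 1.0]]    # always (Chicken, Chicken)

for name, mu in [("pure_nash", pure_nash),
                 ("correlated", correlated),
                 ("all_chicken", all_chicken)]:
    print(name, social_welfare(mu),
          swap_regret(mu, 0), swap_regret(mu, 1),
          lagrangian(mu, lam=2.0))
```

In this toy game the pure Nash outcome is stable (zero swap regret) but has welfare 9, mutual Chicken has welfare 12 but leaves each player a swap regret of 1, and the correlated distribution reaches welfare 10.5 with zero swap regret for both players. Under the assumed penalty `lam=2.0`, the correlated distribution scores highest, which is the kind of trade-off a regret-constrained welfare objective is meant to resolve.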