🤖 AI Summary
This work addresses two challenges in constrained multi-agent reinforcement learning: the exponential growth of the joint action space, and constraints that couple agents in ways reward shaping alone does not capture. The authors combine coordination graphs with Lagrangian duality, decomposing the joint problem into pairwise regions. Each region is served by a set of shared Q-functions, one for the primary objective and one per constraint, and Max-Sum message passing coordinates action selection at execution time. A Lagrange multiplier controls the tradeoff between objective maximization and constraint satisfaction. The number of learned models is independent of the number of agents, a single trained model can trace a Pareto front without retraining, and the method comes with convergence guarantees under mild conditions and an interpretable compositional error bound. On cooperative navigation tasks with up to 10 agents, the resulting Pareto fronts dominate baselines trained at fixed reward-shaping ratios, and the method scales to team sizes where centralized approaches become intractable.
📝 Abstract
Constrained multi-agent reinforcement learning (CMARL) faces two intertwined challenges: the joint action space grows exponentially with the number of agents, and additional requirements couple agents in ways that reward structure alone does not capture. We introduce Coordination Graphs for Constrained Multi-Agent Reinforcement Learning (CG-CMARL), a framework that addresses both challenges by combining coordination graphs with Lagrangian duality. The framework decomposes the joint problem into pairwise regions, each served by a set of shared Q-functions, one for the primary objective and one for each of the constraints, so that the number of learned models is independent of the number of agents. At execution time, Max-Sum message passing coordinates actions across the factor graph, while a Lagrangian multiplier controls the objective-constraint tradeoff, allowing a single trained model to trace a Pareto front without retraining. We provide convergence guarantees under mild conditions, together with a compositional error bound that decomposes into separate interpretable sources, each traceable to a specific design choice and independently controllable. Experiments on cooperative navigation tasks (where teams of up to 10 agents must coordinate to reach target positions while satisfying pairwise constraints) show that our method produces Pareto fronts dominating established baselines trained at fixed reward-shaping ratios, while scaling to team sizes where centralized approaches become intractable.
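To make the execution-time mechanism concrete, the following is a minimal sketch of Max-Sum action selection over Lagrangian-scalarized pairwise values. It is not the authors' implementation. Everything in it is an assumption made for illustration: the 4-agent chain graph, the random tables `q_obj` and `q_cost` standing in for the outputs of trained pairwise Q-functions, the single constraint, the scalarization `q_obj - lam * q_cost`, and all function names. On a tree-shaped graph such as this chain, Max-Sum recovers the joint maximizer when it is unique, so sweeping `lam` with fixed tables traces objective/cost tradeoff points without any retraining.

```python
import itertools
import numpy as np


def joint_value(actions, edges, tables):
    """Sum of pairwise table entries selected by a joint action."""
    return sum(tables[(i, j)][actions[i], actions[j]] for i, j in edges)


def max_sum(n_agents, edges, payoff, n_actions, n_iters=10):
    """Pairwise Max-Sum; payoff[(i, j)] is indexed [a_i, a_j]."""
    neighbors = {i: [] for i in range(n_agents)}
    for i, j in edges:
        neighbors[i].append(j)
        neighbors[j].append(i)
    zeros = np.zeros(n_actions)
    # msgs[(i, j)] is the message from agent i to agent j, a vector over a_j.
    msgs = {(i, j): zeros.copy() for i in neighbors for j in neighbors[i]}
    for _ in range(n_iters):
        new_msgs = {}
        for i, j in msgs:
            # Orient the table so that rows are a_i and columns are a_j.
            table = payoff[(i, j)] if (i, j) in payoff else payoff[(j, i)].T
            # Aggregate what i has heard from every neighbour except j.
            incoming = sum((msgs[(k, i)] for k in neighbors[i] if k != j), zeros)
            m = (table + incoming[:, None]).max(axis=0)
            new_msgs[(i, j)] = m - m.mean()  # normalise; argmax is unchanged
        msgs = new_msgs
    # Each agent picks the action maximising the sum of its incoming messages.
    return [
        int(np.argmax(sum((msgs[(k, i)] for k in neighbors[i]), zeros)))
        for i in range(n_agents)
    ]


def brute_force(n_agents, edges, payoff, n_actions):
    """Exhaustive search over joint actions, feasible only for tiny teams."""
    candidates = itertools.product(range(n_actions), repeat=n_agents)
    return max(candidates, key=lambda a: joint_value(a, edges, payoff))


rng = np.random.default_rng(0)
n_agents, n_actions = 4, 3
edges = [(0, 1), (1, 2), (2, 3)]  # hypothetical chain-shaped coordination graph
# Random stand-ins for pairwise objective values and constraint costs.
q_obj = {e: rng.normal(size=(n_actions, n_actions)) for e in edges}
q_cost = {e: rng.uniform(size=(n_actions, n_actions)) for e in edges}

front = []
for lam in [0.0, 0.5, 1.0, 2.0, 5.0]:
    # Assumed scalarisation: objective minus multiplier times constraint cost.
    scalarized = {e: q_obj[e] - lam * q_cost[e] for e in edges}
    actions = max_sum(n_agents, edges, scalarized, n_actions)
    front.append((lam, actions,
                  joint_value(actions, edges, q_obj),
                  joint_value(actions, edges, q_cost)))

for lam, actions, obj, cost in front:
    print(f"lambda={lam:.1f} actions={actions} objective={obj:.3f} cost={cost:.3f}")
```

Each message update costs on the order of the squared number of actions per edge, so the work grows with the number of edges, whereas the brute-force check enumerates every joint action and grows exponentially with team size. On graphs with cycles, Max-Sum is only approximate, and the sketch makes no claim about how the paper handles that case.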