Assigning Credit with Partial Reward Decoupling in Multi-Agent Proximal Policy Optimization

📅 2024-08-08
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
In multi-agent reinforcement learning, scaling team size exacerbates the credit assignment problem. To address this, we propose PRD-MAPPO—a novel extension of MAPPO that integrates Partial Reward Decoupling (PRD) for the first time into a centralized-critic, decentralized-execution framework. We design a tailored PRD variant compatible with shared reward settings and introduce a learnable attention mechanism to dynamically model inter-agent influence, enabling task-dependent adaptive subgroup formation and joint optimization. Empirical evaluation on benchmark multi-agent domains—including StarCraft II—demonstrates that PRD-MAPPO significantly improves both sample efficiency and asymptotic performance over standard MAPPO and state-of-the-art methods. These results validate the efficacy of dynamic reward decoupling and structured credit assignment in large-scale cooperative MARL.

📝 Abstract
Multi-agent proximal policy optimization (MAPPO) has recently demonstrated state-of-the-art performance on challenging multi-agent reinforcement learning tasks. However, MAPPO still struggles with the credit assignment problem, wherein the sheer difficulty in ascribing credit to individual agents' actions scales poorly with team size. In this paper, we propose a multi-agent reinforcement learning algorithm that adapts recent developments in credit assignment to improve upon MAPPO. Our approach leverages partial reward decoupling (PRD), which uses a learned attention mechanism to estimate which of a particular agent's teammates are relevant to its learning updates. We use this estimate to dynamically decompose large groups of agents into smaller, more manageable subgroups. We empirically demonstrate that our approach, PRD-MAPPO, decouples agents from teammates that do not influence their expected future reward, thereby streamlining credit assignment. We additionally show that PRD-MAPPO yields significantly higher data efficiency and asymptotic performance compared to both MAPPO and other state-of-the-art methods across several multi-agent tasks, including StarCraft II. Finally, we propose a version of PRD-MAPPO that is applicable to shared reward settings, where PRD was previously not applicable, and empirically show that this also leads to performance improvements over MAPPO.
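The core mechanism described in the abstract—using learned attention to estimate which teammates are relevant to an agent's learning update, then thresholding those weights to form subgroups—can be illustrated with a minimal sketch. The function name, the thresholding rule, and the use of raw dot-product attention are illustrative assumptions, not the paper's actual architecture:

```python
import numpy as np

def relevance_subgroups(queries, keys, threshold=0.1):
    """Hypothetical PRD-style subgroup formation: attention weights
    estimate how much teammate j influences agent i's expected
    return; teammates below the threshold are decoupled from i's
    learning update."""
    # Scaled dot-product attention scores, shape (n_agents, n_agents)
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)
    # Row-wise softmax: each agent's weights over its teammates
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    # Boolean mask: True where teammate j stays in agent i's subgroup
    mask = weights >= threshold
    np.fill_diagonal(mask, True)  # an agent is always relevant to itself
    return weights, mask

rng = np.random.default_rng(0)
q = rng.normal(size=(4, 8))  # 4 agents, 8-dim embeddings
k = rng.normal(size=(4, 8))
weights, mask = relevance_subgroups(q, k)
```

In the full method the attention weights are learned jointly with the critic, and the resulting mask restricts which teammates' rewards enter each agent's advantage estimate; this sketch only shows the decomposition step.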
Problem

Research questions and friction points this paper is trying to address.

Credit assignment in multi-agent reinforcement learning scales poorly with team size.
Identifying which teammates actually influence a given agent's expected future reward.
Applying partial reward decoupling in shared reward settings, where it was previously not applicable.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Learned attention mechanism for credit assignment
Dynamic decomposition of agents into subgroups
PRD-MAPPO adaptation for shared reward settings
Aditya Kapoor
ELLIS PhD @ University of Manchester
Reinforcement Learning · Foundational Models
B. Freed
Robotics Institute, Carnegie Mellon University, Pittsburgh, PA
H. Choset
Robotics Institute, Carnegie Mellon University, Pittsburgh, PA
Jeff Schneider
Carnegie Mellon University
Machine Learning