🤖 AI Summary
This work addresses the tendency of utilitarian optimization in multi-agent reinforcement learning to yield unfair reward distributions and unequal leader-follower dynamics. Existing fairness-aware methods either lack theoretical guarantees or break the stationarity of Markov games. To bridge this gap, the paper introduces α-fair HATRPO and α-fair HAPPO, two algorithms that embed tunable α-fairness in the Heterogeneous-Agent Trust Region Learning (HATRL) framework, preserving its guarantees of monotonic policy improvement and convergence toward Nash equilibria. The methods use a fair advantage function that dynamically weights agent utilities by their expected returns, so that the parameter α moves the global objective from purely utilitarian efficiency to α-fair welfare. Experiments on the sequential social dilemmas CleanUp and CommonHarvest show that the proposed algorithms outperform HATRL's algorithms from a utilitarian point of view while achieving better social outcomes.
📝 Abstract
Cooperation in multi-agent systems is typically optimized through utilitarian objectives that maximize overall efficiency but ignore how reward is distributed, often resulting in inequitable "leader-follower" dynamics. Fairness-based approaches encourage pro-social behaviors in which every agent benefits from cooperation, but many current algorithms, including those that rely on reward shaping, break the stationarity of Markov Games or lack rigorous theoretical guarantees. This leaves a critical gap between fairness-oriented objectives and theoretically safe learning frameworks. We propose a novel framework that combines $α$-fairness with Heterogeneous-Agent Trust Region Learning (HATRL), ensuring monotonic improvement and convergence toward Nash Equilibria. Our approach leverages a fair advantage function that dynamically weights agent utilities based on their expected returns, allowing the global objective to transition from purely utilitarian efficiency to $α$-fair welfare as the parameter $α$ varies. We introduce two practical algorithms, $α$-fair HATRPO and $α$-fair HAPPO, and demonstrate through experiments in sequential social dilemmas such as CleanUp and CommonHarvest that they outperform HATRL's algorithms from a utilitarian point of view while achieving better social outcomes.
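The abstract does not spell out the fair advantage function, so the following is only a minimal sketch of how return-dependent weighting could work. It assumes the standard $α$-fair welfare form (sum of $u_i^{1-α}/(1-α)$, or sum of $\log u_i$ when $α = 1$) and uses its derivative $u_i^{-α}$ as the per-agent weight. The function names, the weighting rule, and the requirement of strictly positive returns are illustrative assumptions, not the paper's definitions.

```python
import math


def alpha_fair_welfare(returns, alpha):
    """Standard alpha-fair welfare of strictly positive per-agent returns.

    alpha = 0 gives the utilitarian sum; alpha = 1 gives the sum of logs.
    """
    if any(r <= 0 for r in returns):
        raise ValueError("alpha-fair welfare assumes strictly positive returns")
    if math.isclose(alpha, 1.0):
        return sum(math.log(r) for r in returns)
    return sum(r ** (1.0 - alpha) for r in returns) / (1.0 - alpha)


def fairness_weights(returns, alpha):
    """Derivative of the welfare above w.r.t. each return: r_i ** (-alpha).

    Agents with lower expected return receive larger weights when alpha > 0.
    """
    if any(r <= 0 for r in returns):
        raise ValueError("weights assume strictly positive returns")
    return [r ** (-alpha) for r in returns]


def fair_advantage(advantages, returns, alpha):
    """Hypothetical fair advantage: per-agent advantages weighted by fairness_weights.

    This is an assumed form for illustration, not the paper's definition.
    """
    weights = fairness_weights(returns, alpha)
    return sum(w * a for w, a in zip(weights, advantages))


# Example: one well-off agent (return 10) and one worse-off agent (return 2),
# each with the same per-agent advantage estimate.
expected_returns = [10.0, 2.0]
agent_advantages = [1.0, 1.0]

utilitarian = fair_advantage(agent_advantages, expected_returns, alpha=0.0)
proportional = fair_advantage(agent_advantages, expected_returns, alpha=1.0)
```

Under these assumptions, $α = 0$ weights every agent equally and recovers the plain sum of advantages, while larger $α$ shifts weight toward the agent with the lower expected return, which matches the transition from utilitarian efficiency to $α$-fair welfare described in the abstract.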