🤖 AI Summary
This paper addresses the joint optimization of fairness and system utility in multi-agent multi-armed bandits (MA-MAB) under limited arm reward information. We propose the first fairness-aware active exploration framework: in the offline phase, we design a greedy algorithm with theoretical guarantees leveraging submodularity; in the online phase, we achieve sublinear regret under explicit fairness constraints. Our method integrates submodular optimization, stochastic exploration strategies, and multi-agent online learning to jointly model individual fairness and global utility. Experiments on synthetic and real-world datasets demonstrate up to a 37% improvement in fairness metrics and a 22% increase in cumulative reward, while strictly satisfying the sublinear regret bound—significantly outperforming existing baselines.
📝 Abstract
We propose a multi-agent multi-armed bandit (MA-MAB) framework aimed at ensuring fair outcomes across agents while maximizing overall system performance. A key challenge in this setting is decision-making under limited information about arm rewards. To address this, we introduce a novel probing framework that strategically gathers information about selected arms before allocation. In the offline setting, where reward distributions are known, we leverage submodular properties to design a greedy probing algorithm with a provable performance bound. For the more complex online setting, we develop an algorithm that achieves sublinear regret while maintaining fairness. Extensive experiments on synthetic and real-world datasets show that our approach outperforms baseline methods, achieving better fairness and efficiency.