🤖 AI Summary
Addressing resource contention and communication constraints in task offloading for wireless edge networks, this paper proposes a decentralized multi-agent reinforcement learning (MARL) framework. Methodologically, we formulate the problem as a constrained Markov decision process (CMDP), introduce a dynamically updated shared constraint vector to enable lightweight implicit coordination, and integrate decentralized policy optimization with a sparse constraint-update mechanism, achieving alignment with global resource objectives under extremely low communication overhead. We theoretically establish convergence guarantees and enhance policy robustness via safe reinforcement learning. Experiments demonstrate that our approach significantly outperforms centralized and independent baselines in large-scale scenarios, effectively balancing local decision efficiency with global load balancing. The core contribution lies in replacing conventional explicit coordination with a scalable, low-communication-overhead constraint-sharing mechanism, enabling efficient and adaptive distributed control in resource-constrained edge environments.
📝 Abstract
In edge computing systems, autonomous agents must make fast local decisions while competing for shared resources. Existing MARL methods often resort to centralized critics or frequent communication, which fail under limited observability and communication constraints. We propose a decentralized framework in which each agent solves a constrained Markov decision process (CMDP), coordinating implicitly through a shared constraint vector. In the specific case of task offloading, for example, the constraints prevent overloading shared server resources. These constraints are updated infrequently and act as a lightweight coordination mechanism, enabling agents to align with global resource-usage objectives while requiring little direct communication. Using safe reinforcement learning, agents learn policies that meet both local and global goals. We establish theoretical guarantees under mild assumptions and validate our approach experimentally, showing improved performance over centralized and independent baselines, especially in large-scale settings.
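To make the coordination idea concrete, the following is a minimal toy sketch of agents choosing offloading targets using only a sparsely refreshed shared constraint vector. Every name and parameter here (agent and server counts, the update period, the slack-proportional policy) is an illustrative assumption, not the paper's actual algorithm:

```python
import random

# Toy sketch: N agents offload tasks to M servers, coordinating only through
# a shared constraint vector (remaining per-server budget) that is refreshed
# sparsely, every UPDATE_PERIOD steps. Between refreshes, agents act fully
# independently on their stale local copy -- no other communication.
N_AGENTS, N_SERVERS = 8, 3
UPDATE_PERIOD = 10
CAPACITY = [1.0, 1.0, 1.0]  # per-server load budget (the global constraint)

def local_policy(constraint_vec):
    # Sample a server with probability proportional to its remaining budget.
    # This is a hypothetical stand-in for a learned constrained (CMDP) policy.
    weights = [max(s, 1e-6) for s in constraint_vec]
    return random.choices(range(N_SERVERS), weights=weights)[0]

def run(steps=100, seed=0):
    random.seed(seed)
    constraint_vec = CAPACITY[:]   # stale shared copy that agents act on
    true_load = [0.0] * N_SERVERS  # smoothed actual server load
    for t in range(steps):
        step_load = [0.0] * N_SERVERS
        for _ in range(N_AGENTS):  # decentralized, independent decisions
            s = local_policy(constraint_vec)
            step_load[s] += 1.0 / N_AGENTS
        # Exponential moving average of realized load.
        true_load = [0.9 * l + 0.1 * x for l, x in zip(true_load, step_load)]
        if t % UPDATE_PERIOD == 0:
            # Sparse constraint update: the only communication event.
            constraint_vec = [c - l for c, l in zip(CAPACITY, true_load)]
    return true_load

loads = run()
print([round(l, 3) for l in loads])
```

The slack-proportional sampling is one way to avoid all agents herding onto the same server between constraint refreshes; the paper's learned policies would play this role in practice.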