🤖 AI Summary
This paper addresses consensus-based value iteration (VI) in decentralized reinforcement learning, where multiple agents have no central coordinator and communicate only with their direct neighbors. We propose nonparametric Bellman mappings (B-Maps), which model the Q-function in a reproducing kernel Hilbert space (RKHS) and integrate distributed consensus optimization with graph signal processing. In addition to Q-function estimates, agents share covariance matrices, which propagate structural information about the basis functions across the network. Theoretically, both the Q-function and covariance-matrix estimates converge linearly to their consensus values, with the optimal learning rates determined by the ratio of the smallest positive eigenvalue to the largest eigenvalue of the network Laplacian; each agent's Q-function estimate is shown to lie close to the fixed point of a centralized nonparametric B-Map. Empirical evaluation on two canonical control tasks shows better performance than existing distributed RL methods. Counterintuitively, although each exchange carries more information (covariance matrices on top of Q-function estimates), B-Maps reach the desired performance at a lower cumulative communication cost, which points to the role of basis information in accelerating collaborative learning.
📝 Abstract
This paper introduces novel Bellman mappings (B-Maps) for value iteration (VI) in distributed reinforcement learning (DRL), where multiple agents operate over a network without a centralized fusion node. Each agent constructs its own nonparametric B-Map for VI while communicating only with direct neighbors to achieve consensus. These B-Maps operate on Q-functions represented in a reproducing kernel Hilbert space, enabling a nonparametric formulation that allows for flexible, agent-specific basis function design. Unlike existing DRL methods that restrict information exchange to Q-function estimates, the proposed framework also enables agents to share basis information in the form of covariance matrices, capturing additional structural details. A theoretical analysis establishes linear convergence rates for both Q-function and covariance-matrix estimates toward their consensus values. The optimal learning rates for consensus-based updates are dictated by the ratio of the smallest positive eigenvalue to the largest one of the network's Laplacian matrix. Furthermore, each nodal Q-function estimate is shown to lie very close to the fixed point of a centralized nonparametric B-Map, effectively allowing the proposed DRL design to approximate the performance of a centralized fusion center. Numerical experiments on two well-known control problems demonstrate the superior performance of the proposed nonparametric B-Maps compared to prior methods. Notably, the results reveal a counter-intuitive finding: although the proposed approach involves greater information exchange -- specifically through the sharing of covariance matrices -- it achieves the desired performance with lower cumulative communication cost than existing DRL schemes, highlighting the crucial role of basis information in accelerating the learning process.
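The dependence of the consensus learning rate on the Laplacian eigenvalue ratio can be illustrated with the classical linear consensus iteration x <- x - step * L x. The sketch below is not the paper's B-Map algorithm: it is a minimal scalar-averaging example on an assumed 6-node ring network, and all names (`ring_laplacian_eigenvalues`, `consensus_step`, `run_consensus`) are hypothetical. For this textbook iteration, the step 2 / (lam_min_pos + lam_max) gives the contraction factor (1 - r) / (1 + r), where r is the ratio of the smallest positive to the largest Laplacian eigenvalue.

```python
import math

def ring_laplacian_eigenvalues(n):
    """Closed-form Laplacian eigenvalues of an n-node ring graph (assumed topology)."""
    return sorted(2.0 - 2.0 * math.cos(2.0 * math.pi * k / n) for k in range(n))

def consensus_step(x, step):
    """One update x <- x - step * L x; each node reads only its two ring neighbors."""
    n = len(x)
    return [
        x[i] - step * (2.0 * x[i] - x[(i - 1) % n] - x[(i + 1) % n])
        for i in range(n)
    ]

def run_consensus(x0, step, iters):
    """Repeat the local update and return the final nodal values."""
    x = list(x0)
    for _ in range(iters):
        x = consensus_step(x, step)
    return x

n = 6
eig = ring_laplacian_eigenvalues(n)
lam_min_pos = min(v for v in eig if v > 1e-9)  # smallest positive eigenvalue
lam_max = eig[-1]                              # largest eigenvalue
ratio = lam_min_pos / lam_max

# Classical choice for this linear iteration (an assumption, not the paper's rule).
best_step = 2.0 / (lam_min_pos + lam_max)
contraction = (1.0 - ratio) / (1.0 + ratio)

x0 = [3.0, -1.0, 4.0, 1.0, -5.0, 9.0]  # arbitrary initial nodal values
target = sum(x0) / n                   # the network-wide average

x_fast = run_consensus(x0, best_step, 50)
x_slow = run_consensus(x0, 0.1, 50)    # deliberately conservative step
max_err = max(abs(v - target) for v in x_fast)
slow_err = max(abs(v - target) for v in x_slow)

print(f"ratio={ratio:.3f} step={best_step:.3f} contraction={contraction:.3f}")
print(f"error with tuned step: {max_err:.2e}, with step 0.1: {slow_err:.2e}")
```

A better-connected network has a larger eigenvalue ratio, so the contraction factor shrinks and fewer communication rounds are needed to reach a given accuracy.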