Vision-Based Generic Potential Function for Policy Alignment in Multi-Agent Reinforcement Learning

📅 2025-02-19
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
In multi-agent reinforcement learning, aligning agent policies with human commonsense knowledge faces challenges including difficult reward specification and poor generalization to long-horizon tasks. This paper proposes a hierarchical vision-driven potential-based reward shaping method: at the lower level, a vision-language model (VLM) is directly instantiated as a universal potential function to encode cross-scenario commonsense; at the upper level, a vision-augmented large language model (vLLM) dynamically selects context-appropriate potential functions and enables adaptive skill selection via video replay and training log analysis. We theoretically prove that the proposed shaping preserves optimal policies. Evaluated in the Google Research Football environment, our method significantly improves win rates, and human evaluation confirms strong alignment between learned policies and human commonsense. The approach establishes a scalable, interpretable paradigm for commonsense alignment in multi-agent systems.
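For context, the optimality-preservation claim is the hallmark of potential-based reward shaping (Ng, Harada & Russell, 1999); the block below restates that classical result, with the potential Φ here supplied by a VLM rather than by hand-crafted rules. The notation is ours, not lifted from the paper.

```latex
% Potential-based shaping adds a potential difference to the reward:
F(s, a, s') = \gamma\,\Phi(s') - \Phi(s), \qquad
\tilde{r}_t = r_t + F(s_t, a_t, s_{t+1}).
% The potential terms telescope along any trajectory, so
Q^*_{\tilde{M}}(s, a) = Q^*_{M}(s, a) - \Phi(s),
% and the greedy (optimal) policy of the shaped MDP \tilde{M}
% coincides with that of the original MDP M.
```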

📝 Abstract
Guiding the policy of multi-agent reinforcement learning to align with human common sense is a difficult problem, largely due to the complexity of modeling common sense as a reward, especially in complex and long-horizon multi-agent tasks. Recent works have shown the effectiveness of reward shaping, such as potential-based rewards, in enhancing policy alignment. Existing works, however, primarily rely on experts to design rule-based rewards, which are often labor-intensive and lack a high-level semantic understanding of common sense. To solve this problem, we propose a hierarchical vision-based reward shaping method. At the bottom layer, a vision-language model (VLM) serves as a generic potential function, guiding the policy to align with human common sense through its intrinsic semantic understanding. To help the policy adapt to uncertainty and changes in long-horizon tasks, the top layer features an adaptive skill selection module based on a visual large language model (vLLM). The module uses instructions, video replays, and training records to dynamically select a suitable potential function from a pre-designed pool. Furthermore, our method is theoretically proven to preserve the optimal policy. Extensive experiments conducted in the Google Research Football environment demonstrate that our method not only achieves a higher win rate but also effectively aligns the policy with human common sense.
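As a rough illustration (not the authors' code), a minimal sketch of the two-level loop might look like the following, where `vlm_potential`, `select_skill`, and the skill pool are hypothetical stand-ins for the paper's VLM potential function and vLLM skill-selection module:

```python
import random  # placeholder randomness for the illustrative stubs below

GAMMA = 0.99  # discount factor of the underlying MDP

def vlm_potential(frame, skill_prompt):
    """Hypothetical lower-level potential Phi(s): a vision-language model
    scores how well a rendered frame matches a commonsense skill
    description. Stubbed here with a random score in [0, 1]."""
    return random.random()

def select_skill(instruction, replay_clip, training_log, skill_pool):
    """Hypothetical upper-level selector: a visual LLM reads the
    instruction, a video replay, and training records, then returns
    the most context-appropriate skill prompt from the pool."""
    return random.choice(skill_pool)

def shaped_reward(r_env, frame, next_frame, skill_prompt):
    """Potential-based shaping: r + gamma * Phi(s') - Phi(s),
    which leaves the optimal policy of the underlying MDP unchanged."""
    phi = vlm_potential(frame, skill_prompt)
    phi_next = vlm_potential(next_frame, skill_prompt)
    return r_env + GAMMA * phi_next - phi

# Usage sketch: re-select the skill periodically, shape every transition.
skill_pool = ["spread out and pass forward", "press the ball carrier",
              "fall back and defend"]
skill = select_skill("win the match", replay_clip=None, training_log=[],
                     skill_pool=skill_pool)
print(shaped_reward(0.0, frame=None, next_frame=None, skill_prompt=skill))
```
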
Problem

Research questions and friction points this paper is trying to address.

Multi-agent reinforcement learning policy alignment
Vision-based reward shaping method
Hierarchical adaptive skill selection
Innovation

Methods, ideas, or system contributions that make the work stand out.

Vision-based reward shaping method
Adaptive skill selection module
Vision-language model potential function
Hao Ma
School of Artificial Intelligence, University of Chinese Academy of Sciences; Institute of Automation, Chinese Academy of Sciences
Shijie Wang
School of Artificial Intelligence, University of Chinese Academy of Sciences; Institute of Automation, Chinese Academy of Sciences
Zhiqiang Pu
School of Artificial Intelligence, University of Chinese Academy of Sciences; Institute of Automation, Chinese Academy of Sciences
Siyao Zhao
School of Artificial Intelligence, University of Chinese Academy of Sciences; Institute of Automation, Chinese Academy of Sciences
Xiaolin Ai
Institute of Automation, Chinese Academy of Sciences
🏷️ Topic: multi-agent systems