MA-VLCM: A Vision Language Critic Model for Value Estimation of Policies in Multi-Agent Team Settings

📅 2026-03-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limitations of traditional centralized critics in multi-agent reinforcement learning, which suffer from low sample efficiency, poor generalization, and deployment challenges in resource-constrained heterogeneous robotic systems. The authors propose MA-VLCM, a novel framework that leverages a pre-trained vision-language model (VLM) as a training-free centralized critic to estimate state values by integrating natural language task descriptions, visual trajectories, and multi-agent states. This approach significantly improves sample efficiency and cross-environment generalization while enabling the generation of lightweight policies. Experimental results demonstrate that MA-VLCM achieves strong zero-shot return prediction performance in both in-distribution and out-of-distribution multi-agent scenarios and is compatible with various VLM backbones.

📝 Abstract
Multi-agent reinforcement learning (MARL) commonly relies on a centralized critic to estimate the value function. However, learning such a critic from scratch is highly sample-inefficient and often lacks generalization across environments. At the same time, large vision-language-action models (VLAs) trained on internet-scale data exhibit strong multimodal reasoning and zero-shot generalization capabilities, yet directly deploying them for robotic execution remains computationally prohibitive, particularly in heterogeneous multi-robot systems with diverse embodiments and resource constraints. To address these challenges, we propose Multi-Agent Vision-Language-Critic Models (MA-VLCM), a framework that replaces the learned centralized critic in MARL with a pretrained vision-language model fine-tuned to evaluate multi-agent behavior. MA-VLCM acts as a centralized critic conditioned on natural language task descriptions, visual trajectory observations, and structured multi-agent state information. By eliminating critic learning during policy optimization, our approach significantly improves sample efficiency while producing compact execution policies suitable for deployment on resource-constrained robots. Results show strong zero-shot return estimation across differing VLM backbones in both in-distribution and out-of-distribution multi-agent team scenarios.
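The critic described in the abstract conditions a VLM on a language task description plus structured multi-agent state and reads back a scalar value. A minimal prompt-and-parse sketch of that idea is below; the prompt wording, `AgentState` fields, and the helpers `build_critic_prompt` and `parse_value` are illustrative assumptions, not the authors' actual implementation (which also feeds in visual trajectory observations).

```python
import re
from dataclasses import dataclass


@dataclass
class AgentState:
    """Structured per-agent state passed to the critic (fields are assumed)."""
    agent_id: str
    position: tuple


def build_critic_prompt(task: str, states: list) -> str:
    """Assemble the text portion of a VLM critic query from the task
    description and the current multi-agent state."""
    lines = [f"Task: {task}", "Agent states:"]
    for s in states:
        lines.append(f"- {s.agent_id}: position={s.position}")
    lines.append(
        "Estimate the team's discounted return on a 0-100 scale. "
        "Answer with a single number."
    )
    return "\n".join(lines)


def parse_value(vlm_reply: str) -> float:
    """Extract the first number from the model's free-text reply and
    use it as the state-value estimate."""
    m = re.search(r"-?\d+(\.\d+)?", vlm_reply)
    if m is None:
        raise ValueError("no numeric value found in VLM reply")
    return float(m.group())
```

In use, `build_critic_prompt` would accompany trajectory frames in a multimodal query, and `parse_value` would turn the reply (e.g. "The estimated return is 72.") into the scalar that stands in for the learned critic during policy optimization.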
Problem

Research questions and friction points this paper is trying to address.

multi-agent reinforcement learning
centralized critic
sample efficiency
zero-shot generalization
vision-language models
Innovation

Methods, ideas, or system contributions that make the work stand out.

vision-language model
multi-agent reinforcement learning
centralized critic
zero-shot generalization
sample efficiency
Shahil Shaik
Mechanical Engineering Department, Clemson University
Aditya Parameshwaran
Mechanical Engineering Department, Clemson University
Anshul Nayak
Virginia Tech
Uncertainty Quantification, Deep Learning, Model Predictive Control, Reinforcement Learning, Robotics
Jonathon M. Smereka
Researcher, U.S. Army DEVCOM GVSC
Computer Vision, Biometrics, Machine Learning, Robotics
Yue Wang
Mechanical Engineering Department, Clemson University