LAMARL: LLM-Aided Multi-Agent Reinforcement Learning for Cooperative Policy Generation

📅 2025-06-02
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
To address the low sample efficiency and heavy reliance on handcrafted reward functions in multi-agent reinforcement learning (MARL) for collaborative robotic tasks, this paper proposes an LLM-driven MARL framework. The method introduces the first fully automated construction of prior policies and learnable reward functions via large language models (LLMs), integrating Chain-of-Thought (CoT) structured prompting with a closed-loop MARL training pipeline to enable end-to-end cooperative policy generation without a human in the loop. The framework unifies LLMs, MARL algorithms, CoT-based prompt engineering, and robot control APIs. In shape assembly tasks, the LLM-derived prior policy improves sample efficiency by 185.9%; CoT-enhanced prompting combined with robot API integration increases LLM action-generation success rates by 28.5%-67.5%; and both simulation and real-robot experiments validate the framework's effectiveness and generalizability across diverse scenarios.

πŸ“ Abstract
Although Multi-Agent Reinforcement Learning (MARL) is effective for complex multi-robot tasks, it suffers from low sample efficiency and requires iterative manual reward tuning. Large Language Models (LLMs) have shown promise in single-robot settings, but their application in multi-robot systems remains largely unexplored. This paper introduces a novel LLM-Aided MARL (LAMARL) approach, which integrates MARL with LLMs, significantly enhancing sample efficiency without requiring manual design. LAMARL consists of two modules: the first module leverages LLMs to fully automate the generation of prior policy and reward functions. The second module is MARL, which uses the generated functions to guide robot policy training effectively. On a shape assembly benchmark, both simulation and real-world experiments demonstrate the unique advantages of LAMARL. Ablation studies show that the prior policy improves sample efficiency by an average of 185.9% and enhances task completion, while structured prompts based on Chain-of-Thought (CoT) and basic APIs improve LLM output success rates by 28.5%-67.5%. Videos and code are available at https://windylab.github.io/LAMARL/
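The abstract describes a two-module pipeline: an LLM first generates a prior policy and a reward function, which MARL training then consumes. A minimal, hypothetical sketch of that hand-off is shown below; the LLM call is stubbed with a hard-coded string, and all function names (`query_llm_for_reward`, `build_reward`) are illustrative, not the paper's actual API.

```python
# Hypothetical sketch of LAMARL's module 1 -> module 2 hand-off:
# the LLM emits a reward function as source code, which is compiled
# and handed to the MARL trainer. The LLM call is stubbed here.

def query_llm_for_reward(task_description: str) -> str:
    # Stand-in for a real CoT-prompted LLM call; returns Python source.
    return (
        "def reward(agent_pos, target_pos):\n"
        "    # Dense shaping: negative Euclidean distance to the target.\n"
        "    dx = agent_pos[0] - target_pos[0]\n"
        "    dy = agent_pos[1] - target_pos[1]\n"
        "    return -((dx * dx + dy * dy) ** 0.5)\n"
    )

def build_reward(source: str):
    # Compile the LLM-generated source and extract the reward callable.
    namespace = {}
    exec(source, namespace)
    return namespace["reward"]

reward_fn = build_reward(query_llm_for_reward("shape assembly"))
print(reward_fn((1.0, 2.0), (1.0, 2.0)))  # agent on target -> 0.0
print(reward_fn((0.0, 0.0), (3.0, 4.0)))  # distance 5 -> -5.0
```

In a full pipeline this compiled `reward_fn` would replace the handcrafted reward inside the MARL update loop, which is what lets the framework skip manual reward tuning.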
Problem

Research questions and friction points this paper is trying to address.

Enhances MARL sample efficiency via LLM integration
Automates policy and reward generation using LLMs
Improves multi-robot cooperation without manual tuning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrates MARL with LLMs for efficiency
Automates policy and reward generation
Uses structured prompts to enhance outputs
Guobin Zhu
Beihang University
reinforcement learning, multi-robot system
Rui Zhou
Wenkang Ji
WINDY Lab, Department of Artificial Intelligence, Westlake University, Hangzhou, China
Shiyu Zhao
WINDY Lab, Department of Artificial Intelligence, Westlake University, Hangzhou, China