CollabBench: Benchmarking and Unleashing Collaborative Ability of LLMs with Diverse Players via Proactive Engagement

📅 2026-06-04

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Large language models still struggle with grounded interaction and behavioral execution in realistic, immersive human-agent collaboration. To address this, this work proposes CollabBench, a training and evaluation framework for collaborative agents that combines diverse player personas, proactive engagement, and a hybrid reward, moving beyond conversation-level collaboration. Built on the extended CWAH-MultiPlayer and Cook-MultiPlayer environments, the framework unifies reasoning, communication, and action through a player behavior simulation pipeline and agentic rollout training. In the reported experiments, the trained models outperform their base models, with 19.5% higher task efficiency and 24.4% better affective performance; further analysis reveals key limitations of current models in collaborative settings.

📝 Abstract

While LLM-based agents excel at individual tasks, effective collaboration with realistic human partners remains challenging. Most existing conversation-level studies of collaboration lack grounded interaction and behavioral execution, motivating the need for cooperative game environments that enable contextualized and immersive collaboration. To this end, this paper proposes CollabBench, a benchmark for evaluating and training collaborative agents in cooperative games. CollabBench features a Diverse Player Profile Simulation pipeline to model varied player behaviors, and a Collaborative Agentic Training paradigm that unifies reasoning, communication, and action via agentic rollouts, optimized with a hybrid reward that balances task efficiency and affective adaptation. We further extend classic environments to CWAH-MultiPlayer and Cook-MultiPlayer for systematic evaluation under diverse personalities. Experiments with efficiency and affective metrics show that our trained models outperform base models, achieving 19.5% higher efficiency and 24.4% improved affective performance. Further analysis reveals key collaborative limitations of existing models and offers insights for future collaborative training.
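
To make the idea of a hybrid reward concrete, the sketch below blends an efficiency term with an affective term using a single weight. This is only an illustration: the abstract does not give the reward's actual form, so the linear blend, the `EpisodeOutcome` fields, the step-budget normalization, and the `weight` parameter are all assumptions rather than the paper's method.

```python
from dataclasses import dataclass


@dataclass
class EpisodeOutcome:
    """Hypothetical summary of one collaborative episode (not from the paper)."""
    steps_used: int        # steps the team took in the episode
    step_budget: int       # maximum steps allowed (assumed normalizer)
    task_completed: bool   # whether the cooperative goal was reached
    affect_score: float    # assumed partner-affect score in [0, 1]


def efficiency_reward(outcome: EpisodeOutcome) -> float:
    """Assumed efficiency term: fraction of the step budget left unused."""
    if not outcome.task_completed:
        return 0.0
    return max(0.0, 1.0 - outcome.steps_used / outcome.step_budget)


def hybrid_reward(outcome: EpisodeOutcome, weight: float = 0.5) -> float:
    """Assumed linear blend of the efficiency and affective terms."""
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    return weight * efficiency_reward(outcome) + (1.0 - weight) * outcome.affect_score
```

With `weight` near 1 the agent is rewarded almost purely for finishing quickly, while values near 0 favor adapting to the partner; a balanced setting trades the two off.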

Problem

Research questions and friction points this paper is trying to address.

collaborative ability

large language models

human-AI collaboration

cooperative games

diverse player behaviors

Innovation

Methods, ideas, or system contributions that make the work stand out.

Collaborative Benchmarking

Diverse Player Simulation

Agentic Rollouts